Readability
What do you do when you want to keep password-protected online content offline?
For the past year or two, a shortcut to add to iA Writer worked fantastically! Except that, all of a sudden, it just stopped working on the sites I was using it on.
Naturally, that requires me to figure out how to do it myself... and on a Mac.
Step 1. Look up HTML to Markdown
And you'll find a ton of examples of libraries for PHP, JavaScript, and Python. So, maybe a serverless function?
Step 2. Go too far down a rabbit hole only to realize it doesn't work.
It's a challenge to pass raw HTML to a serverless function. It's more common to pass a URL and then fetch it within the function (fewer bits over the wire). But then how do you get past the password protection? You don't. Or at least the effort to do so is not worth it.
Step 3. Node CLI?
I found a couple of tutorials for serverless functions that retrieve the contents of a URL, run it through the Readability library from Mozilla, and then have pandoc convert from HTML to Markdown. Sounds reasonable.
Except that a Node CLI is clearly dark magic that is impossible for mere humans to comprehend.
Step 4. Sweet, sweet Rust
A quick !crates readability (because DuckDuckGo bangs are awesome) and there's a Rust version! Finally, something that makes sense. And the final result is barely worth an entire Cargo project.
// Written against the clap 3.0 beta derive API; later clap releases renamed `Clap` to `Parser`.
use clap::Clap;
use readability::extractor::extract;
use std::fs::File;
use std::io::{self, BufReader, Write};
use std::path::PathBuf;

/// HTML to Readability CLI
#[derive(Clap)]
#[clap(version = "1.0")]
struct Args {
    /// Path to the saved HTML source
    file: PathBuf,
    /// URL the HTML came from
    url: url::Url,
}

fn main() -> Result<(), anyhow::Error> {
    let opts: Args = Args::parse();
    let file = File::open(&opts.file)?;
    let mut reader = BufReader::new(file);
    // Strip the page down to its readable content
    let product = extract(&mut reader, &opts.url)?;

    // Emit the title, the article body, and a link back to the source
    let stdout = io::stdout();
    let mut handle = stdout.lock();
    handle.write_all(format!("<h1>{}</h1>", product.title).as_bytes())?;
    handle.write_all(product.content.as_bytes())?;
    handle.write_all(
        format!(r#"<p>Source: <a href="{}">{}</a></p>"#, opts.url, product.title).as_bytes(),
    )?;
    Ok(())
}
That's it. If you go with stdlib error handling and arguments, it's probably even fewer lines.
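For the curious, here's roughly what the stdlib-only route might look like. This is a minimal sketch, not the tool itself: `parse_args` is a made-up helper, the URL stays a plain `String` because std has no URL type (so you lose the validation `url::Url` gives you), and the readability call is left as a comment since that part still needs the crate.

```rust
use std::env;
use std::error::Error;
use std::fs::File;
use std::io::{self, BufReader, Read, Write};
use std::path::PathBuf;

/// Same two positional arguments as the clap version.
struct Args {
    file: PathBuf,
    // Assumption: kept as an unvalidated String, since std has no URL type.
    url: String,
}

/// Hypothetical helper: expects exactly `<program> <file> <url>`.
fn parse_args<I>(mut args: I) -> Result<Args, Box<dyn Error>>
where
    I: Iterator<Item = String>,
{
    let program = args.next().unwrap_or_else(|| String::from("reader"));
    match (args.next(), args.next(), args.next()) {
        (Some(file), Some(url), None) => Ok(Args {
            file: PathBuf::from(file),
            url,
        }),
        _ => Err(format!("usage: {} <file> <url>", program).into()),
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let opts = parse_args(env::args())?;

    let mut html = String::new();
    BufReader::new(File::open(&opts.file)?).read_to_string(&mut html)?;

    // Placeholder: the real tool would hand a reader and a parsed URL to
    // readability's `extract` here instead of just reporting the size.
    let stdout = io::stdout();
    let mut handle = stdout.lock();
    writeln!(handle, "read {} bytes for {}", html.len(), opts.url)?;
    Ok(())
}
```

`Box<dyn Error>` stands in for `anyhow::Error` here, so `?` still works on both the I/O errors and the usage message.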
Alfred/Applescript Magic
So that's wonderful. We've got our stripped HTML, and piping to pandoc is easy. But what's the most convenient way to get at this? Alfred.
command + space
html2md + enter
Done 💥
And the magic behind it all
content=$(osascript -e 'tell application "Safari" to return the source of front document')
url=$(osascript -e 'tell application "Safari" to return the URL of front document')
cd <reader location>/target/release/
./reader <(echo "$content") "$url" | /usr/local/bin/pandoc -f html -t commonmark-raw_html --wrap none
# <() is really cool. It exposes a command's output as a file
This retrieves the HTML and URL from the current tab in Safari and runs the readability script, then pipes the result to pandoc. The output then gets passed to iA Writer and saved to a file (this is only necessary because I love that iA Writer auto-titles the file).
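If you'd rather skip the shell pipe, the pandoc step could in principle live inside the Rust binary. The sketch below is one way that might look, using only std: `pipe_through` is a hypothetical helper (not part of the tool above), it assumes `pandoc` is on your `PATH`, and the flags are simply copied from the script.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Hypothetical helper: feed `input` to a child process on stdin and
/// collect whatever it prints on stdout.
fn pipe_through(program: &str, args: &[&str], input: String) -> io::Result<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write on a separate thread so a large page can't deadlock against
    // the child filling up its stdout pipe. Dropping `stdin` when the
    // closure ends signals end-of-input to the child.
    let mut stdin = child.stdin.take().expect("stdin was requested as piped");
    let writer = thread::spawn(move || stdin.write_all(input.as_bytes()));

    let output = child.wait_with_output()?;
    writer.join().expect("writer thread panicked")?;

    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("{} exited with {}", program, output.status),
        ));
    }
    String::from_utf8(output.stdout).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    // Stand-in for the title + content the reader prints.
    let html = String::from("<h1>Title</h1><p>Body</p>");
    let markdown = pipe_through(
        "pandoc",
        &["-f", "html", "-t", "commonmark-raw_html", "--wrap", "none"],
        html,
    )?;
    print!("{}", markdown);
    Ok(())
}
```

Honestly, the shell version is shorter, so this only makes sense if you want a single binary to hand to Alfred.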