Archiving News Articles

For historical purposes, I think it's valuable to archive newspaper articles. In the past, this was done quite simply with a pair of scissors. Cut out the article and put it into a scrapbook or a folder.

Nowadays, much of the news is found online, surrounded by advertising, tracking scripts, videos, audio, and other material that doesn't relate to the content itself. All of this makes it hard to create a single, standalone archive of a newspaper article. Here are a few tips for creating one.

Start With the Rendered HTML

When you're viewing the website, start by capturing the fully rendered HTML document. This can be done by running the following script in the JavaScript console.

copy(document.documentElement.outerHTML)

Run the script, open a text editor, and paste the HTML into the document. Name the file article.html. Then open article.html in the browser and see what it looks like.
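
One thing to watch for: the outerHTML of the html element does not include the page's doctype line, so the saved file may render in quirks mode and look slightly different from the original. A minimal sketch of a workaround follows. The helper name serializeDocument is hypothetical, and it assumes a simple doctype such as <!DOCTYPE html> (any public or system identifiers are ignored).

```javascript
// Hypothetical helper: rebuild the page source with the doctype line included.
function serializeDocument(doc) {
  // doc.doctype is null when the page declares no doctype at all.
  const doctype = doc.doctype ? `<!DOCTYPE ${doc.doctype.name}>\n` : '';
  return doctype + doc.documentElement.outerHTML;
}

// In the browser console: copy(serializeDocument(document))
```

Paste the result into article.html the same way as before.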

Another technique is to use the browser's "Save As" command to download a copy of the page to your computer. This also saves external files alongside the HTML, which is not ideal for a standalone archive, but it is a good starting point for creating one.

Prune Unneeded Tags

Now that you have article.html, you can try pruning different HTML tags, either manually or with a script. For example, running the following script in the JavaScript console strips all noscript, link, script, and iframe tags.

document.querySelectorAll('noscript,link,script,iframe').forEach(x => { console.log(x); x.remove() })

After running the script, re-run the HTML capture script, re-save the document, and check that it still looks OK. Iterate on this process until you've removed most of the unneeded markup.
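
While iterating, it can help to see how many elements a selector matches before removing anything. The sketch below tallies matches by tag name. The helper name countTags is hypothetical, and it only reads the page without changing it.

```javascript
// Hypothetical helper: count the elements matched by a selector, per tag name.
function countTags(doc, selector) {
  const counts = {};
  for (const el of doc.querySelectorAll(selector)) {
    const name = el.tagName.toLowerCase();
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// In the browser console:
// console.table(countTags(document, 'noscript,link,script,iframe'))
```

If a tag you expected to prune still shows up after a pass, that is a hint to adjust the selector.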

Monitor the Network

If you go to the "Network" tab of the developer console, you can see which external resources are being requested. If you see anything other than the HTML document, search the HTML for the tag that is requesting the external resource. You can then strip it out or replace it with inline content.
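
As a rough complement to the Network tab, the sketch below lists elements whose src or href attribute points at an external file, and shows one way to build a data: URL for inlining. Both helper names (externalReferences and toDataUri) are hypothetical. The scan is only an approximation: it does not look at CSS url() references, srcset attributes, or requests made by scripts, and it may report link tags that the browser never actually fetches.

```javascript
// Hypothetical helper: list elements whose src/href points at an external file.
function externalReferences(doc) {
  const found = [];
  // a[href] is deliberately excluded: plain hyperlinks are not fetched on load.
  for (const el of doc.querySelectorAll('[src], link[href]')) {
    const url = el.getAttribute('src') || el.getAttribute('href');
    // data: URLs are already inline, so they trigger no network request.
    if (url && !url.startsWith('data:')) {
      found.push({ tag: el.tagName.toLowerCase(), url });
    }
  }
  return found;
}

// Hypothetical helper: encode raw bytes as a data: URL that can replace a src.
function toDataUri(mimeType, bytes) {
  let binary = '';
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// In the browser console: console.table(externalReferences(document))
```

The bytes passed to toDataUri could come from fetch(url) followed by arrayBuffer(), wrapped in new Uint8Array(...), although cross-origin restrictions may block the fetch for some resources.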