An upcoming event at the Reynolds Journalism Institute, Dodging the Memory Hole: Saving Born-digital News Content (a follow-up to their Newspaper Archive Summit), looks like a great opportunity to bring together people who aren’t regularly in contact to think about preserving online news content. As the website explains:
One recent survey found that most American media enterprises fail to adequately process their born-digital news content for long-term survival. This potential disappearance of news, birth announcements, obituaries and feature stories represents an impending loss of cultural heritage and identity for communities and the nation at large: a kind of Orwellian “memory hole” of our own unintentional making.
The situation with news websites reminds me in many ways of the situation with university presses: each publisher is so preoccupied with trying to make ends meet that it can’t afford to invest in solutions that will help preserve its content. Preserving content is not only altruistic (ensuring that it’s available for historians) but also has the potential to increase revenue: “monetizing your backlist” by keeping your products available for sale for longer and/or repackaging them in new ways.
However, when it comes to digital preservation, news organizations that publish online have a significant advantage over university presses: they generally already use a content management system not only for publishing but for managing all stages of the editing process. Their content is therefore already in a rich, fairly standard format within this CMS. Compare this with university presses, where the only format common to all of a press’s products is most often PDF.
An organization operating a CMS has an opportunity to integrate archiving into the act of publishing or updating content. Instead of archiving after publication (as some scholarly publishers do when delivering content to Portico), the CMS could automatically send a copy of each newly created or newly revised webpage to a trusted archive, which would make the content available only in the case of a “trigger event”. Such an archive would be a third party trusted by competing publishers, as Portico is, and would need an automated mechanism for receiving and archiving content. Unlike pure web crawling efforts, this approach could catch every revision of an article, and it could include content behind a paywall.
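To make the idea concrete, here is a minimal sketch in Python of what a deposit-on-publish hook might look like. Everything in it is an assumption for illustration: the names DarkArchive, Deposit, and on_publish are hypothetical, the archive is an in-memory stand-in rather than a real service, and no actual CMS exposes this exact hook. The point is only that each publish or update event produces a new, checksummed revision record instead of overwriting the last one.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Deposit:
    """One archived revision of one page (hypothetical record layout)."""
    url: str
    revision: int
    html: str
    sha256: str
    deposited_at: str


@dataclass
class DarkArchive:
    """In-memory stand-in for the trusted third-party archive."""
    _store: dict = field(default_factory=dict)

    def deposit(self, url: str, html: str) -> Deposit:
        # Append rather than overwrite, so every revision is retained.
        revisions = self._store.setdefault(url, [])
        record = Deposit(
            url=url,
            revision=len(revisions) + 1,
            html=html,
            # The checksum lets the archive verify the content later.
            sha256=hashlib.sha256(html.encode("utf-8")).hexdigest(),
            deposited_at=datetime.now(timezone.utc).isoformat(),
        )
        revisions.append(record)
        return record

    def revisions(self, url: str) -> list:
        return list(self._store.get(url, []))


def on_publish(archive: DarkArchive, url: str, html: str) -> Deposit:
    """Hypothetical CMS hook, called on every publish or update event."""
    return archive.deposit(url, html)


if __name__ == "__main__":
    archive = DarkArchive()
    story = "https://news.example/story-1"  # placeholder URL
    on_publish(archive, story, "<p>First version</p>")
    on_publish(archive, story, "<p>Corrected version</p>")
    for rec in archive.revisions(story):
        print(rec.revision, rec.sha256[:12])
```

A real deposit would go over the network to the archive and would need authentication and retry handling, none of which is shown here.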
The staff at Portico do quite a bit to normalize the scholarly content they receive into a standard XML format suitable for preservation, but since there isn’t an equivalent standard, archival-quality format for online news content, the CMS would likely have to deliver the HTML source, plus embedded media files, “as is”. Still, HTML is far from the worst format for preservation (web browsers do an excellent job of rendering nearly any HTML document, even those created many years ago), and non-HTML content would be in no worse shape than if no one had saved it at all.
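Delivering the HTML plus embedded media means the CMS plugin has to work out which media files a page refers to. As a rough sketch, assuming the media are referenced through src attributes on a handful of common tags, Python’s standard html.parser module can collect them. The names MediaCollector and embedded_media are hypothetical, and this deliberately ignores srcset attributes, CSS backgrounds, and anything loaded by scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose src attribute usually points at an embedded media file.
MEDIA_TAGS = {"img", "video", "audio", "source", "embed"}


class MediaCollector(HTMLParser):
    """Collects absolute URLs of media embedded in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in MEDIA_TAGS:
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative references against the page's own URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.media_urls:
                    self.media_urls.append(absolute)


def embedded_media(html, base_url):
    """Return the de-duplicated media URLs in document order."""
    collector = MediaCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.media_urls


if __name__ == "__main__":
    page = '<p>Story</p><img src="/img/a.jpg"><video src="clip.mp4"></video>'
    for url in embedded_media(page, "https://news.example/2014/story.html"):
        print(url)
```

The resulting list could then be fetched and bundled alongside the HTML source in the same deposit.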
So we just need to create the trusted third-party archive and create plugins for automatic deposit into that archive for the various CMSes in use. Easy!