Personal web archives; a report

My goal over the past spring and summer was to enable people to store web pages and create their personal web archives. Hereby an overdue report of six months of undisturbed coding.

I will start with an overview of the approach and architectural design as it evolved along the way, then elaborate on what has been implemented thus far.

Approach

In this project, the angle to archiving differed a bit from established projects, such as the widely respected Internet Archive. These archives conceptualise the web as a global living library, that (desperately) needs a permanent record to counteract its continuous decay and keep the previous content of web sites available. While such a versioned repository of nearly the whole web is a very important and useful endeavour, I have tried to address three other aspects:

Firstly, for more resilient and self-organised archiving, it would be great if people can own and keep the copies of the documents they care about, thereby creating a highly redundant, localised, distributed storage.
Secondly, acknowledging the web consists not only of public, static pages, people should be able to archive the web as it looked from their perspective, including private or personalised views (accordingly, archived content should be private by default).
Thirdly, I'd like to take into account the fact that many web 'pages' are in fact more like an application than a document, influenced by user interaction and other resources it fetches from the internet.

With these aspects in mind, the goal has not been to crawl and capture web pages with all their complexity and interactivity, but rather to enable people to store any page exactly as they see it in their browser. Not recording the web as if it were a living thing, but just making web pages more durable objects.

Even a personalised, ephemeral view that was generated on the fly (e.g. a hotel booking confirmation) is considered a document that one could store just like a paper letter. And any interactive application (e.g. a zoomable map, or simply a page with pop-ups) is considered a sequence of document variations, any of which can be stored as a simple, non-interactive document by snapshotting the application's current presentation.

By keeping viewed documents as you browse, you could page by page grow your personal web archive; a personal selection of things worth keeping, not unlike one's collection of books on a shelf.

Now an interesting additional concept is that the snapshots of web pages could themselves again be made part of the web. You could then access your archive with your browser, which could let you access both the live and archived versions of a page when visiting a link or bookmark. Moreover you could share parts of your archive on the web, so others could access them too.

Hopefully, the fragility of the web could be somewhat compensated for by keeping your own copies of valuable pages and querying others for theirs, instead of relying on the original publishers to keep hosting them forever.

Designing for individuals

Throughout this project (which, to be clear, is far from finished), the design philosophy has been to think bottom-up: primarily, the purpose is to create tools for the needs of individuals. Then, only as a secondary, long-term objective, a sizeable distributed web archive might emerge from that through people's collective behaviour.

Another strong intention has been to let people use the various means they may already have, mixing and matching different tools as it suits them. Some may want to save web pages using our tool, but want to do full-text search using their usual desktop search tool. They may want to share snapshots by email, rather than via their personal archive.

The challenge is to not design a monolith catering for pre-envisioned scenarios, but make primitive tools that are easily composable with others to create different workflows. While working on this project, I realised the desired functionality could nicely be divided in three independent but complementary tools:

a browser extension for storing web pages,
a web archive server hosting such snapshots,
a(nother) browser extension to query such archives.

Tool 1: Storing a web page as a file

To start with, it may be best to forget the idea of growing a personal archive, and just wonder how a browser (extension) could let an individual keep a single page. As explained, the idea is to capture the page's state on the moment the user chooses to save it, but another question is how to deliver this capture in a useful form.

As the principal unit of knowledge management on computers is a file, having each snapshot consist of a single file would enable people to use decades of existing software for organising them, sending them to others, making back-ups, and so on.

A file feels almost tangible when compared to an object that is kept in the custom database of the application that created it (it's sad how the latter is becoming the norm). This is even more so if the file can be opened by various existing applications. Since most people's computers, including 'smart'phones, can display an HTML file, this format felt like the best choice for this project (even though formats like WARC or MHTML may seem more likely choices for web archives). All required resources (images, stylesheets...), as well as any meta-data (date of snapshot, the page's URL..) will have to be stored inside the file itself. Luckily, with HTML this is possible; details about that further below.

An encouraging observation is that saving static snapshots of web pages as files is something people already do, often with the limited means that are commonly available: making screenshots or saving them as PDFs. Even if it is a low-quality copy that lost much of the page's layout, structure and links, people simply want to keep hold of what they saw, locally. When asking around among journalists, academics, techies, and other frequent users of the web, it keeps amazing me how many of them do this. Saving a page as an HTML file, thus retaining its natural format, seems a small but useful step forward.

Save Page As...

Time for a small intermezzo, as when discussing the creation of a browser extension that just saves pages as html files, a savvy reader probably realises such a feature exists already: browsers have always had a button to save the currently viewed page.

But almost nobody uses this feature, because it has been mostly dysfunctional since medieval times (perhaps back then they worked fine, because the pages where primitive too). At least in Firefox and Chromium*, saving a page results in an HTML file, next to a folder with (some of) its subresources. In most situations, the result is..

hard to move around, or e.g. attach to an email, as the file and folder must remain together.
of bad quality: often not all subresources are stored, so pages look malformed as they lack some images or stylesheets.
unreliable: some subresources are not captured, but are still fetched from their original source, thus falsely suggesting they were stored when first viewing the snapshot.
unpredictable: when viewed, the page's scripts are re-executed in the already rendered page, which might lead to.. anything really, depending on whether the script was written with presumptions about the page content.
possibly insecure: executing scripts might lead to security problems, or leak private information to the internet.
lastly, the saved file has no notion of where it came from, dissociating it from its original URL.

Boring as it sounds, while I started by thinking how to create a decentralised web archive, I slowly realised the most valuable step towards it might just be to fix the 'Save Page As...' button. Hopefully browsers will one day give this forgotten feature some love again, and until then browser extensions can help fill the gap.

Tool 2: Hosting an archive

Assuming we have mostly solved the problem of saving web pages, such that people have the ability to turn them into tangible files, we can build the next level of functionality. We can make a tool that interprets a bunch of snapshot files as one's personal web archive.

This archive would be a static web server hosting one's snapshots. Moreover, it would support Memento, a protocol supported by many archives to enable users to query the archive with a URL and a date, and receive a redirect to the snapshot of that URL closest to that date (if any exists).

The archive could be running locally on one's own computer and be solely for individual use, or on a remote server with the option to also provide other people access. Besides the basic features, it could provide (human or programming) interfaces to, for example, add new snapshots to the archive, control their access, or perform full-text search on them.

Tool 3: Querying the archive(s)

With the browser's augmented capability to save web pages, and an archive server for providing access to the snapshots, the remaining piece of the puzzle returns our attention to the browser, in order to make it automatically connect to and make use of the archive. This could again be achieved with a browser extension; either combined with the other one, or developed and packaged independently (modularity!).

Whenever the user is about to visit a web page, the browser would have the possibility fetch it from the live web, but now also to query the archive for previous snapshots of it. It could show a page's most recent snapshot as a fallback whenever the live page is unavailable; or if preferred even show the snapshot by default, for 'offline-first' browsing.

Beyond one's own personal archive, one could connect to multiple archives, which can be any archives that support Memento, and query any or all of them to find a snapshot. This gets us back to the vision of a distributed, emergent, collective web of archives.

Implementation

The above described the desired architecture as my idea of it solidified during the project. The current implementation still lags a little behind the target, but the basics are there. I implemented the first two tools that were described: the browser extension that makes snapshots of pages, and a minimal form of the archive server that can host them. I did not implement the third tool yet, but one can use already existing Memento clients, such as the Memento Time Travel browser extension, to get the basic functionality.

The browser extension

As I took off from the already existing WebMemex code, the browser extension I developed does more than just giving a snapshot file. It also saves these files in its internal storage, so that it can access and search through them, to provide some of the personal archive features already.

However, the storage inside a browser extension is a silo, inaccessible from one's file manager or other applications. It is not obvious to a user how to do basic things like making a back-up of the files or moving them to another computer; I would have to reinvent those wheels specifically for the extension. These issues were strong motivators to come up with the above architecture design consisting of multiple, composable tools.

The main task of the browser extension should be snapshotting a web page, and most of the effort has gone into coding this part as well as possible. Because there are so many cases where snapshotting a web page would be useful, this code has been developed as a separate, reusable module called freeze-dry (I actually started on freeze-dry last year, but rewrote it from scratch).

The WebMemex browser extension is just one possible application that can be built around freeze-dry. It could also be integrated into browsers (e.g. replace the "Save Page As..." button!), be used for automatic capturing of pages in headless browsers, and could even be used on a virtual dom like jsdom.

I like to see code modularisation as the pursuit of composability at a second level: while users can mix and match compatible tools, developers can mix and match modules to easily make a large variety of those tools.

How freeze-dry works

As it is the main deliverable of the project, I will elaborate a bit on the internals of freeze-dry. On a first glance, saving a web page seems a simple task. It just has to read the current DOM as an HTML string, so a good first approximation to freeze-dry would be:

const freezeDry = document => document.documentElement.outerHTML

But, as one user expressed in a spontaneous email: "I was surprised and saddened by just how much work it takes to do what freeze-dry does!". It gets more complex because to get the a full snapshot of the page, freeze-dry also has to fetch and inline external images, stylesheets, etcetera as data: urls (i.e. base64-encoded strings); and recurse into the stylesheets, inlining their fonts, images, and imported stylesheets; recurse into iframes; convert relative URLs into absolute ones; remove scripts and event handlers; ...add a couple more steps like these, and we easily end up with over 2k lines of code (for full details on how freeze-dry works, look at the source code).

The result of freeze-drying is a snapshot of the page in the form of a single string of HTML that is..:

self-contained: all subresources, such as images and stylesheets, are inside the snapshot (which can be several megabytes).
context-free: not tied to its original location, i.e. it has no relative URLs.
static: scripts are removed, to obtain a document one can view, reuse, annotate, and perhaps even edit without surprises.

Put the resulting string in a file, and your freeze-dried page can be preserved for many years to come.

Beyond the required changes to preserve the page, freeze-dry tries to make as few changes as possible. One exception to this is that it adds two pieces of metadata to the page, to remember its original URL and the moment the snapshot was made. As I could not find an existing standard for expressing this information, I chose to embed the equivalent HTTP headers as they are defined for Memento protocol, so a snapshot will contain two lines like this:

<meta http-equiv="Memento-Datetime" content="Sat, 18 Aug 2018 18:02:20 GMT">
<link rel="original" href="https://example.com/main/page.html">

One other small addition to the HTML tree is made in order to ensure the snapshot is self-contained: freeze-dry adds a <meta> tag with a Content Security Policy (CSP), that instructs browsers not to fetch or connect to any outside resources which the snapshot might accidentally still refer to (e.g. if freeze-dry neglected to inline some yet unknown or non-standard type of subresource). This CSP solves two issues: firstly, internet connectivity could pose a security or privacy issue, especially as one would expect to be viewing a document locally. But, perhaps more importantly, the CSP ensures reliability: one can be sure a snapshot will still look the same when one is offline, or when the links to its subresources have rotted away.

Personal archive server

To provide the second tool, that enables people to host their snapshots in a personal web archive, I decided to prototype a simple web server that can easily be run on people's usual personal servers. Unfortunately, personal servers are not a usual thing at all. I found however that NextCloud comes close enough to use it for building a proof of concept.

I made two small NextCloud apps:

raw allows using NextCloud as a static web server. Because by itself, NextCloud does not serve user files raw. It only provides them as a download or displays them within its own user interface (where for an HTML file it only shows the source code).
memento makes the server speak the Memento protocol, so one can ask it for a snapshot of a given URL near a given date. Upon receiving a Memento query, the app searches (linearly, for now) through all HTML files one has stored inside their NextCloud, and looks if any of them are snapshots of the requested URL (i.e., if it has the requested URL in a <link rel="original" ...> tag). If there are multiple matches, it looks at the snapshot dates to pick the closest match. It then redirects the client to the 'raw' app to serve that snapshot.

While this is a very minimal prototype of a personal archive server, I do like that it follows the ideal of modularity and composability. For instance, the archive is quite independent from the browser extension that saves the pages (their only coupling being the particular metadata format of the snapshots, and support for other snapshot formats could easily be added). Likewise, to query the server, any Memento client should work. To get snapshots onto the server, any of NextCloud's options can be used, such as syncing with a local folder or dragging and dropping snapshot files onto its web interface.

Way forward

Much work is remaining on the way to personal web archiving, but I hope this project has made some baby steps forward; small contributions towards changing the way we use, along with the way we conceptualise the web. I would be very satisfied if some more people would get used to thinking of…

…documents as being different from services/applications, with web pages often being an awkward mix.
…web pages as being fetched from the web, instead of being viewed on the web.
…web pages as being storable, tangible, ownable things.
…pages stored on their computer as still being part of the web.
…fetching a page as a way to get the latest edition, when their own version may be outdated.
…fetching a page from anybody who has it, rather than always from the original publisher.

Of course, much of the web today won't fit these concepts, but that is exactly the first point: we should distinguish between live, interactive services (e.g. a web shop or SaaS application) and web pages that are primarily a document which we can refer to by its URL and expect to remain available. If we keep developing the web mainly for the living services, but meanwhile keep using the web as if it were a global library of interconnected literature, we would remain stuck in the digital dark age where knowledge decays within a decade.

Work ahead

I will wrap up with some thoughts on what may be worth working on going forward, besides the obvious steps to improve freeze-dry and related tools. Perhaps most of all, my hope is to increase the composability of tools in general, to give users more freedom in choosing their tools and workflows.

Insistence on modularity and composability may yield fruits in the larger development ecosystem, but it may not be the easiest way to get quick and visible results. For example, I would have liked to support direct communication between the browser extension and archive server, to store newly created snapshots directly on the server instead of inside the extension, but did not find a satisfactory way to do this without resorting to a custom or obscure protocol.

In fact, the initial reason to try use NextCloud was that it supports WebDAV, a simple protocol enabling us to PUT files onto a server, though I did not get around to implement that in the client. Also WebDAV would be only part of the solution, as aspects such as configuration and authentication also need to be addressed to provide a smooth user experience. UX for composable tools is hard.

To work together, the browser extension and the server apps largely rely on the capabilities of the platforms around them. And that is probably the way to go. To make convenient, composable tools, most effort has to be put not into the tools themselves but into the platforms they run on and their protocols and interfaces.

For example, we could improve the browser so that one can drag a file directly from within the browser to one's server, instead of having to first save it in a folder. Or we could even support automating such workflows with something like webintents (but, another insight during this project: first make things possible, then make them efficient).

I could come up with a whole bunch of other things that would be helpful for this project and others like it. I will just list a couple of broad directions for inspiration:

Support Memento in browsers, servers, and other applications.
Support for reading MHTML and WARC files in browsers, or develop another self-contained web package format.
Improve caching in browsers, changing the UI to differentiate between a local cached version and the 'live' online version.
Make web pages as documents instead of applications; make them usable without javascript.
Invent and define a limited form of javascript that can be reasoned about and archived reliably.
Make personal, read/write web servers with standardised protocols. WebDAV is a primitive start; RemoteStorage adds some features to it; Solid might take things further still.
Work on 'named data' approaches to the web; on projects like IPFS and Dat&Beaker, so documents can be fetched from any source, and without needing to trust the source.
Last but not least, work out ways to do protect people's privacy while querying others for resources.

While for the moment I am occupied elsewise, I hope to contribute to some of these directions in the future, and that whoever read all the way to the end of this post may have an increased motivation to do so too.