Unix-like systems: indexing visited Web pages

With the help of a Firefox extension, Recoll can index the Internet pages that you visit. The extension has a long history: it was initially designed for the Beagle indexer, then adapted to Recoll and the Firefox XUL API. The current version, located in the Mozilla add-ons repository, uses the WebExtensions API and works with current Firefox versions.

The extension works by copying visited Web pages to an indexing queue directory, which Recoll then processes: each page is stored in a local cache, indexed, and finally removed from the queue.
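The cycle above can be sketched with plain shell operations. The directory names here are made up for illustration only: the real queue and cache locations are set by the Recoll configuration, and the indexing step itself is just a placeholder.

```shell
# Simulated queue -> cache -> index -> dequeue cycle (hypothetical paths).
workdir=$(mktemp -d)
queue="$workdir/webqueue"    # stand-in for the extension's queue directory
cache="$workdir/webcache"    # stand-in for Recoll's local page cache
mkdir -p "$queue" "$cache"

# 1. The browser extension drops a copy of a visited page into the queue.
printf '<html><body>hello</body></html>' > "$queue/page1.html"

# 2. The indexer stores the page data in the local cache...
cp "$queue/page1.html" "$cache/page1.html"

# 3. ...indexes it (placeholder step), then removes the file from the queue.
rm "$queue/page1.html"

ls "$queue" | wc -l    # queue is empty again
ls "$cache" | wc -l    # cache retains the copy
```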

The local cache is not an archive

As mentioned above, Recoll retains a copy of the indexed Web pages in a local cache (from which data is fetched for previews, or when the index is rebuilt). An index reset does not modify the cache: it is only read to re-create the index data. The cache has a maximum size, which can be adjusted from the Index configuration / Web history panel (webcachemaxmbs parameter in recoll.conf). Once the maximum size is reached, old pages are erased to make room for new ones. Pages which you want to keep indefinitely need to be explicitly archived elsewhere. Setting a very high value for the cache size can avoid data erasure, but see the 'Howto' page linked below for more details and caveats.

The visited Web pages indexing feature can be enabled on the Recoll side from the GUI Index configuration panel, or by editing the configuration file (set processwebqueue to 1).
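For reference, the relevant recoll.conf settings might look like the following (the size value is only an example; webcachemaxmbs is the parameter behind the GUI's maximum cache size setting):

```
# Process the Web page queue filled by the browser extension
processwebqueue = 1
# Maximum size of the local Web page cache, in megabytes (example value)
webcachemaxmbs = 500
```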

The Recoll GUI has a tool to list and edit the contents of the Web cache (Tools → Webcache editor).

The recollindex command has two options to help manage the Web cache:

  • --webcache-compact will recover the space from erased entries. It may need to use twice the disk space currently needed for the Web cache.
  • --webcache-burst destdir will extract all current entries into pairs of metadata and data files created inside destdir.
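As a sketch, the burst option can be combined with ordinary archiving tools to keep a permanent copy of cache contents before old pages are erased. The entry file names below are simulated stand-ins (the actual naming of the extracted pairs is not described here), and the recollindex call is shown as a comment so the script stays self-contained:

```shell
# Hypothetical workflow: extract the Web cache, then archive the result.
destdir=$(mktemp -d)
archive="$destdir.tar.gz"

# On a real system you would run:
#   recollindex --webcache-burst "$destdir"
# Here we simulate one extracted metadata/data pair so the script runs as-is:
touch "$destdir/entry1.meta" "$destdir/entry1.data"

# Archive the extracted pairs for long-term storage.
tar -C "$destdir" -czf "$archive" .
tar -tzf "$archive" | grep -c 'entry1'    # prints "2": one .meta, one .data
```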

You can find more details on Web indexing, its usage and configuration in a Recoll 'Howto' entry.