With the help of a Firefox extension, Recoll can index the Internet pages that you visit. The extension has a long history: it was initially designed for the Beagle indexer, then adapted to Recoll and the Firefox XUL API. The current version of the extension is located in the Mozilla add-ons repository, uses the WebExtensions API, and works with current Firefox versions.
The extension works by copying visited Web pages to an indexing queue directory. Recoll then processes each queued file: it stores the data in a local cache, indexes it, and removes the file from the queue.
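The order of these steps can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Recoll's implementation: `process_queue`, the in-memory `cache` and `index` dictionaries, and the file name `page1` are all hypothetical stand-ins for the real queue directory, cache file and index.

```python
import tempfile
from pathlib import Path


def process_queue(queue_dir: Path, cache: dict, index: dict) -> int:
    """Toy model of the queue handling: cache, index, then remove each file."""
    processed = 0
    # sorted() materializes the listing, so deleting files while looping is safe.
    for page in sorted(queue_dir.iterdir()):
        if not page.is_file():
            continue
        text = page.read_text(encoding="utf-8")
        cache[page.name] = text                  # 1. keep a copy in the local cache
        for word in set(text.lower().split()):   # 2. index the content
            index.setdefault(word, set()).add(page.name)
        page.unlink()                            # 3. only then remove the queue file
        processed += 1
    return processed


def demo() -> tuple:
    """Run the toy model on a throwaway queue directory."""
    with tempfile.TemporaryDirectory() as tmp:
        queue = Path(tmp)
        (queue / "page1").write_text("Recoll indexes visited pages", encoding="utf-8")
        cache, index = {}, {}
        count = process_queue(queue, cache, index)
        remaining = list(queue.iterdir())
    return count, cache, index, remaining
```

The point of the model is that the queue file is removed last, so the cached copy is the only one that remains once a page has been processed.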
The local cache is not an archive
As mentioned above, a copy of the indexed Web pages is retained by Recoll in a local cache (from which data is fetched for previews, or when resetting the index). The cache is not changed by an index reset, just read for indexing. The cache has a maximum size, which can be adjusted from the Index configuration / Web history panel (webcachemaxmbs parameter in recoll.conf). Once the maximum size is reached, old pages are erased to make room for new ones. The pages which you want to keep indefinitely need to be explicitly archived elsewhere. Using a very high value for the cache size can avoid data erasure, but see the Recoll 'Howto' page on Web indexing for more details and gotchas.
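The consequence of a size limit can be seen in a minimal sketch of an oldest-first bounded cache. It only illustrates the behaviour described above and makes no claim about Recoll's actual cache format. The class `BoundedPageCache` and its methods are hypothetical, and replacing an existing URL with its newer copy is a simplification chosen for this sketch.

```python
from collections import OrderedDict


class BoundedPageCache:
    """Toy size-limited cache: the oldest entries are erased to make room."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self.pages = OrderedDict()  # url -> bytes, oldest entry first

    def add(self, url: str, data: bytes) -> None:
        if len(data) > self.max_bytes:
            raise ValueError("page larger than the whole cache")
        if url in self.pages:  # sketch-only choice: a newer copy replaces the old one
            self.used -= len(self.pages.pop(url))
        while self.used + len(data) > self.max_bytes:
            _, old = self.pages.popitem(last=False)  # erase the oldest page
            self.used -= len(old)
        self.pages[url] = data
        self.used += len(data)
```

With a 10-byte limit, adding two 5-byte pages and then a 3-byte page erases the first page. This is why anything that must be kept indefinitely has to be archived outside the cache.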
The visited Web pages indexing feature can be enabled on the Recoll side from the GUI Index configuration panel, or by editing the configuration file (set processwebqueue to 1).
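For those who prefer to script the configuration edit, the following sketch sets both Web-related parameters in a recoll.conf-style text. It assumes the usual `name = value` line syntax with `#` comments and `[section]` headers, and it ignores continuation lines. The helper `set_param`, the sample `topdirs` line and the value 200 (megabytes, going by the parameter name) are illustrative assumptions, not recommended settings.

```python
def set_param(conf_text: str, name: str, value: str) -> str:
    """Set `name = value` in the top-level part of a recoll.conf-style text."""
    lines = conf_text.splitlines()
    new_line = f"{name} = {value}"
    insert_at = len(lines)
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("["):      # first section header ends the top level
            insert_at = i
            break
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == name:
            lines[i] = new_line           # replace an existing assignment
            return "\n".join(lines) + "\n"
    lines.insert(insert_at, new_line)     # otherwise add it before any section
    return "\n".join(lines) + "\n"


conf = "# example recoll.conf\ntopdirs = ~/Documents\n"
conf = set_param(conf, "processwebqueue", "1")
conf = set_param(conf, "webcachemaxmbs", "200")  # 200 is an arbitrary example value
print(conf)
```

Editing the file by hand or through the GUI panel achieves the same result; the helper is only useful when the setting has to be applied automatically.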
The Recoll GUI has a tool to list and edit the contents of the Web cache.
The recollindex command has two options to help manage the Web cache:
--webcache-compact
will recover the space from erased entries. It may need to use twice the disk space currently needed for the Web cache.
--webcache-burst destdir
will extract all current entries into pairs of metadata and data files created inside destdir.
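The two options can be wrapped in a small helper when cache maintenance is run from a script. This is a sketch only: `webcache_command` and `run` are hypothetical names, the destination directory is whatever you choose, and actually running the command requires `recollindex` to be installed and on the PATH.

```python
import subprocess


def webcache_command(action: str, destdir: str = "") -> list:
    """Build the recollindex argument list for a Web cache maintenance action."""
    if action == "compact":
        return ["recollindex", "--webcache-compact"]
    if action == "burst":
        if not destdir:
            raise ValueError("burst needs a destination directory")
        return ["recollindex", "--webcache-burst", destdir]
    raise ValueError(f"unknown action: {action}")


def run(cmd: list) -> int:
    """Run the command and return its exit status (requires Recoll installed)."""
    return subprocess.run(cmd, check=False).returncode
```

For example, `run(webcache_command("burst", "/tmp/webcache-dump"))` would extract the entries into that directory, where the path is an arbitrary example. Since compacting may need twice the disk space currently used by the cache, checking free space before running it is prudent.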
You can find more details on Web indexing, its usage and configuration in a Recoll 'Howto' entry.