= Indexing visited Web pages with the Recoll Firefox extension

== Overview

The link:https://addons.mozilla.org/en-US/firefox/addon/recoll-we/[Recoll-WE] Firefox add-on downloads the Web pages you visit, for later caching and indexing by Recoll.

== History

The extension works with Recoll in a slightly convoluted way because it has a complicated history.

- It began its life as a complement to the Beagle indexer, and was then copied and modified to work with Recoll.
- This first version was based on the Firefox XUL API, and was almost fully rewritten to use the WebExtensions API when XUL was made obsolete, a few years ago. The new extension is largely based on code stolen from the *save-page-we* extension.

The current version works with Recoll 1.23.5 or newer.

The repository for the current extension code is hosted on link:https://framagit.org/medoc92/recoll-we[framagit].

== How things work

Some of the values mentioned below can be changed in the Recoll configuration, see xref:config[further down].

- The extension saves a copy of the visited page data and metadata in two files inside your Firefox Downloads directory (or a subdirectory, see the xref:config[configuration section]). The files are named something like _recoll-we-m-[url hash].rclwe_ for the metadata and _recoll-we-c-[url hash].rclwe_ for the data.
- When the Recoll indexer runs, it first processes the normal file system files, then, if Web indexing is enabled, it runs the `recoll-we-move-files.py` script (from the Recoll filters directory). The script moves the files from the Downloads directory to the Recoll Web indexing queue directory, normally `~/.recollweb/ToIndex`. This step was introduced with the switch to WebExtensions, to avoid modifying the indexer itself too much and to keep compatibility with the old extension.
- The indexer then runs the Web page indexing code proper. This:
  * Indexes the data.
  * Saves the data and metadata into the Recoll Web cache storage (by default `~/.recoll/webcache`), and deletes the queue files.

.The Recoll Web cache is not an archive
NOTE: Old content will be deleted and become unsearchable when new content needs space. The maximum cache size is xref:rconfig[configurable]. Pages that you want to archive permanently need to be saved elsewhere, as they will otherwise eventually disappear from the Recoll results. Recoll can index `.maff` files, which may be a better choice for archival use; see also the link:https://addons.mozilla.org/en-US/firefox/addon/save-page-we/[Save Page WE] Firefox extension.

.Turning the cache into an archive
NOTE: You could conceivably turn the Web cache into an archive by setting the maximum size absurdly high. While this will work, be aware that space will be wasted when a new version of a page is stored: storage for the old versions will be erased, but not reclaimed. It is possible to compact the cache by copying it, though: the code for this exists in a test driver (included in the source, but not built or packaged by default).

[[config]]
== Configuration

[[xconfig]]
=== Extension Configuration

By default, after installation, the extension will store all the pages which you visit. You can change this behaviour through configuration options in the add-on preferences page or through the context menu:

Download subdirectory:: Subdirectory of the Firefox Downloads directory where the extension will create its files. By default, this is empty, and the `recoll-we-move-files.py` script expects to find its files under `~/Downloads`. You can set a value to avoid cluttering the main Downloads directory (e.g. _recoll-we_). In this case, you also need to set the `webdownloadsdir` configuration variable (e.g. `webdownloadsdir = ~/Downloads/recoll-we`).
Automatically index pages:: If this is not set, a page will only be saved if requested by clicking the toolbar button or selecting the submenu action.
If this is set, pages will be automatically saved, subject to the rules below.
Also do it for pages with secure content (https):: Enable/disable the same behaviour for https URLs.

NOTE: The options which follow only have an effect if automatic indexing has been activated (see above).

Save by default (when no rules set matches):: This is initially set. It is automatically unset the first time you add an inclusion rule.
Save when both rules sets match:: If set, index the page when both the inclusion and exclusion rule sets match.
URL include rules:: Rules to select the URLs which will be automatically indexed.
URL exclude rules:: Rules to select URLs which will not be indexed.

Rules for both sets can be of three types: domain (select by host name only), wildcard, or regular expression. Rules added through the context menu are of the 'domain' type.

[[rconfig]]
=== Recoll Configuration

The `.rclwe` extension should be added to the `noContentSuffixes` configuration variable. This was not the case by default until Recoll 1.27.6. You can do so by adding the following to the `recoll.conf` configuration file:

    noContentSuffixes+ = .rclwe

This can also be done with the GUI preferences: Preferences->indexing configuration->Local Parameters->Ignored endings

processwebqueue:: the web pages saved by the extension will only be indexed if this is set in the Recoll configuration. This can be done by editing `recoll.conf` or in the GUI preferences: Preferences->Index configuration->Web history
webcachemaxmbs:: the Web cache maximum size can be set through this configuration parameter, which is also accessible from the GUI: Preferences->Index configuration->Web history
webdownloadsdir:: if your browser default downloads directory is not `~/Downloads`, or if you set the subdirectory option in the extension preferences, you will need to set this to the appropriate value in `recoll.conf` (e.g. `webdownloadsdir = ~/Downloads/recoll-we`).
webcachedir:: you can set this to change the location where the Recoll Web cache is stored. By default, this is a `webcache` subdirectory of the Recoll configuration directory.
webqueuedir:: this is the intermediate staging location where the `recoll-we-move-files.py` script stores the pages, and from which the Recoll indexer retrieves them. The default value is `~/.recollweb/ToIndex`. The variable is used by both the script and `recollindex`.
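
As a rough illustration, the variables described in this section could be combined in `recoll.conf` as sketched below. This is an example under stated assumptions, not a recommended setup: the value `1` used to enable `processwebqueue` and the `webcachemaxmbs` value of 100 are chosen for the example only, and the paths simply repeat the example and default values given above.

----
# Illustrative values only: adjust them to your own setup.
# Do not index the extension's own files as regular documents.
noContentSuffixes+ = .rclwe
# Enable indexing of the pages saved by the extension (value assumed).
processwebqueue = 1
# Maximum Web cache size in megabytes (arbitrary example value).
webcachemaxmbs = 100
# Needed only if the extension writes to a Downloads subdirectory.
webdownloadsdir = ~/Downloads/recoll-we
# Staging directory, shown here with its documented default.
webqueuedir = ~/.recollweb/ToIndex
----

`webcachedir` is left out here, so the cache stays in the `webcache` subdirectory of the Recoll configuration directory.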