= Indexing visited Web pages with the Recoll Firefox extension


== Overview

The link:https://addons.mozilla.org/en-US/firefox/addon/recoll-we/[Recoll-WE]
Firefox add-on saves copies of the Web pages you visit, so that Recoll can
later cache and index them.

== History

The extension works with Recoll in a slightly convoluted way because it has
a complicated history.

- It began its life as a complement to the Beagle indexer, and was then
  copied and modified to work with Recoll.
- This first version was based on the Firefox XUL API, and was almost fully
  rewritten to use the WebExtensions API when XUL was made obsolete, a few
  years ago. The new extension is largely based on code borrowed from the
  *save-page-we* extension.

The current version works with Recoll 1.23.5 or newer.

The repository for the current extension code is hosted on
link:https://framagit.org/medoc92/recoll-we[framagit].

== How things work

Some values in the following can be configured in the Recoll configuration,
see xref:config[further down].

- The extension saves a copy of the visited page data and metadata in
  two files inside your Firefox Downloads directory (or a subdirectory, see
  the xref:config[configuration section]). The files are named something like
  _recoll-we-m-[url hash].rclwe_ for the metadata and _recoll-we-c-[url
  hash].rclwe_ for the data.
- When the Recoll indexer runs, it first processes the normal file system
  files, then, if Web indexing is enabled, it runs the
  `recoll-we-move-files.py` script (from the Recoll filters directory). The
  script moves the files from the Downloads directory to the Recoll Web
  indexing queue directory, normally `~/.recollweb/ToIndex`. This step was
  introduced with the switch to WebExtensions, to avoid modifying the
  indexer itself too much and keep compatibility with the old extension.
- The indexer then runs the Web page indexing code proper, which:
* Indexes the data.
* Saves the data and metadata into the Recoll Web cache storage (by default
  `~/.recoll/webcache`), and deletes the queue files.
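
The move step described above can be pictured with the following minimal
Python sketch. It is not the code of `recoll-we-move-files.py`: the function
names are hypothetical, the directories are the defaults mentioned in the
text, and the sketch assumes that the files keep their names when moved and
that only complete metadata/data pairs should be moved. The real script may
behave differently on both points.

```python
import shutil
from pathlib import Path

# Default locations taken from the text; adjust to your configuration.
DOWNLOADS_DIR = Path.home() / "Downloads"
QUEUE_DIR = Path.home() / ".recollweb" / "ToIndex"

META_PREFIX = "recoll-we-m-"
DATA_PREFIX = "recoll-we-c-"
SUFFIX = ".rclwe"


def find_complete_pairs(downloads_dir):
    """Return (metadata, data) path pairs for which both files exist."""
    pairs = []
    for meta in sorted(Path(downloads_dir).glob(f"{META_PREFIX}*{SUFFIX}")):
        # The part between the prefix and the suffix is the URL hash.
        url_hash = meta.name[len(META_PREFIX):-len(SUFFIX)]
        data = meta.with_name(f"{DATA_PREFIX}{url_hash}{SUFFIX}")
        if data.is_file():
            pairs.append((meta, data))
    return pairs


def move_to_queue(downloads_dir=DOWNLOADS_DIR, queue_dir=QUEUE_DIR):
    """Move complete pairs into the queue directory; return the pair count."""
    queue_dir = Path(queue_dir)
    queue_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for meta, data in find_complete_pairs(downloads_dir):
        # Assumption: moving the data file first means a metadata file in
        # the queue always has its data file next to it.
        shutil.move(str(data), str(queue_dir / data.name))
        shutil.move(str(meta), str(queue_dir / meta.name))
        moved += 1
    return moved
```

Calling `move_to_queue()` with no arguments would use the default
directories. A metadata file whose data file has not been written yet is
left in place and picked up on a later run.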

.The Recoll Web cache is not an archive

NOTE: Old content will be deleted and become unsearchable when new content
needs space. The maximum cache size is xref:rconfig[configurable]. Pages
that you want to archive permanently need to be saved elsewhere, as they
will otherwise eventually disappear from the Recoll results. Recoll can
index `.maff` files, which may be a better choice for archival usage, or
also see the
link:https://addons.mozilla.org/en-US/firefox/addon/save-page-we/[Save Page
WE] Firefox extension.

.Turning the cache into an archive

NOTE: You could conceivably turn the Web cache into an archive by setting
the maximum size absurdly high. While this will work, be aware that space
will be wasted when a new version of a page is stored: storage for the old
versions will be erased, but not reclaimed. It is possible to compact the
cache by copying it, though: the code for this exists in a test driver
(included in the source, but not built or packaged by default).

[[config]]
== Configuration

[[xconfig]]
=== Extension Configuration

By default, after installation, the extension will store all the pages
which you visit.

You can change this behaviour through configuration options in the add-on
preferences page or through the context menu:

Download subdirectory:: Subdirectory of the Firefox Downloads directory
where the extension will create its files. By default, this is empty, and
the `recoll-we-move-files.py` script expects to find its files under
`~/Downloads`. You can set a value to avoid pollution of the main
Downloads directory (e.g. _recoll-we_). In this case, you also need to set
the `webdownloadsdir` configuration variable (e.g. `webdownloadsdir =
~/Downloads/recoll-we`).

Automatically index pages:: If this is not set, a page will only be saved
if requested by clicking the toolbar button or selecting the submenu
action. If this is set, pages will be automatically saved, subject to the
rules below.

Also do it for pages with secure content (https):: Enable/disable the same
behaviour for https URLs.

NOTE: The options which follow only have any effect if automatic indexing has
been activated (see above).

Save by default (when no rules set matches):: This is set initially. It
is automatically unset the first time you add an inclusion rule.

Save when both rules sets match:: If set, index the page when both the
inclusion and exclusion rule sets match.

URL include rules:: Rules to select the URLs which will be
automatically indexed. 

URL exclude rules:: Rules to select URLs which will not be indexed.

Rules for both sets can be of three types: domain (just select by host
name), wildcard, or regular expression. Rules added through the context
menu are of the 'domain' type.
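
One plausible reading of how these options combine is sketched below in
Python. This is not the extension's code (which is JavaScript): the `Rule`
and `Options` classes, their field names and their default values are
hypothetical, and the exact matching semantics (exact host name comparison
for domain rules, shell-style wildcards and regular expressions applied to
the whole URL) are assumptions made for illustration.

```python
import fnmatch
import re
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Rule:
    kind: str      # "domain", "wildcard" or "regexp"
    pattern: str

    def matches(self, url):
        if self.kind == "domain":
            # Assumption: a domain rule compares the host name exactly.
            return urlsplit(url).hostname == self.pattern.lower()
        if self.kind == "wildcard":
            return fnmatch.fnmatchcase(url, self.pattern)
        if self.kind == "regexp":
            return re.search(self.pattern, url) is not None
        raise ValueError(f"unknown rule kind: {self.kind}")


@dataclass
class Options:
    auto_index: bool = True
    index_https: bool = True
    save_by_default: bool = True
    save_when_both_match: bool = False
    include: list = field(default_factory=list)
    exclude: list = field(default_factory=list)


def should_auto_save(url, opts):
    """Decide whether a visited URL would be saved automatically."""
    if not opts.auto_index:
        return False
    if urlsplit(url).scheme == "https" and not opts.index_https:
        return False
    included = any(rule.matches(url) for rule in opts.include)
    excluded = any(rule.matches(url) for rule in opts.exclude)
    if included and excluded:
        return opts.save_when_both_match
    if included:
        return True
    if excluded:
        return False
    # Neither rule set matched: fall back to the default.
    return opts.save_by_default
```

For example, with an include rule `Rule("domain", "example.org")`, an
exclude rule `Rule("wildcard", "*private*")` and `save_by_default` unset,
this sketch saves `https://example.org/page` but not
`https://example.org/private/x`, unless `save_when_both_match` is set.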



[[rconfig]]
=== Recoll Configuration

The `.rclwe` extension should be added to the `noContentSuffixes`
configuration variable. This was not the case by default until Recoll
1.27.6. You can add it by adding the following to the `recoll.conf`
configuration file:

    noContentSuffixes+ = .rclwe

This can also be done with the GUI preferences:

    Preferences->indexing configuration->Local Parameters->Ignored endings

processwebqueue:: the Web pages saved by the extension will only be indexed
if this is set in the Recoll configuration. This can be done by editing
`recoll.conf` or in the GUI preferences:

    Preferences->Index configuration->Web history

webcachemaxmbs:: the Web cache maximum size can be set through this
configuration parameter, which is also accessible from the GUI:

    Preferences->Index configuration->Web history

webdownloadsdir:: if your browser's default downloads directory is not
`~/Downloads`, or if you set the subdirectory option in the extension
preferences, you will need to set this to the appropriate value in
`recoll.conf` (e.g. `webdownloadsdir = ~/Downloads/recoll-we`).

webcachedir:: you can set this to change the location where the Recoll Web
cache is stored. By default, this is a `webcache` subdirectory of the
Recoll configuration directory.

webqueuedir:: this is the intermediate staging location where the
`recoll-we-move-files.py` script stores the pages, and where they are
retrieved by the Recoll indexer. The default value is
_~/.recollweb/ToIndex_. The variable is used by the script and by
`recollindex`.
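
To check which values apply to a given configuration, the variables above
can be collected with a small Python sketch such as the following. It is an
illustration only and does not use any Recoll API: the function names are
hypothetical, it only understands plain `name = value` lines placed before
the first `[section]` header, and it ignores line continuations. The
fallback values are the defaults stated in this document; no default is
assumed for `processwebqueue` or `webcachemaxmbs`, which are reported as
`None` when unset.

```python
def parse_top_level(text):
    """Parse 'name = value' lines that appear before any [section] header."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("["):
            break  # sections are ignored in this sketch
        name, sep, value = line.partition("=")
        if sep:
            values[name.strip()] = value.strip()
    return values


def web_settings(conf_text, confdir="~/.recoll"):
    """Return the Web indexing variables, with documented defaults filled in."""
    values = parse_top_level(conf_text)
    return {
        "processwebqueue": values.get("processwebqueue"),
        "webcachemaxmbs": values.get("webcachemaxmbs"),
        "webdownloadsdir": values.get("webdownloadsdir", "~/Downloads"),
        "webqueuedir": values.get("webqueuedir", "~/.recollweb/ToIndex"),
        "webcachedir": values.get("webcachedir", f"{confdir}/webcache"),
    }
```

Passing the contents of `recoll.conf` to `web_settings()` returns a
dictionary in which `webdownloadsdir` can be compared with the download
subdirectory set in the extension preferences.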