Workflows

The two commands at the core of Extract are queue and spew. The former recursively scans a given path and adds every file it finds to a distributed queue (files matching given exclusion patterns can be skipped). The latter either pulls files from a distributed queue, or scans a path in a separate thread using its own internal queue, and spews out text and metadata.

Text Extraction

If you're only processing a few thousand files, then running a single instance of Extract without a queue is sufficient:

extract spew -r redis -o file --outputDirectory /path/to/text /path/to/files

The -r parameter tells Extract to save the result of each processed file to Redis. That way, if you have to stop the process, you can resume where you left off, as successfully processed files will be skipped.
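
If you're curious about what Extract stores, you can poke at the report directly with redis-cli. How Extract names its Redis keys is an implementation detail and may vary between versions, so scan for them rather than guessing (widen the pattern if nothing matches):

redis-cli --scan --pattern '*report*'

Depending on the data type of the key you find, HLEN or SCARD will give you a rough count of processed files.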

You'll probably want to do something more useful with extracted text than save to disk. In that case, you can get Extract to write to a Solr endpoint:

extract spew -r redis -o solr -s http://solr-1:8983/solr/my_core /path/to/files

When spewing is done, trigger a commit so that results will show up in Solr:

extract commit -s http://solr-1:8983/solr/my_core

Or roll back the changes:

extract rollback -s http://solr-1:8983/solr/my_core
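
To sanity-check that the documents actually arrived after a commit, you can query the core with a plain Solr select request (not an Extract command):

curl 'http://solr-1:8983/solr/my_core/select?q=*:*&rows=0'

The numFound value in the response should roughly match the number of files processed, minus any failures recorded in the Redis report.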

Distributed Text Extraction

This is the workflow we use at ICIJ for processing millions of files in parallel. The --queueName parameter is used to namespace the job and avoid conflicts with unrelated jobs using the same Redis server.

First, queue the files from your directory. For best performance, you should probably run this directly on the machine to which the volume containing the files is attached, rather than over the network:

cd /mnt/my_files
extract queue --queueName job-1 --redisAddress redis-1:6379 ./ 2> queue.log
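
To confirm the queue was populated before firing up the workers, you can look for it in Redis. The exact key name Extract uses for a queue is version-dependent, so this is only a sketch; scan for keys containing the queue name:

redis-cli -h redis-1 -p 6379 --scan --pattern '*job-1*'

If the key you find is a Redis list, LLEN on it will tell you how many paths are queued.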

You will be running Extract processes on many different machines, so you should export your file directory as NFS or another kind of network share. After that, mount the share at the same path on each of your extraction cluster machines.

With NFS, this would be done in the following way (where nfs-1 is the hostname of your file server):

sudo mkdir /mnt/my_files
sudo mount -t nfs4 -o ro,proto=tcp,port=2049 nfs-1:/my_files /mnt/my_files
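
If the workers may reboot during a long run, you can make the mount permanent by adding it to /etc/fstab on each machine (ordinary NFS configuration, nothing Extract-specific):

nfs-1:/my_files  /mnt/my_files  nfs4  ro,proto=tcp,port=2049  0  0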

You can then start processing the queue on each of your machines:

cd /mnt/my_files
extract spew --queueName job-1 -q redis -o solr -s http://solr-1:8983/solr/my_core -r redis --redisAddress redis-1:6379 2> extract.log
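
Large jobs can run for days, so it's worth protecting each worker from a dropped SSH session. The sketch below simply wraps the same command in nohup; a systemd service or a terminal multiplexer such as tmux would work equally well:

nohup extract spew --queueName job-1 -q redis -o solr -s http://solr-1:8983/solr/my_core -r redis --redisAddress redis-1:6379 2> extract.log &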

This spew command instructs Extract to pull files from the Redis queue (-q redis), output extracted text to Solr (-o solr) at the given address, and report results to Redis (-r redis).
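
As in the single-machine workflow, the indexed documents won't be visible in Solr until a commit is issued. Once all the workers have drained the queue, trigger it once from any machine:

extract commit -s http://solr-1:8983/solr/my_core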

Backing Up and Restoring a Queue or Report

It's possible to dump a queue or report to a backup file in case you need to restore it later on.

extract dump-queue --queueName suspicious-files --redisAddress redis-1:6379 queue.json
extract dump-report --reportName suspicious-files --redisAddress redis-1:6379 report.json

Restoring is simple:

extract load-queue --queueName suspicious-files --redisAddress redis-1:6379 queue.json
extract load-report --reportName suspicious-files --redisAddress redis-1:6379 report.json

It's also possible to use I/O redirection if your command-line environment supports it. For example, the queue could be dumped, wiped and then restored like so:

extract dump-queue --queueName suspicious-files --redisAddress redis-1:6379 > queue.json
extract wipe-queue --queueName suspicious-files --redisAddress redis-1:6379
extract load-queue --queueName suspicious-files --redisAddress redis-1:6379 < queue.json
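
Because the dump goes to standard output, it can also be piped through other tools. For example, to keep a compressed backup of a large queue and restore from it later (plain shell, nothing Extract-specific):

extract dump-queue --queueName suspicious-files --redisAddress redis-1:6379 | gzip > queue.json.gz
gunzip -c queue.json.gz | extract load-queue --queueName suspicious-files --redisAddress redis-1:6379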

Reindexing After Solr Schema Changes

You might have made a mistake in your original schema and now need to change the type of a field, or you might have changed the way it's tokenised. You can edit the schema and make as many changes as you like, but the original data will still be stored and indexed as specified in the old schema.

There are two ways you can work around this: reindex all your files again, or use the solr-copy command, which pulls the fields you specify from each document and adds them back to the same document, forcing reindexing.

A common example is changing a string field to a Trie number field after indexing. Solr will then return an error message in place of these fields. To fix them automatically, run solr-copy with a filter on the bad field:

extract copy -f "my_numeric_field:* AND -my_numeric_field:[0 TO *]" -s ...

This will cause the copy command to run only on those documents that have a non-number value in the number-type field.
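
Before running the copy, you can check how many documents are affected by issuing the same filter as an ordinary Solr query (a standard select request; only the numFound value in the response matters here):

curl 'http://solr-1:8983/solr/my_core/select' --data-urlencode 'q=my_numeric_field:* AND -my_numeric_field:[0 TO *]' --data-urlencode 'rows=0'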

Internally, Extract will perform an atomic update of the specified fields only, without causing the other document fields to be lost.
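
For reference, a Solr atomic update looks like the request below: it sets only the named field on an existing document and leaves the rest in place. This is just an illustration of the underlying Solr API with a made-up document ID, not the exact request Extract builds internally:

curl 'http://solr-1:8983/solr/my_core/update' -H 'Content-Type: application/json' -d '[{"id": "doc-1", "my_numeric_field": {"set": 42}}]'

Note that atomic updates only preserve the other fields if they are stored (or have docValues), which is worth checking in your schema before relying on this workflow.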