Teraslice is an open source, distributed computing platform for processing JSON data stored in Elasticsearch. It can be used for many tasks but is particularly adept at migrating and transforming data within and between Elasticsearch clusters and other data stores. It was born and bred in an environment that regularly sees billions of pieces of data per day and is capable of processing millions of records per second.
Here are a few tasks it can help you with:
- Reindexing data at high volumes
- Moving data between clusters
- Moving data out of Elasticsearch into other systems
- Automatic syncing of data from Elasticsearch to other systems
- Exporting data to files
- Importing data from a previous export
- Periodic processing of data stored in an index
In Teraslice you define jobs that specify a pipeline of work to be applied to a slice of data. That work will execute concurrently across many workers to achieve very high re-processing throughput.
The only requirement Teraslice imposes is that the data can be sliced using date ranges. As long as your index has a date field that varies across records, you can use it to slice things up and concurrently reprocess the data in the index.
Jobs are specified as simple JSON documents. Here's a simple reindexing example.
{
    "name": "Reindex",
    "lifecycle": "once",
    "workers": 1,
    "operations": [
        {
            "_op": "elasticsearch_reader",
            "index": "example-logs",
            "type": "logs",
            "size": 10000,
            "date_field_name": "created",
            "full_response": true
        },
        {
            "_op": "elasticsearch_index_selector",
            "type": "change",
            "index": "example-logs-new",
            "id_field": "_key"
        },
        {
            "_op": "elasticsearch_bulk",
            "size": 10000
        }
    ]
}
In this instance, all the operations performed are provided by Teraslice. The elasticsearch_reader takes a date range and slices up the index so that it can be reprocessed. The result is fed to elasticsearch_index_selector, which bundles the incoming data into a bulk request that is then sent to the cluster by elasticsearch_bulk.
Operations are nothing more than JavaScript modules, and writing your own is easy, so custom code can be inserted into the operations flow as needed.
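For example, a minimal custom processor might look roughly like the sketch below. This assumes the legacy `newProcessor`/`schema` module format; the `field` setting and the uppercasing logic are purely illustrative and only mark where your own transformation would go.

```js
'use strict';

// Hypothetical processor op: uppercases a configurable field on each record.
// Sketch only, assuming the legacy newProcessor/schema operation format.
function newProcessor(context, opConfig) {
    // The returned function is called with the array of records in each slice.
    return function processor(data) {
        return data.map((record) => {
            const value = record[opConfig.field];
            if (typeof value === 'string') {
                record[opConfig.field] = value.toUpperCase();
            }
            return record;
        });
    };
}

function schema() {
    // Settings for the op, validated when the job is submitted.
    return {
        field: {
            doc: 'Name of the field to uppercase on each record',
            default: 'name',
            format: 'String'
        }
    };
}

module.exports = { newProcessor, schema };
```

An op like this would then be referenced by name in a job's operations list, between the reader and the writer, just like the built-in operations above.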
Teraslice is currently in alpha status. Single node deployment and clustering support are functional and being refined. APIs are usable but will still be evolving as we work toward a production release. See the list of open issues for other limitations.
Teraslice is written in Node.js and has been tested on Linux and Mac OS X.
- Node.js 8 or above
- Yarn (development only)
- At least one Elasticsearch 5 or above cluster
# Install teraslice globally
npm install --global teraslice
# Or with yarn, yarn global add teraslice
# Install the teraslice CLI client
npm install --global teraslice-cli
# Or with yarn, yarn global add teraslice-cli
# To add additional connectors, use
# npm install --global terascope/terafoundation_kafka_connector
Teraslice requires a configuration file in order to run. The configuration file defines your service connections and system level configurations.
This configuration example defines a single connection to Elasticsearch on localhost with 8 workers available to Teraslice.
The cluster configuration defines this node as a master node. The node will still have workers available and this configuration is sufficient to do useful work if you don't have multiple nodes available. The workers will connect to the master on localhost and do work just as if they were in a real cluster.
teraslice:
    workers: 8
    master: true
    master_hostname: "127.0.0.1"
    name: "teracluster"

terafoundation:
    environment: 'development'
    log_path: '/path/to/logs'
    connectors:
        elasticsearch:
            default:
                host:
                    - "localhost:9200"
Configuration for a worker node is very similar. You just set 'master' to false and provide the IP address where the master node can be located.
teraslice:
    workers: 8
    master: false
    master_hostname: "YOUR_MASTER_IP"
    name: "teracluster"

terafoundation:
    environment: 'development'
    log_path: '/path/to/logs'
    connectors:
        elasticsearch:
            default:
                host:
                    - "YOUR_MASTER_IP:9200"
Once you have Teraslice installed you need a job specification and a configuration file to do something useful with it. See above for simple examples of each.
Starting the Teraslice service on the master node is simple. Just provide it a path to the configuration file.
teraslice -c master-config.yaml
Starting a worker on a remote node is basically the same.
teraslice -c worker-config.yaml
git clone https://github.com/terascope/teraslice.git
cd teraslice
# Install the root-level dependencies
yarn
# Ensure the projects are installed, built and linked together
yarn setup
yarn test # this will run all of the tests
cd packages/[package-name]; yarn test # this will run the tests for an individual package
yarn start -c path/to/config.yaml
# Use latest or any tag from https://hub.docker.com/r/terascope/teraslice/tags/
FROM terascope/teraslice:latest
# Add terafoundation connectors here
RUN yarn add terascope/terafoundation_kafka_connector
docker build -t custom-teraslice .
docker run -it --rm -v "$(pwd)/teraslice-master.yaml:/app/config/teraslice.yml" custom-teraslice
The master publishes a REST-style API on port 5678.
To submit a job, you just post to the /jobs endpoint.
Assuming your job is in a file called 'job.json', it's as simple as:
curl -XPOST "$YOUR_MASTER_IP:5678/v1/jobs" -d@job.json
or using teraslice-cli
tjm job register -c="$YOUR_MASTER_IP:5678/v1/jobs" ./job.json
This will return the job_id (for access to the original job posted) and the job execution context id (the running instance of a job), which can then be used to manage the job. This will also start the job.
{
    "job_id": "5a50580c-4a50-48d9-80f8-ac70a00f3dbd"
}
Please check the API docs at the bottom for a comprehensive, in-depth list of all the APIs. What is listed here is just a brief overview of a few of them.
This will retrieve the job configuration, including '_status', which indicates the execution status of the job.
curl "$YOUR_MASTER_IP:5678/v1/jobs/${job_id}/ex"
or using teraslice-cli
tjm job status ./job.json
Stopping a job stops all execution and frees the workers being consumed by the job on the cluster.
curl -XPOST "$YOUR_MASTER_IP:5678/v1/jobs/${job_id}/_stop"
or using teraslice-cli
tjm job stop ./job.json
Posting a new job will automatically start the job. If the job already exists, the endpoint below will start a new execution of it.
curl -XPOST "$YOUR_MASTER_IP:5678/v1/jobs/${job_id}/_start"
or using teraslice-cli
tjm job start ./job.json
Starting a job with recover will attempt to replay any failed slices from previous runs and will then pick up where it left off. If there are no failed slices, the job will simply resume from where it was stopped.
curl -XPOST "$YOUR_MASTER_IP:5678/v1/jobs/${job_id}/_recover"
Pausing a job will stop execution of the job on the cluster but will not release the workers being used by the job. It simply pauses the execution and stops allocating work to the workers. Workers will complete the work they're doing and then sit idle until the job is resumed.
curl -XPOST "$YOUR_MASTER_IP:5678/v1/jobs/${job_id}/_pause"
or using teraslice-cli
tjm job pause ./job.json
Resuming a job restarts the execution and the allocation of slices to workers.
curl -XPOST "$YOUR_MASTER_IP:5678/v1/jobs/${job_id}/_resume"
or using teraslice-cli
tjm job resume ./job.json
This provides information related to the execution controller and can be useful in monitoring and optimizing the execution of the job.
curl "$YOUR_MASTER_IP:5678/v1/jobs/${job_id}/controller"
This will show you all the connected workers and the tasks that are currently assigned to them.
curl "$YOUR_MASTER_IP:5678/v1/cluster/state"