Multi-process log processing for nodejs
Table of Contents
- Intro
- Example
- Installing
- Status
- Process management
- Committing logs
- Read API
- Procs
- SystemProcs
- LiveProcs
- Waiting for procs to complete
- GC
- Typings
- Tests
- Meta
- Contributing
PipeProc is a data processing system that can be embedded in nodejs applications (eg. electron).
It will be run in a separate process and can be used to off-load processing logic from the main “thread” in a structured manner.
Underneath it uses a structured commit log and a “topic” abstraction to categorize logs.
Inspired by Apache Kafka and Redis streams.
In practice it is a totally different kind of system since it is meant to be run embedded in the main application as a single instance node.
Another key difference is that it also handles the execution of the processing logic by itself and not only the stream pipelining.
It does this by using processors which are custom-written modules/functions that can be plugged to the system, consume topic streams, execute custom logic and push the results to another topic, thus creating a processing pipeline.
const {PipeProc} = require("pipeproc");
const pipeProcClient = PipeProc();
pipeProcClient.spawn().then(function() {
//commit a log to topic "my_topic_1"
//the topic is created if it does not exists
pipeProcClient.commit({
topic: "my_topic",
body: {greeting: "hello"}
}).then(function(id) {
console.log(id);
//1518951480106-0
//{timestamp}-{sequenceNumber}
});
});
npm install --save pipeproc
Spawn the node and connect to it.
If there is a need to spawn multiple nodes on the same host you can use the namespace
option with a custom name.
If a custom namespace
is used, all clients that will connect()
to it will need to provide it.
The node can also use TCP connections instead by setting the host and the port in the tcp
settings.
Clients (local and remote) can then connect()
to it by providing the same host and port.
A socket address can also be used instead of a namespace or tcp options, for example:
ipc:///tmp/mysocket
tcp://127.0.0.1:9999
TLS is also available when using TCP. A cert, key, and ca should be provided for both server and client.
Any client that also needs to connect()
should provide its client keys.
If the node is spawned with TLS, only secure connections will be allowed.
spawn(
options?: {
//use a different ipc namespace
namespace?: string,
//use a tcp socket
tcp?: {
host: string,
port: number
},
//tls settings
tls?: {
server: {
key: string;
cert: string;
ca: string;
},
client: {
key: string;
cert: string;
ca: string;
}
}
//use a socket address directly
socket?: string,
//use an in-memory store instead of the disk adapter
memory?: boolean,
//set the location of the underlying store (if memory is false)
location?: string,
//the number of workers(processes) to use (check the systemProc section below), set to 0 for no workers, defaults to 1
workers?: number,
//the number of processors that can be run concurrently by each worker, defaults to 1
workerConcurrency?: number,
//restart any worker that reaches X systemProc executions, useful to mitigate memory leaks, defaults to 0 (no restarts)
workerRestartAfter?: number,
//tune the garbage collector settings (check the gc section below)
gc?: {minPruneTime?: number, interval?: number} | boolean
}
): Promise<string>;
Connect to an already spawned node.
Usecase: Connect to the same PipeProc instance from a different process (eg. electron renderer) or to a remote instance (only when using TCP)
connect(
options?: {
//use a different ipc namespace
namespace?: string,
//connect to a tcp socket
tcp?: {
host: string,
port: number
},
//tls settings
tls?: {
key: string;
cert: string;
ca: string;
} | false,
//use a socket address directly
socket?: string,
//specify a connection timeout, defaults to 1000ms
timeout?: number
}
): Promise<string>;
Gracefully close the PipeProc instance.
shutdown(): Promise<string>;
This is how you add logs to a topic.
The topic will be created implicitly when its first log is committed.
Multiple logs can be committed in a batch, either in the same topic or to different topics, in that case
the write will be an atomic operation and either all logs will be successfully written or all will fail.
Add a single log to a topic:
pipeProcClient.commit({
topic: "my_topic",
body: {greeting: "hello"}
}).then(function(id) {
console.log(id);
//=> 1518951480106-0
});
commit()
will return the id(s) of the log(s) committed.
Ids follow a format of {timestamp}-{sequenceNumber}
where timestamp is the time the log was committed in milliseconds
and the sequence number is an auto-incrementing integer (starting from 0) indicating the log's position in its topic.
The log's body can be an arbitrarily nested javascript object.
Adding multiple logs to the same topic:
pipeProcClient.commit([{
topic: "my_topic",
body: {
myData: "some data"
}
}, {
topic: "my_topic",
body: {
myData: "more data"
}
}]).then(function(ids) {
console.log(ids);
//=> ["1518951480106-0", "1518951480106-1"]
});
Notice the timestamps are the same since the two logs where inserted at the same time but the sequence number is different and auto-increments.
Adding multiple logs to different topics:
pipeProcClient.commit([{
topic: "my_topic",
body: {
myData: "some data"
}
}, {
topic: "another_topic",
body: {
myData: "some data for another topic"
}
}]).then(function(ids) {
console.log(ids);
//=> ["1518951480106-0", "1518951480106-0"]
});
As before, the timestamps are the same (since they were committed at the same time) but the sequence numbers are both 0
since these two logs are the first logs committed in their respective topics.
Get a slice of a topic.
range(
topic: string,
options?: {
start?: string,
end?: string,
limit?: number,
exclusive?: boolean
}
): Promise<{id: string, body: object}[]>;
pipeProcClient.range("my_topic", {
start: "1518951480106-0",
end: "1518951480107-10"
})
//timestamps only
pipeProcClient.range("my_topic", {
start: "1518951480106",
end: "1518951480107"
})
//from beginning to end
pipeProcClient.range("my_topic")
//from specific timestamp to the end
pipeProcClient.range("my_topic", {
start: "1518951480106"
})
//with a limit
pipeProcClient.range("my_topic", {
start: "1518951480106",
limit: 5
})
//by sequence id
pipeProcClient.range("my_topic", {
start: ":5",
end: ":15"
}) //=> [5..15]
//by sequence id exclusive
pipeProcClient.range("my_topic", {
start: ":5",
end: ":15"
exclusive: true
}) //=> [6..14]
//returns a Promise that resolves to an array of logs
[{
id: "1518951480106-0",
body: {
myData: "hello"
}
}]
Ranges through the topic in an inverted order.
start
and end
should also be inverted. (start >= end).
The API is the same as range()
.
eg. to get the latest log
pipeProcClient.revrange("my_topic", {
limit: 1
})
get the total logs in a topic
pipeProcClient.length("my_topic").then(function(length) {
console.log(length);
});
Procs are the way to consistently process logs of a topic.
Let's start with an example and explain as we go along.
//lets add some logs
await pipeProcClient.commit([{
topic: "numbers",
body: {myNumber: 1}
}, {
topic: "numbers",
body: {myNumber: 2}
}]);
//run a proc on the "numbers" topic
const log = await pipeProcClient.proc("numbers", {
name: "my_proc",
offset: ">"
});
//=> log = {id: "1518951480106-0", body: {myNumber: 1}}
try {
//process the log
const incrementedNumber = log.data.myNumber 1;
//ack the operation and commit the result to a different topic
await pipeProcClient.ackCommit("my_proc", {
topic: "incremented_numbers",
body: {myIncrementedNumber: incrementedNumber}
});
} catch (err) {
//something went wrong on our processing, the proc should be reclaimed
console.error(err);
pipeProcClient.reclaim("my_proc");
}
Procs are the way to consistently fetch logs from a topic, process them and commit the results in a safe and serial manner.
So, what's going on in the above example?
- first we add a log to our "numbers" topic
- then we create a proc named "my_proc" with an offset of ">" (it means start fetching from the very beginning of the topic, see more below) for the "numbers" topic
- the proc returns a log (the log we added on the first commit)
- we do some processing (incrementing the number)
- we then acknowledge the operation and commit our result to a different topic
- we are also catching errors in our processing and the ack, in that case the proc must be reclaimed.
If everything goes well, the next time we call the proc it will fetch us our second log 1518951480106-1
.
If something goes wrong and reclaim()
is called the proc will be "reset" and will fetch the first log again.
Until we call ack()
(or ackCommit()
in this case) to move on or reclaim()
to reset, the proc will not fetch us any new logs.
Here is the whole proc signature:
proc(
//for what topic this proc is for
topic: string,
options: {
//the proc name
name: string,
//the proc offset (See below)
offset: string,
//how many logs to fetch
count?: number,
//reclaim settings, see below
maxReclaims?: number,
reclaimTimeout?: number,
onMaxReclaimsReached?: "disable" | "continue"
}
): Promise<null | {id: string, body: object} | {id: string, body: object}[]>;
offsets are how you position the proc to a specific point in the topic.
>
fetch the next log after the latest acked log for this proc. If no logs have been acked yet, it will start from the beginning of the topic.$>
like>
but it will start from new logs and not from the beginning (logs created after the proc’s creation){{specific_log/timestamp}}
- follows therange()
syntax. It can be a full log name, a partial timestamp or a sequence id(:{{id}}
). The next non-acked log AFTER the match will be returned.
Acking the proc is an explicit operation and should be run after the log has successfully been processed by calling ack()
or ackCommit()
.
ack(
procName: string
): Promise<string>;
Returns the logId of the log we just acked. If our proc fetched multiple logs (using count > 1) all of the logs will be acknowledged as processed and the call instead of an id will return a range (1518951480106-0..1518951480106-1
).
The next time the proc is executed it will fetch the next log after the above logId(or range).
ackCommit()
combines an ack()
and a commit()
in an atomic operation. If either of these fail, both will fail.
If something goes wrong while we are processing our log(s) or a PipeProc error is raised when we ack/commit our result, we should call reclaim
.
This will reset the proc, allowing to retry the operation.
In the proc's signature there are some settings for the reclaims, allowing us to control how reclaims work and not retrying failed operations forever or getting stuck.
- maxReclaims - how many times we can call reclaim on a proc before the
onMaxReclaimsReached
strategy is triggered (defaults to 10, set to -1 for no limit) - reclaimTimeout - In order not to get stuck by a bad processing (failing to call
ack()
orreclaim()
), the proc will be automatically be reclaimed after a certain amount of time by the system, this value sets the time. - onMaxReclaimsReached - what to do when the maxReclaims are reached. By default it will "disable" the proc which will raise an error if we try to use the proc. Can be set to "continue" so we can keep reclaiming forever.
Since procs are persisted and are not meant to be used as an once-off operation (use a simple range() for that) they need to be explicitly destroyed.
pipeProcClient.destroyProc("my_proc") // throws if it doesn't exist
.then(function(status) {
console.log(status);
});
.catch(function(err) {
console.error(err);
});
If a destroyed proc is re-run it will be re-created anew without maintaining the previous state.
Inspect the internal proc's state (last claimed/acked ranges etc). Useful for debugging.
pipeProcClient.inspectProc("my_proc") // throws if it doesn't exist
.then(function(proc) {
console.log(proc);
});
.catch(function(err) {
console.error(err);
});
Manually disable the proc or resume it (eg. after reaching maxReclaims
)
pipeProcClient.disableProc("my_proc") // throws if it doesn't exist
//.resumeProc("my_proc") - throws if already active or doesn't exist
.then(function(proc) {
//proc = same as inspectProc output
})
.catch(function(err) {
console.error(err);
});
Manually executing and managing a proc can be tiresome.
SystemProcs will take care of all creation/execution/management of procs while also distributing the load to multiple workers, let's take a look using the above proc example with incrementing numbers but now using a systemProc
and a processor module.
pipeProcClient.systemProc({
name: "my_system_proc",
offset: ">",
from: "numbers",
processor: "/path/to/myProcessor.js",
to: "incremented_numbers"
//all the other standard Proc options
});
//myProcessor.js
module.exports = function(log, done) {
//log = {id: "1518951480106-0", body: {myNumber: 1}}
done(null, {myIncrementedNumber: log.body.myNumber 1});
};
Processors can publish to multiple topics by setting the to
field to an array of topics.
If the to
field is omitted, the processor will not publish any logs (eg. get a log, process it, write the result to a database)
Instead of using a done
callback, you can also return a promise.
If an error is returned in the done callback (or a rejected promise is returned) the proc will be reclaimed.
Processors can also be inlined:
pipeProcClient.systemProc({
name: "number_writer",
offset: ">",
count: 1,
maxReclaims: -1,
reclaimTimeout: 5000,
from: "numbers",
to: "incremented_numbers",
processor: (log, done) => {
done(null, {n: log.body.myNumber 1});
}
});
With liveprocs you can react to topic changes while not having to keep executing the underlying proc.
liveprocs are run in the process in which they are called and are not distributed to the workers like systemProcs.
liveProc(
options: {
topic: string,
//"all" will point the proc to the beginning of the topic (">" offset)
//"live" will start fetching logs created after the liveProc's creation ("$>" offset)
mode: "live" | "all",
//how many logs to fetch each time
count?: number
}
): ILiveProc;
pipeProcClient.liveProc({
topic: "my_topic",
mode: "all"
}).changes(function(err, logs, next) {
if (err) {
//reclaim has to be called manually
return this.reclaim();
} else if (logs) {
//do something with the logs
//ack() is also manual
this.ack().then(function() {
next();
});
}
});
Inside the changes
function you can either return a promise or use the next
callback to keep listening for changes.
liveProc instances also have simpler versions of all of the proc's methods (that implicitly point to the underlying proc)
interface ILiveProc {
changes: (cb: ChangesCb) => ILiveProc;
inspect: () => Promise<IProc>;
destroy: () => Promise<IProc>;
disable: () => Promise<IProc>;
resume: () => Promise<IProc>;
reclaim: () => Promise<string>;
ack: () => Promise<string>;
ackCommit: (commitLog: ICommitLog) => void;
cancel: () => Promise<void>;
}
When you have multiple systemProcs and/or liveProcs running it is sometimes needed to know when all logs in their topics have been acked.
For example when we need to shutdown and exit the application:
pipeProcClient.waitForProcs().then(function() {
pipeProcClient.shutdown().then(function() {
process.exit(0);
});
});
waitForProcs()
can take a proc name or an array of proc names and it will wait only for those to complete.
If nothing is passed then it will wait for every active proc.
Logs are immutable and cannot be edited or deleted after creation, so a garbage collector is needed to make sure our topics don't grow too large.
Every time it runs it performs the following:
- for topics that have no procs attached it will collect all logs that have passed the
minPruneTime
- for topics that have procs attached, it will collect all logs 2 positions behind the last claimed log range, but only if they have also passed the
minPruneTime
You can configure the minPruneTime
and gc interval
when you spawn
the PipeProc node.
By default they are both set to 30000ms.
By default the gc is disabled. It can be enabled by passing true
on the spawn
's gc options or an object
with prune time and interval settings.
- topic, proc and systemProc metadata are left behind even if the topic is empty and/or no longer used
- the
length()
function will return an incorrect number if a part of the topic is collected - there seems to be a problem with the gc timers on OSX, causing the tests to sometimes fail
Since PipeProc is written in typescript all public interfaces are properly typed and should be loaded automatically in your editor.
You can run the test suite with:
npm install --save-dev
npm run test
Distributed under the The 3-Clause BSD License. See LICENSE
for more information.
- Fork it (https://github.com/zisismaras/pipeproc/fork)
- Create your feature branch (
git checkout -b feature/fooBar
) - Commit your changes (
git commit -am 'Add some fooBar'
) - Push to the branch (
git push origin feature/fooBar
) - Create a new Pull Request