
Statsd Cluster Proxy #284

Merged: 10 commits merged into master, Oct 3, 2013

Conversation

draco2003
Contributor

Initial P.O.C. of a statsd cluster proxy

It was initially built at the Monitorama Hackathon:

monitorama/hackathon#56

@Dieterbe
Contributor

Dieterbe commented Apr 1, 2013

😍

@josegonzalez

❤️

, net = require('net')
, events = require('events')
, hashring = require('hashring')
, config = require('./proxyConfig');
Member

would it make sense to use the config module here? (and adapt it where we need to)
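
For context, the quoted hunk loads a separate proxyConfig module rather than the main config module the reviewer mentions. Below is a purely illustrative sketch of what such a proxyConfig.js might contain, assuming the module.exports shape implied by require('./proxyConfig') and using details mentioned later in this thread (per-node admin ports, a roughly 1-second healthcheck, and a udp4/udp6 switch); the field names and hostnames are assumptions, not the merged file:

// Hypothetical proxyConfig.js sketch; field names are illustrative only.
module.exports = {
  // address/port the proxy itself listens on
  host: '0.0.0.0',
  port: 8125,
  // protocol family for the listening socket; udp6 reportedly did not
  // distribute metrics correctly (see later in this thread)
  udp_version: 'udp4',

  // backend statsd instances that make up the hash ring
  // (8126 is the stock statsd management port)
  nodes: [
    { host: 'graphite01b', port: 8125, adminport: 8126 },
    { host: 'graphite02a', port: 8125, adminport: 8126 },
    { host: 'graphite03a', port: 8125, adminport: 8126 }
  ],

  // how often (ms) to healthcheck each backend before (re)adding it to the ring
  checkInterval: 1000
};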

@mrtazz
Member

mrtazz commented Apr 8, 2013

I think the functions (both the regular ones and the anonymous ones in the event listeners) could use better/more formal documentation. But otherwise 🚢 :shipit:

@mrtazz
Member

mrtazz commented May 12, 2013

@draco2003 do we have any concerns about merging this at this point? Do we need more testing?

@joemiller
Contributor

The test suite fails on 2 tests, at least on my box, with the cluster_proxy branch:

The suite passes on the master branch.

$ git checkout cluster_proxy
Switched to branch 'cluster_proxy'
[root@graphite01b statsd]# ./run_tests.sh

graphite_delete_counters_tests
✔ send_well_formed_posts
✔ send_malformed_post
✖ timers_are_valid

Assertion Message: statsd.numStats should be 1
AssertionError: statsd.numStats should be 1
    at Object.ok (/home/joe/statsd/node_modules/nodeunit/lib/types.js:83:39)
    at /home/joe/statsd/test/graphite_delete_counters_tests.js:218:18
    at null._onTimeout (/home/joe/statsd/test/graphite_delete_counters_tests.js:71:7)
    at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)

✖ counts_are_valid

Assertion Message: statsd.numStats should be 1
AssertionError: statsd.numStats should be 1
    at Object.ok (/home/joe/statsd/node_modules/nodeunit/lib/types.js:83:39)
    at /home/joe/statsd/test/graphite_delete_counters_tests.js:251:18
    at null._onTimeout (/home/joe/statsd/test/graphite_delete_counters_tests.js:71:7)
    at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)

- added a logger like stats.js
- converted call to `util.log()` to use logger `l.log()` (would throw exception otherwise since util was not imported)
- if no backends are available, the app would exit with an error calling `split()` on undefined `statsd_host` returned from `ring.get()`. Instead, catch this and report an error in the logs, but do not exit.
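
A rough sketch of the guard described in the last bullet above (variable names like statsd_host and the logger l are taken from the commit notes; the surrounding code is an assumption, not the actual branch):

// Hypothetical sketch: guard against an empty hash ring instead of crashing.
var key = line.split(':')[0];        // metric name is the key into the ring
var statsd_host = ring.get(key);     // undefined when no backends are in the ring

if (statsd_host === undefined) {
  l.log('No backend statsd servers available for key: ' + key);
} else {
  var parts = statsd_host.split(':'); // previously threw when statsd_host was undefined
  // ... forward the packet to parts[0]:parts[1] ...
}
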
@joemiller
Contributor

I sent a PR against this branch to fix a few things I noticed in testing. #319

6 commits added July 31, 2013 18:59
Add logging to notify when a backend node is brought into the ring. There is also a slight change in logic here: previously `ring.add(node)` would be called on every healthcheck (every 1s). This was probably harmless since it appears to be idempotent, but I don't believe it's necessary, and it would make the logging of node additions excessive.
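
An illustrative sketch of the healthcheck change this commit message describes: only (re)add a node and log when it was previously out of the ring, rather than calling ring.add() on every 1-second check. Function and property names here are assumptions, not the committed code:

// Hypothetical healthcheck handlers; `ring` is the hashring instance and
// `l` the logger mentioned in the earlier commits.
function markHealthy(node) {
  if (!node.inRing) {
    ring.add(node.host + ':' + node.port);
    node.inRing = true;
    l.log('Adding backend to hash ring: ' + node.host + ':' + node.port);
  }
}

function markUnhealthy(node) {
  if (node.inRing) {
    ring.remove(node.host + ':' + node.port);
    node.inRing = false;
    l.log('Removing backend from hash ring: ' + node.host + ':' + node.port);
  }
}
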
@joemiller
Contributor

I"m assuming it"s a good idea to use deleteIdleStats: true on the proxied statsd nodes for the case when a node leaves then re-joins the ring and you would want the "failover node" to stop sending updates? If so, we should probably note this in the docs since it is false by default.

@draco2003
Contributor Author

In the current iteration it does require deleteIdleStats: true, and it makes sense to put that in the docs.
I'm playing with ways of adding a "flush and delete all stats" type admin command that the proxy could call when rebuilding the ring. In our current configuration we can't enable deleteIdleStats and therefore need a way to clear the known stats when the ring changes, to avoid more than one instance sending the same keys.

@joemiller
Contributor

@draco2003 thanks for the heads-up. What is the reason you can't use deleteIdleStats? I'm hesitant about enabling it since our current pystatsd-based setup, which this will be replacing, is not doing this.

Is your branch w/ the admin flush command in a shareable state yet? I'd like to test.

@joemiller
Contributor

@draco2003 I didn"t notice that the statsd mgmt port already had some support for resetting stats, although it doesn"t support all stats yet. It does support the ones we commonly use.

Currently testing this new patch: joemiller@456860b

@draco2003
Contributor Author

Not shareable at this point, but your patch looks good. The only issue I see is that you are doing a single write, which the admin interface will only see as one command, so it won't reset the other two stat types, I believe. You might have to send three separate writes.
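
A rough sketch of the "three separate writes" suggestion (this is not the actual patch; the deleteStats() wrapper, the key list, and the timing are illustrative assumptions, while delcounters/deltimers/delgauges are the existing management commands being discussed):

var net = require('net');

// Issue each management command as its own write so the admin interface,
// which draco2003 notes treats a single write as one command, sees all three.
function deleteStats(host, adminPort, keys) {
  var conn = net.connect(adminPort, host);
  var cmds = [
    'delcounters ' + keys.join(' ') + '\n',
    'deltimers '   + keys.join(' ') + '\n',
    'delgauges '   + keys.join(' ') + '\n'
  ];
  conn.on('connect', function () {
    // space the writes out slightly so they are not coalesced into one packet
    cmds.forEach(function (cmd, i) {
      setTimeout(function () {
        conn.write(cmd);
        if (i === cmds.length - 1) { conn.end(); }
      }, i * 50);
    });
  });
}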

@joemiller
Contributor

Yeah, I noticed that as well after some quick testing. Updated: joemiller/statsd@etsy:cluster_proxy...cluster_proxy

@draco2003
Contributor Author

Does it catch the error?
I think it might need to be:
if (response.indexOf("ERROR") > -1) {
or
if (response.indexOf("ERROR") >= 0) {

@joemiller
Contributor

Thanks for the heads up, fixed.

@joemiller
Contributor

FYI, it does not appear that ipv6 is working. It looked like it was (no errors, things are running, metrics are flowing, etc.) but upon closer inspection metrics were only being sent to the statsd instance on the same node as the statsd-proxy instance. Switching back to udp4 in the proxy config fixed the issue and metrics were flowing between nodes as expected. I'll try to dive a little more into this, although ipv4 is fine in our particular use case.

@joemiller
Contributor

Also, to aid in anyone"s testing, here"s a quick one-liner I have been using to ensure that no more than one instance of statsd owns any particular metric:

# assumes statsd on 3 nodes and management port is 8201:

for i in graphite01b graphite02a graphite03a; do echo counters | nc $i 8201 | awk -F: '{print $1}' ; done | sort | uniq -c | sort -n | more

There should not be any metrics listed with a count > 1

@joemiller
Contributor

Been using this code for a few weeks now and it is working well. With a 4-node cluster, here is the distribution and rate of metrics coming in via statsd-proxy (the stats are actually collected via statsd itself, not statsd-proxy, since the proxy does not expose its internals as metrics the way statsd does):

pantheon_metrics

The patch I proposed above to delete stats upon cluster membership change was not working as expected, perhaps due to a race condition, so I went back to setting deleteIdleStats: true.

@lukevenediger

@joemiller are those 1 minute buckets?

@joemiller
Contributor

yes

@lukevenediger

@joemiller and I"m assuming the load was spread over two boxes?

@joemiller
Contributor

4 boxes. They are 8 GB Rackspace Cloud VMs.

@lukevenediger

@joemiller last question I promise! What did the box health stats look like? CPU, ram, oops?

@lukevenediger

Oops -> iops

@joemiller
Contributor

sure, here are some more stats:

I like this chart: it shows the IO wait on the 4 nodes. Notice they are all steady except for graphite01, which has a noisy neighbor (CLOUD!!!!!):

pantheon_metrics

Overall incoming metrics (this includes metrics from statsd and metrics from other sources. We have a lot of other sources of metrics besides statsd.)

pantheon_metrics

Disk IO (this is tps as reported by sysstat)

pantheon_metrics

carbon-cache CPU usage (as reported by carbon-cache's internal metrics). Notice that 2 nodes hover at a steady 6% and the other 2 are at 12-14%. This is due to the difference in CPUs: the 2 nodes with lower CPU usage are using newer AMD Opterons (~6 months old), while the other 2 boxes are using Opterons that are nearly 5 years old. Welcome to the cloud.
pantheon_metrics

I don"t really track memory usage because they don"t use much memory. graphite-web (under django with greenlets) is the biggest memory user, sometimes peaking to 1GB if it is pulling together a lot of datapoints to render.

@mrtazz
Member

mrtazz commented Oct 1, 2013

Thanks for sharing @joemiller, this is very interesting. @draco2003, is there anything keeping us from merging this into master?

@draco2003
Contributor Author

At this point, no. I think we should merge it in prior to rolling the next NPM module.

@joemiller
Contributor

Looking forward to the full release!

We're still using this in production with good success. I hope to do a writeup on the cluster config, including notes on statsd-proxy, sometime in the near future.

@mrtazz
Member

mrtazz commented Oct 2, 2013

Awesome! @draco2003 can you merge it then? @joemiller looking forward to that write up :)

@draco2003
Contributor Author

Sounds like a plan!

draco2003 added a commit that referenced this pull request Oct 3, 2013
Merging in the Statsd Cluster Proxy
draco2003 merged commit 1dd5244 into master Oct 3, 2013
@draco2003
Contributor Author

@joemiller Mind sending over a pull request with your latest changes to the proxy, and I'll get those merged in as well?

@joemiller
Contributor

Sure. I sent over PR: #348

However, my attempt at resetting stats on ring-membership change was not successful. Instead, I'm running the statsd instances with deleteIdleStats: true right now.
