
Statsd Cluster Proxy #284

Merged: 10 commits merged into master, Oct 3, 2013

Conversation

draco2003
Contributor

Initial P.O.C. of a statsd cluster proxy

It was initially built at the Monitorama Hackathon:

monitorama/hackathon#56

@Dieterbe
Contributor

Dieterbe commented Apr 1, 2013

😍

@josegonzalez

❤️

, net = require('net')
, events = require('events')
, hashring = require('hashring')
, config = require('./proxyConfig');
Member

would it make sense to use the config module here? (and adapt it where we need to)
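
For context, the quoted hunk loads a separate proxyConfig module rather than the main config module the reviewer mentions. Below is a purely illustrative sketch of what such a proxyConfig.js might contain, assuming the module.exports shape implied by require('./proxyConfig') and using details mentioned later in this thread (per-node admin ports, a roughly 1-second healthcheck, and a udp4/udp6 switch); the field names and hostnames are assumptions, not the merged file:

// Hypothetical proxyConfig.js sketch; field names are illustrative only.
module.exports = {
  // address/port the proxy itself listens on
  host: '0.0.0.0',
  port: 8125,
  // protocol family for the listening socket; udp6 reportedly did not
  // distribute metrics correctly (see later in this thread)
  udp_version: 'udp4',

  // backend statsd instances that make up the hash ring
  // (8126 is the stock statsd management port)
  nodes: [
    { host: 'graphite01b', port: 8125, adminport: 8126 },
    { host: 'graphite02a', port: 8125, adminport: 8126 },
    { host: 'graphite03a', port: 8125, adminport: 8126 }
  ],

  // how often (ms) to healthcheck each backend before (re)adding it to the ring
  checkInterval: 1000
};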

@mrtazz
Member

mrtazz commented Apr 8, 2013

I think the functions (both the regular ones and the anonymous ones in the event listeners) could use better/more formal documentation. But otherwise 🚢 :shipit:

@mrtazz
Member

mrtazz commented May 12, 2013

@draco2003 do we have any concerns about merging this at this point? Do we need more testing?

@joemiller
Contributor

The test suite fails on 2 tests, at least on my box, with the cluster_proxy branch:

The suite passes on the master branch.

$ git checkout cluster_proxy
Switched to branch 'cluster_proxy'
[root@graphite01b statsd]# ./run_tests.sh

graphite_delete_counters_tests
✔ send_well_formed_posts
✔ send_malformed_post
✖ timers_are_valid

Assertion Message: statsd.numStats should be 1
AssertionError: statsd.numStats should be 1
    at Object.ok (/home/joe/statsd/node_modules/nodeunit/lib/types.js:83:39)
    at /home/joe/statsd/test/graphite_delete_counters_tests.js:218:18
    at null._onTimeout (/home/joe/statsd/test/graphite_delete_counters_tests.js:71:7)
    at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)

✖ counts_are_valid

Assertion Message: statsd.numStats should be 1
AssertionError: statsd.numStats should be 1
    at Object.ok (/home/joe/statsd/node_modules/nodeunit/lib/types.js:83:39)
    at /home/joe/statsd/test/graphite_delete_counters_tests.js:251:18
    at null._onTimeout (/home/joe/statsd/test/graphite_delete_counters_tests.js:71:7)
    at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)

- added a logger like stats.js
- converted call to `util.log()` to use logger `l.log()` (would throw exception otherwise since util was not imported)
- if no backends are available, the app would exit with an error calling `split()` on undefined `statsd_host` returned from `ring.get()`. Instead, catch this and report an error in the logs, but do not exit.
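
A rough sketch of the guard described in the last bullet above (variable names like statsd_host and the logger l are taken from the commit notes; the surrounding code is an assumption, not the actual branch):

// Hypothetical sketch: guard against an empty hash ring instead of crashing.
var key = line.split(':')[0];        // metric name is the key into the ring
var statsd_host = ring.get(key);     // undefined when no backends are in the ring

if (statsd_host === undefined) {
  l.log('No backend statsd servers available for key: ' + key);
} else {
  var parts = statsd_host.split(':'); // previously threw when statsd_host was undefined
  // ... forward the packet to parts[0]:parts[1] ...
}
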
@joemiller
Contributor

I sent a PR against this branch to fix a few things I noticed in testing. #319

6 commits added July 31, 2013 18:59
Add logging to notify when a backend node is brought into the ring. There is also a slight change in logic here: previously `ring.add(node)` would be called on every healthcheck (every 1s). This was probably harmless since it appears to be idempotent, but I don't believe it's necessary, and it would make the logging of node additions excessive.
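
An illustrative sketch of the healthcheck change this commit message describes: only (re)add a node and log when it was previously out of the ring, rather than calling ring.add() on every 1-second check. Function and property names here are assumptions, not the committed code:

// Hypothetical healthcheck handlers; `ring` is the hashring instance and
// `l` the logger mentioned in the earlier commits.
function markHealthy(node) {
  if (!node.inRing) {
    ring.add(node.host + ':' + node.port);
    node.inRing = true;
    l.log('Adding backend to hash ring: ' + node.host + ':' + node.port);
  }
}

function markUnhealthy(node) {
  if (node.inRing) {
    ring.remove(node.host + ':' + node.port);
    node.inRing = false;
    l.log('Removing backend from hash ring: ' + node.host + ':' + node.port);
  }
}
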
@joemiller
Contributor

I"m assuming it"s a good idea to use deleteIdleStats: true on the proxied statsd nodes for the case when a node leaves then re-joins the ring and you would want the "failover node" to stop sending updates? If so, we should probably note this in the docs since it is false by default.

@draco2003
Contributor Author

In the current iteration it does require deleteIdleStats: true, and it makes sense to put that in the docs.
I'm playing with ways of adding a "flush and delete all stats" type admin command that the proxy could call when rebuilding the ring. In our current configuration we can't enable deleteIdleStats and therefore need a way to clear the known stats when the ring changes, to avoid more than one instance sending the same keys.

@joemiller
Contributor

@draco2003 thanks for the heads-up. What is the reason you can't use deleteIdleStats? I'm hesitant about enabling it since our current pystatsd-based setup, which this will be replacing, is not doing this.

Is your branch w/ the admin flush command in a shareable state yet? I'd like to test.

@joemiller
Contributor

@draco2003 I didn"t notice that the statsd mgmt port already had some support for resetting stats, although it doesn"t support all stats yet. It does support the ones we commonly use.

Currently testing this new patch: joemiller@456860b

@draco2003
Contributor Author

Not shareable at this point, but your patch looks good. The only issue I see is that you are doing a single write, which the admin interface will only see as one command, so it won't reset the other two stat types, I believe. You might have to send three separate writes.
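
A rough sketch of the "three separate writes" suggestion (this is not the actual patch; the deleteStats() wrapper, the key list, and the timing are illustrative assumptions, while delcounters/deltimers/delgauges are the existing management commands being discussed):

var net = require('net');

// Issue each management command as its own write so the admin interface,
// which draco2003 notes treats a single write as one command, sees all three.
function deleteStats(host, adminPort, keys) {
  var conn = net.connect(adminPort, host);
  var cmds = [
    'delcounters ' + keys.join(' ') + '\n',
    'deltimers '   + keys.join(' ') + '\n',
    'delgauges '   + keys.join(' ') + '\n'
  ];
  conn.on('connect', function () {
    // space the writes out slightly so they are not coalesced into one packet
    cmds.forEach(function (cmd, i) {
      setTimeout(function () {
        conn.write(cmd);
        if (i === cmds.length - 1) { conn.end(); }
      }, i * 50);
    });
  });
}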

@joemiller
Contributor

Yeah, I noticed that as well after some quick testing. Updated: joemiller/statsd@etsy:cluster_proxy...cluster_proxy

@draco2003
Contributor Author

Does it catch the error?
I think it might need to be:
if (response.indexOf("ERROR") > -1) {
or
if (response.indexOf("ERROR") >= 0) {

@joemiller
Contributor

Thanks for the heads up, fixed.

@joemiller
Contributor

FYI, it does not appear that ipv6 is working. It looked like it was (no errors, things are running, metrics are flowing, etc.) but upon closer inspection metrics were only being sent to the statsd instance on the same node as the statsd-proxy instance. Switching back to udp4 in the proxy config fixed the issue and metrics were flowing between nodes as expected. I'll try to dive a little more into this, although ipv4 is fine in our particular use case.

@joemiller
Contributor

Also, to aid in anyone"s testing, here"s a quick one-liner I have been using to ensure that no more than one instance of statsd owns any particular metric:

# assumes statsd on 3 nodes and management port is 8201:

for i in graphite01b graphite02a graphite03a; do echo counters | nc $i 8201 | awk -F: '{print $1}' ; done | sort | uniq -c | sort -n | more

There should not be any metrics listed with a count > 1

@joemiller
Contributor

Been using this code for a few weeks now and it is working well. With a 4-node cluster, here is the distribution and rate of metrics coming in via statsd-proxy (the stats are actually collected via statsd itself, not statsd-proxy, since the proxy does not expose its internals as metrics the way statsd does):

pantheon_metrics

The patch I proposed above to delete stats upon cluster membership change was not working as expected, perhaps due to a race condition, so I went back to setting deleteIdleStats: true.

@lukevenediger

@joemiller are those 1 minute buckets?

@joemiller
Contributor

yes

@lukevenediger

@joemiller and I"m assuming the load was spread over two boxes?

@joemiller
Contributor

4 boxes. They are 8 GB Rackspace Cloud VMs.

@lukevenediger

@joemiller last question I promise! What did the box health stats look like? CPU, ram, oops?

@lukevenediger

Oops -> iops

@joemiller
Contributor

sure, here are some more stats:

I like this chart: it shows the IO wait on the 4 nodes. Notice they are all steady except for graphite01, which has a noisy neighbor (CLOUD!!!!!):

pantheon_metrics

Overall incoming metrics (this includes metrics from statsd and metrics from other sources. We have a lot of other sources of metrics besides statsd.)

pantheon_metrics

Disk IO (this is tps as reported by sysstat)

pantheon_metrics

carbon-cache CPU usage (as reported by carbon-cache's internal metrics). Notice that 2 nodes hover at a steady 6% and the other 2 are at 12-14%. This is due to the difference in CPUs: the 2 nodes with lower CPU usage are using newer AMD Opterons (~6 months old), while the other 2 boxes are using Opterons that are nearly 5 years old. Welcome to the cloud.
pantheon_metrics

I don"t really track memory usage because they don"t use much memory. graphite-web (under django with greenlets) is the biggest memory user, sometimes peaking to 1GB if it is pulling together a lot of datapoints to render.

@mrtazz
Member

mrtazz commented Oct 1, 2013

Thanks for sharing @joemiller, this is very interesting. @draco2003, is there anything keeping us from merging this into master?

@draco2003
Contributor Author

At this point, no. I think we should merge it in prior to rolling the next NPM module.

@joemiller
Contributor

Looking forward to the full release!

We're still using this in production with good success. I hope to do a writeup on the cluster config, including notes on statsd-proxy, sometime in the near future.

@mrtazz
Member

mrtazz commented Oct 2, 2013

Awesome! @draco2003 can you merge it then? @joemiller looking forward to that write up :)

@draco2003
Contributor Author

Sounds like a plan!

draco2003 added a commit that referenced this pull request Oct 3, 2013
Merging in the Statsd Cluster Proxy
draco2003 merged commit 1dd5244 into master Oct 3, 2013
@draco2003
Contributor Author

@joemiller Mind sending over a pull request with your latest changes to the proxy, and I'll get those merged in as well?

@joemiller
Contributor

Sure. I sent over PR: #348

However, my attempt at resetting stats on ring-membership change was not successful. Instead, I'm running the statsd instances with deleteIdleStats: true right now.
