[dev.icinga.com #8714] Add priority queue for disconnect/programstatus update events #2746
Comments
Updated by TheSerapher on 2015-03-12 13:07:28 00:00 This is just a guess, but this could explain the deadlocks and lock timeouts we have using IDO DB with two masters running. If the real master is not updating its status properly, the secondary master (which uses the status table to decide whether the actual master is active or not) may enable the IDO connection since the active master has not reported as being active in time? |
Updated by TheSerapher on 2015-03-12 13:12:12 00:00 As I wrote that, the master finally "caught up" and has updated its status:
|
Updated by TheSerapher on 2015-03-12 15:57:02 00:00 Did some strace on the server to see if it's inserting into that table; even though it is, the timestamps being inserted are already way off:
Checking against the table, the FROM_UNIXTIME matches the inserted record:
This seems to point at a full SQL queue in the master process. How can I confirm or check the existing master queues for outstanding queries? Is there a way to improve this? |
Updated by mfriedrich on 2015-03-12 16:06:51 00:00 The `icinga` check provides performance data for queued ido items inside the feature objects. Might be a good idea to check on that. |
Updated by TheSerapher on 2015-03-12 16:25:32 00:00 dnsmichi wrote:
Thanks, I will certainly check that out tomorrow. Hopefully this will give some useful information. |
Updated by TheSerapher on 2015-03-13 08:18:18 00:00 Scratch that, grep FTW :-) |
Updated by mfriedrich on 2015-03-13 08:36:33 00:00 There's a search on docs.icinga.org as well now. |
Updated by TheSerapher on 2015-03-13 08:51:55 00:00 So we enabled this module but it doesn't look like the output is telling us how long the IDO SQL queue is:
At the same time, we have now reduced the number of checkers. The idea behind this: when starting checkers, we see high load and (we assume) all service checks being triggered (see https://dev.icinga.org/issues/8670), whose results we'd assume are sent to the master for processing and inserting into the DB. To reduce this initial load on the master, we shut down the entire cluster, started only the master and added two hardware checkers to it. As expected, those triggered the service checks and ran under higher load, but this time the master was able to keep up, and since then status updates have been fine. Overnight, we had a 3 - 4 minute gap between the actual time and the master's updates. Looks okay now:
I kept reloading the status information to see if any delay was adding up, but it looked okay so far. At one point it took up to 20s before refreshing back down to 5s, but then it caught up completely again. After re-adding one additional checker and checking the status updates again, we are seeing the same issues:
This is a smaller VM than our hardware, but this should not affect the status update intervals on the master. I am not sure what else I can do to support here, but it's pretty apparent that something isn't working right with many checker nodes active. In this current state, we can't switch our monitoring to Icinga 2 until we can see proper performance from our masters. EDIT: Three checkers seem to be the magic threshold. Once we add the third checker, the master updates start lagging and getting worse. Removing the checker doesn't recover the status updates, but at least they are not getting worse anymore. Restarting the master does not fix this - maybe because the still-running checkers are posting check results once it's back online. EDIT2: Started the master node with debug log enabled; it seems that a starting master inserts the entire system state into the IDO DB when starting up. No checkers have been turned on, but we see tons of queries (INSERT/UPDATE):
After the startup finally completes (which takes a while with that many hosts and services) the master is running and starts updating the status table with correct timestamps:
Adding a single checker confirms what I suspected: it starts triggering the plugins, increasing system load, and sending all results to the master. This again takes quite some time. During this time, we are not seeing any updates to the status table, which starts to cause the delays on the interface too. Once the checker completes this initial boot phase, the master updates the status table again as expected. Since the active master is not updating the status, the passive one assumes it's down and would take over? We have disabled the secondary master since we had issues with it as soon as we added our service checks. |
Updated by gbeutner on 2015-03-13 11:16:03 00:00 This might not solve your problem but maybe it makes diagnosing this a bit easier: The latest snapshots have a new check for this: Graphite: Config:
|
Updated by TheSerapher on 2015-03-13 11:19:39 00:00 Excellent, we will look at this in more detail starting Monday. So far, one thing that's odd is that the master is not doing any updates as soon as a client connects via the API and pumps in initial data. I am sure that's not intended, since it would cause a secondary master to take over even though there is no reason to do so. |
Updated by gbeutner on 2015-03-13 12:00:54 00:00 As far as I can see we're updating the icinga_customvariablestatus table even though the custom variables haven't been changed. This seems to cause an insane amount of redundant queries. |
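For illustration, a minimal C++ sketch of one way to avoid those redundant queries: remember the last value written per custom variable and only queue an UPDATE when it actually changed. This is not the fix that was eventually applied; the function and key layout are made up for the example.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>

// (object id, varname) -> last value written to icinga_customvariablestatus
static std::map<std::pair<long, std::string>, std::string> l_LastWritten;

// Returns true only when the value differs from what was written last time.
bool NeedsCustomVarUpdate(long objectId, const std::string& name, const std::string& value) {
    auto key = std::make_pair(objectId, name);
    auto it = l_LastWritten.find(key);
    if (it != l_LastWritten.end() && it->second == value)
        return false; // unchanged: skip the redundant UPDATE
    l_LastWritten[key] = value;
    return true;
}

int main() {
    std::cout << NeedsCustomVarUpdate(101, "os", "Linux") << "\n"; // 1: first write goes through
    std::cout << NeedsCustomVarUpdate(101, "os", "Linux") << "\n"; // 0: redundant, skipped
    std::cout << NeedsCustomVarUpdate(101, "os", "BSD") << "\n";   // 1: value actually changed
    return 0;
}
```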
Updated by TheSerapher on 2015-03-13 12:40:51 00:00 gunnarbeutner wrote:
That's a great start. Granted, our master may just not be powerful enough, but looking at the load on it (13:29:02 up 7 days, 5:19, 1 user, load average: 0.22, 0.20, 0.22) and CPU usage (iostat: 93.07% idle), this should really not be an issue. Just for fun I will run the master in debug mode and check the queries again. |
Updated by gbeutner on 2015-03-13 12:50:24 00:00
|
Updated by TheSerapher on 2015-03-13 13:04:59 00:00 Here is my quick-and-dirty run over a debug log of about 4 minutes 30 seconds when starting up the master:
And the same filter with the master having been active for 10 minutes before checking, again gathering 4 minutes 30 seconds of data:
Maybe this helps too :-) |
Updated by gbeutner on 2015-03-13 13:11:06 00:00 I've triggered a manual rebuild for the snapshot packages (https://build.icinga.org/jenkins/job/icinga2-rpm-snapshot-packages/). Can you re-test this problem with the latest snapshots (once the build is done)? |
Updated by TheSerapher on 2015-03-13 13:14:23 00:00 gunnarbeutner wrote:
Found them, will test! |
Updated by TheSerapher on 2015-03-13 13:40:49 00:00 Startup phase prior to running the first update to icinga_programstatus. The time from when we see the SELECT on the table until we write for the first time was about 3.5 minutes:
Runtime information about 4 minutes after first status update:
Looks way, way better! Nice work :-) Now back to the initial startup issues. I would expect a master to do status updates to the program status table no matter what else is going on. In the end, that process will determine whether a secondary master will take over or not. If we are not putting any status updates into the database until the initial startup is completed, due to wiping and re-filling the tables, another master that may already be running could take over, leading to lock waits or deadlocks? |
Updated by mfriedrich on 2015-03-13 13:52:38 00:00
|
Updated by mfriedrich on 2015-03-13 13:58:34 00:00 TheSerapher wrote:
In Icinga 1.x we had the 'config_dump_in_progress' column as a workaround to signal the interfaces that something blocks the program status updates. Your description sounds familiar in terms of the query queue blocking the program status updates. Since we rely on the program status update time in DB IDO HA mode, this becomes a critical problem; you're absolutely right about that. Probably the program status updates should be sent through a different connection (or a priority queue) so they always reach the database in time. Not sure about the right implementation though. |
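For illustration, a minimal C++ sketch of the priority-queue idea: a worker drains a high-priority queue (program status heartbeats) before the regular IDO query queue. This is not Icinga 2's actual query queue; the class, query strings, and the count-based worker loop are invented for the example.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

class PrioritizedQueryQueue {
public:
    void Enqueue(const std::string& query, bool highPriority) {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            (highPriority ? m_HighPriority : m_Normal).push_back(query);
        }
        m_CV.notify_one();
    }

    // Worker loop: the high-priority queue (program status heartbeats) is always
    // drained before regular IDO queries. 'count' keeps the demo finite.
    void Run(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            std::string query;
            {
                std::unique_lock<std::mutex> lock(m_Mutex);
                m_CV.wait(lock, [this] { return !m_HighPriority.empty() || !m_Normal.empty(); });
                std::deque<std::string>& q = !m_HighPriority.empty() ? m_HighPriority : m_Normal;
                query = q.front();
                q.pop_front();
            }
            std::cout << "executing: " << query << "\n"; // placeholder for the real DB call
        }
    }

private:
    std::mutex m_Mutex;
    std::condition_variable m_CV;
    std::deque<std::string> m_HighPriority; // e.g. icinga_programstatus heartbeats
    std::deque<std::string> m_Normal;       // bulk state/config/history queries
};

int main() {
    PrioritizedQueryQueue queue;
    queue.Enqueue("INSERT INTO icinga_servicechecks ...", false);
    queue.Enqueue("UPDATE icinga_programstatus SET status_update_time = NOW() ...", true);

    std::thread worker(&PrioritizedQueryQueue::Run, &queue, std::size_t(2));
    worker.join(); // the programstatus UPDATE is executed first despite being enqueued last
    return 0;
}
```

A real implementation would execute the statements against the IDO connection and need a clean shutdown path instead of a fixed iteration count.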
Updated by TheSerapher on 2015-03-13 14:04:57 00:00 I can't get the IDO plugin to work:
EDIT: Being silly again, I tried to run this on a checker, so I moved it to the master, but it's in state pending. Will try to enable the checker feature on our master and hope it checks properly then. |
Updated by TheSerapher on 2015-03-13 14:30:52 00:00 Here is the result from the check command; this is after the server restart, as soon as the first status update was completed:
51k outstanding queries :-/ |
Updated by TheSerapher on 2015-03-13 14:35:06 00:00 And here just a few intervals later:
So it is safe to say that a server/cluster restart will cause issues when it comes to status updates. Is there any way that these status updates could run outside the regular SQL queue, so that they run all the time as soon as the IDO connection is established? That would help with active/passive master setups. And what about those service checks on each node that are sent to the master after reloads and restarts? Could these maybe be skipped in a clustered scenario until a master asks a checker node for an active check? Just throwing in ideas here. We really want this setup to work since it's far superior to any other setup we have had so far. Ironing those issues out would make this system usable for us! |
Updated by TheSerapher on 2015-03-13 14:49:15 00:00 Tried to do a failover by restarting Icinga 2 on the active master. As expected, the master stopped running status updates; after 1 minute the passive master took over while the old active one was still starting up and running queries. New Master
It looks like the master/master setup is not working as soon as we are dealing with long startup times and initialisation phases. Even a simple restart or reload of an active or passive master can cause the lock waits. And these lock waits even kind of make sense: we don't want the restarting master to delete all entries from the tables during startup while the new master is updating those entries! Hopefully there will be a solution to this. |
Updated by TheSerapher on 2015-03-13 15:14:00 00:00 Totally missed your update there, dnsmichi! I agree that this is indeed a critical issue and makes clusters with a lot of services and checkers difficult. As far as I can tell we have two issues here
I think for the first issue, a high-priority queue would be a good idea. This could also be used for other important cross-system messages that rely on up-to-date information. The second issue is way more complicated, but addressing it would benefit cluster restart times and recoveries. I can see why data is wiped and re-inserted (service changes or removals), but maybe a first step could be to batch inserts: instead of a single query for every insert, make it one query inserting many rows. Or find a way to compare database information with the actual configuration objects and remove/change only those rows in the DB that have been modified/removed in the config. |
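As a rough sketch of that batching idea in C++: collect queued rows for the same table and emit one multi-row INSERT instead of one statement per row. The struct and column list are illustrative, not the exact IDO schema; the ON DUPLICATE KEY UPDATE clause assumes MySQL, and a real implementation would escape or parameterize the values.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct CustomVarRow {
    long objectId;
    std::string name;
    std::string value;
};

// Build a single multi-row INSERT from all rows queued for the same table.
std::string BuildBatchInsert(const std::vector<CustomVarRow>& rows) {
    std::ostringstream sql;
    sql << "INSERT INTO icinga_customvariablestatus (object_id, varname, varvalue) VALUES ";
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (i > 0)
            sql << ", ";
        // Placeholder formatting; real code must escape or bind the values.
        sql << "(" << rows[i].objectId << ", '" << rows[i].name << "', '" << rows[i].value << "')";
    }
    sql << " ON DUPLICATE KEY UPDATE varvalue = VALUES(varvalue)";
    return sql.str();
}

int main() {
    std::vector<CustomVarRow> rows = {
        {101, "os", "Linux"},
        {101, "location", "dc1"},
        {102, "os", "Linux"}
    };
    // One round trip instead of three separate INSERT statements.
    std::cout << BuildBatchInsert(rows) << "\n";
    return 0;
}
```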
Updated by mfriedrich on 2015-03-13 15:32:15 00:00 TheSerapher wrote:
Feel free to create a patch :)
I don't see this as part of the original problem. That's by design, and if you want to change or discuss it, please open a new issue. The config dump itself, and also the initial state dump, is a different story. Imho fixing the status updates not reaching the programstatus table, as well as other (yet-to-be-found) performance tweaks, would help solve the problem. Other than that, we could just signal the query processing thread on the queue to fire a new program status update before executing any other query. A workaround for your problem - an ugly one - is an external cronjob on the box updating the status_update_time and nodename from a separate connection (but only on the active master, and control that with pacemaker or similar). |
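A minimal sketch of that workaround as a small program one could run from cron, assuming the MySQL C API and the icinga_programstatus column names; credentials, the database name and instance_id = 1 are placeholders, and per the note above the node/endpoint name would need the same treatment.

```cpp
#include <mysql/mysql.h>
#include <cstdio>

int main() {
    MYSQL* conn = mysql_init(nullptr);
    // Placeholder credentials and database name - adjust to the local IDO setup.
    if (!mysql_real_connect(conn, "localhost", "icinga", "secret", "icinga_ido",
                            0, nullptr, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return 1;
    }

    // Touch the heartbeat column the passive master looks at; instance_id = 1 is
    // an assumption, look it up in icinga_instances for the real setup.
    const char* update =
        "UPDATE icinga_programstatus "
        "SET status_update_time = NOW() "
        "WHERE instance_id = 1";

    if (mysql_query(conn, update) != 0)
        std::fprintf(stderr, "update failed: %s\n", mysql_error(conn));

    mysql_close(conn);
    return 0;
}
```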
Updated by TheSerapher on 2015-03-13 15:38:33 00:00 If I were able to code anything in C/C++ I would jump in - but I really can't and don't wanna mess things up ;-) I agree that handling the status update properly would already help a lot. But what about a proper failover too? Let that init phase take a while, but at least fail over to a second node if one is available during that time. Masters could also use the API to update their states, and only if both API and IDO status are negative do the failover. Yet I think the initial load needs addressing since no updates will be done until the load completes. Will make a separate issue for that then. |
Updated by gbeutner on 2015-03-13 19:30:35 00:00
|
Updated by gbeutner on 2015-03-13 19:32:37 00:00 I guess we could prioritize certain kinds of queries. On another note #8738 should improve query performance for IDO (MySQL only for now) significantly as it eliminates round-trips for most database queries. |
Updated by TheSerapher on 2015-03-13 19:43:38 00:00 If you've got another snapshot I will try to give it a whirl. The former would probably make it compliant with Postgres as well while doing the same thing. EDIT: Word |
Updated by gbeutner on 2015-03-13 20:34:47 00:00 FYI the Ubuntu trusty packages are built for each commit. So by now you should already have updated packages on packages.icinga.org. Implementing this for PostgreSQL will probably be a bit more difficult unfortunately because they don't seem to support multiple statements (well, they do, but you only get a result object for the last statement which makes things a bit awkward). |
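To illustrate the round-trip elimination #8738 is about, here is a minimal C++ sketch using the MySQL C API's CLIENT_MULTI_STATEMENTS flag: several statements go out in one packet and the client then drains one result per statement (the part that gets awkward on PostgreSQL, as noted above). Connection parameters and the statements themselves are placeholders, not the queries Icinga 2 actually sends.

```cpp
#include <mysql/mysql.h>
#include <cstdio>

int main() {
    MYSQL* conn = mysql_init(nullptr);
    // Placeholder connection parameters; CLIENT_MULTI_STATEMENTS enables batching.
    if (!mysql_real_connect(conn, "localhost", "icinga", "secret", "icinga_ido",
                            0, nullptr, CLIENT_MULTI_STATEMENTS)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return 1;
    }

    // Two statements, a single network round trip.
    const char* batch =
        "UPDATE icinga_programstatus SET status_update_time = NOW() WHERE instance_id = 1; "
        "UPDATE icinga_hoststatus SET status_update_time = NOW() WHERE hoststatus_id = 1";

    if (mysql_query(conn, batch) != 0) {
        std::fprintf(stderr, "query failed: %s\n", mysql_error(conn));
        mysql_close(conn);
        return 1;
    }

    // Every statement in the batch produces its own result; all of them must be drained.
    do {
        MYSQL_RES* result = mysql_store_result(conn);
        if (result)
            mysql_free_result(result); // UPDATEs return no result set, SELECTs would
    } while (mysql_next_result(conn) == 0);

    mysql_close(conn);
    return 0;
}
```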
Updated by mfriedrich on 2015-03-14 09:25:54 00:00 Reducing the config dump to a minimum is always desired, but even if we solve it for pgsql as well, it wouldn't solve the original issue with the DB IDO HA feature failure. I'm not sure if a priority queue is the best solution here, although it will indicate when the entire connection is blocking (i.e. the db server is slow). A different idea would be to run programstatus table updates in a separate thread with its own connection. It has to wait for the instance_id, and could check for the query rate != 0 (and a signal that the config dump is running), for example. As long as icinga2 is running, the table gets updated and illustrates what the main process is doing. Not sure though about the impact that has on other tables and delayed updates (interfaces might rely on programstatus only). Web2 visualizes that by last/next check and interval minus now() - so not a problem with stale data. |
Updated by TheSerapher on 2015-03-14 09:35:25 00:00 Yes, the config dump is a separate issue. Haven't had a chance to open a new ticket yet. I think for the status update it should be a combination of high-priority IDO queries for masters and the API in the config HA zone. We have a heartbeat system between masters in place using the API already, so why not expand that and use the API as the means for masters to communicate state information. Using IDO could be an addition to that, to avoid a split-brain scenario in case the API fails for some reason. And yes, the status table is used by frontends, so it would still require updates unless frontends can retrieve that information from a process directly (public API?). |
Updated by TheSerapher on 2015-03-16 08:34:59 00:00 Created a ticket for server startup config dump: https://dev.icinga.org/issues/8756 |
Updated by TheSerapher on 2015-03-16 10:26:06 00:00 gunnarbeutner wrote:
Updated masters to latest snapshot and seeing the UPDATE ...; UPDATE ...; queries in the strace now. |
Updated by mfriedrich on 2015-03-19 09:53:28 00:00
|
Updated by mfriedrich on 2015-03-23 13:58:15 00:00
|
Updated by mfriedrich on 2015-04-15 07:19:34 00:00
|
Updated by mfriedrich on 2015-04-15 07:22:21 00:00 We'll re-schedule these issues in favor of tagging 2.3.4 in the next couple of days. |
Updated by gbeutner on 2015-05-13 07:53:10 00:00
|
Updated by mfriedrich on 2015-06-23 13:26:54 00:00
|
Updated by dgoetz on 2015-07-09 09:39:10 00:00 I see the same problem here, especially caused by the custom variables which get dropped and refilled during startup/reload. |
Updated by mfriedrich on 2015-10-28 08:48:56 00:00
|
Updated by mfriedrich on 2015-11-03 09:39:32 00:00
|
Updated by mfrosch on 2015-12-02 18:25:23 00:00 I think it might be reasonable to add several update types to the IDO:
Status should be able to merge updates, e.g. not every update is queued, but only the fact that an update type has to be refreshed - merging multiple updates into one while they are still in the queue. |
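A minimal C++ sketch of that merging idea: status updates are keyed by table and object, so a newer update for the same object replaces the one still sitting in the queue instead of being appended behind it. The class and query strings are invented for the example; history-type entries would still have to be queued individually.

```cpp
#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <utility>

using QueryKey = std::pair<std::string, long>; // table name + object id

class MergingStatusQueue {
public:
    void Enqueue(const std::string& table, long objectId, const std::string& query) {
        QueryKey key(table, objectId);
        auto it = m_Pending.find(key);
        if (it != m_Pending.end()) {
            it->second = query;          // merge: replace the stale queued update
            return;
        }
        m_Order.push_back(key);          // remember the FIFO position only once
        m_Pending[key] = query;
    }

    void Drain() {
        for (const QueryKey& key : m_Order)
            std::cout << m_Pending[key] << "\n"; // placeholder for the real DB call
        m_Order.clear();
        m_Pending.clear();
    }

private:
    std::deque<QueryKey> m_Order;
    std::map<QueryKey, std::string> m_Pending;
};

int main() {
    MergingStatusQueue queue;
    queue.Enqueue("icinga_servicestatus", 42, "UPDATE ... last_check = 100 ...");
    queue.Enqueue("icinga_servicestatus", 43, "UPDATE ... last_check = 101 ...");
    queue.Enqueue("icinga_servicestatus", 42, "UPDATE ... last_check = 102 ..."); // merges with the first
    queue.Drain(); // executes two statements instead of three
    return 0;
}
```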
Updated by Anonymous on 2015-12-10 16:10:03 00:00
Applied in changeset 15ca998. |
Updated by gbeutner on 2015-12-10 16:17:26 00:00
Note that this patch does not merge duplicate queries. This is definitely something we should implement though (i.e., next week). |
Updated by gbeutner on 2015-12-10 16:17:57 00:00
|
Updated by mfriedrich on 2015-12-14 09:22:09 00:00
|
Updated by mfriedrich on 2015-12-15 09:39:52 00:00
|
Updated by mfriedrich on 2015-12-15 10:54:36 00:00
|
Updated by mfriedrich on 2015-12-15 12:00:25 00:00
|
Updated by mfriedrich on 2015-12-15 16:08:59 00:00
|
Updated by mfriedrich on 2015-12-18 09:47:09 00:00
|
Updated by mfriedrich on 2016-02-08 10:02:04 00:00
|
Updated by gbeutner on 2016-02-23 09:58:14 00:00
|
This issue has been migrated from Redmine: https://dev.icinga.com/issues/8714
Created by TheSerapher on 2015-03-12 13:05:24 00:00
Assignee: gbeutner
Status: Resolved (closed on 2015-12-10 16:10:03 00:00)
Target Version: 2.4.2
Last Update: 2016-02-23 09:58:14 00:00 (in Redmine)
The master is not updating the IDO table to set itself as running on time (which may explain why we can't run two masters at the same time):
Interestingly, the Last Status Update time sometimes jumps down, but only by a few seconds, as if the last update was done with a large delay. Running the query confirms that the time is off by about 2.5 minutes:
What could be the cause? Is the master not running the SQL queue fast enough, so updates are inserted with a delay? We have about 24,000 service checks on 1,500 hosts that are executed by 4 checkers. Maybe this causes issues?
I was able to confirm that the database has no significant load or any system issues.
Changesets
2015-12-10 16:06:00 00:00 by (unknown) 15ca998
2015-12-14 09:34:12 00:00 by (unknown) 372cf07
2015-12-15 10:58:50 00:00 by mfriedrich da3d210
2016-02-23 08:09:06 00:00 by (unknown) a40fc65
2016-02-23 08:09:06 00:00 by (unknown) 02184ad
2016-02-23 08:09:06 00:00 by mfriedrich 2bc1d32
Relations: