Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flapping tests and bugs #5720

Closed
dnsmichi opened this issue Nov 5, 2017 · 18 comments · Fixed by #5732
Closed

Flapping tests and bugs #5720

dnsmichi opened this issue Nov 5, 2017 · 18 comments · Fixed by #5732
Labels
area/checks Check execution and results blocker Blocks a release or needs immediate attention bug Something isn't working
Milestone

Comments

@dnsmichi
Copy link
Contributor

dnsmichi commented Nov 5, 2017

Expected Behavior

No notifications are sent if flapping is disabled.

Current Behavior

enable_flapping is false, but Icinga 2 sends Flapping Stop notifications.

This diff does not take into account if flapping detection is enabled.

        /* Flapping start/end notifications */
-       if (send_notification && !was_flapping && is_flapping) {
        if (!in_downtime && !was_flapping && is_flapping) {

The origin of the changed behaviour is coming from

        UpdateFlappingStatus(old_state != cr->GetState());

-       if (GetStateType() == StateTypeHard)
-               UpdateFlappingStatus(stateChange);

where Flapping is updated every time. This would allow the above code change to work, but also here GetEnableFlapping() is missing.

Both code parts should explicitly check for GetEnableFlapping().

Possible Solution

Disable flapping detection algorithm if explicitly disabled.

Furthermore, test if flapping start is properly sent. This doesn't look like this, I don't have any downtimes set.

Steps to Reproduce (for bugs)

  1. Run latest snapshot build
  2. Add a service check which uses random as check command and low *_interval settings
# icinga2 object list --name 'graylog-host'
Object 'graylog-host' of type 'Host':
  % declared in '/etc/icinga2/demo/graylog2-demo.conf', lines 46:1-46:26
  * __name = "graylog-host"
  * action_url = ""
  * address = "127.0.0.1"
    % = modified in '/etc/icinga2/demo/graylog2-demo.conf', lines 47:3-47:23
  * address6 = ""
  * check_command = "hostalive"
    % = modified in '/etc/icinga2/demo/graylog2-demo.conf', lines 48:3-48:29
  * check_interval = 300
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "graylog-host"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
  * name = "graylog-host"
  * notes = ""
  * notes_url = ""
  * package = "_etc"
  * retry_interval = 60
  * source_location
    * first_column = 1
    * first_line = 46
    * last_column = 26
    * last_line = 46
    * path = "/etc/icinga2/demo/graylog2-demo.conf"
  * templates = [ "graylog-host" ]
    % = modified in '/etc/icinga2/demo/graylog2-demo.conf', lines 46:1-46:26
  * type = "Host"
  * vars
    * notification_groups = [ "osmc" ]
      % = modified in '/etc/icinga2/demo/graylog2-demo.conf', lines 56:3-56:39
    * random = [ 1, 2, 3, 4, 5 ]
      % = modified in '/etc/icinga2/demo/graylog2-demo.conf', lines 54:3-54:32
    * streams
      * icinga2
        % = modified in '/etc/icinga2/demo/graylog2-demo.conf', lines 50:3-52:3
        * graylog2_stream_id = "54610d26e4b086382bbfab0f"
  * volatile = false
  * zone = ""

Context

This floods anyone with notifications, especially to notice the web's history tab.

screen shot 2017-11-05 at 17 57 45

Your Environment

  • Version used (icinga2 --version):
# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: v2.7.1-306-g36fddaf)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2 : GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-693.5.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog elastic graphite influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker gelf ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): v2.4.2, ee7fa2c48ca61f1d8d85c84f26e5b5abc5e8b22e
@dnsmichi dnsmichi added bug Something isn't working area/checks Check execution and results blocker Blocks a release or needs immediate attention labels Nov 5, 2017
@dnsmichi dnsmichi added this to the 2.8.0 milestone Nov 5, 2017
@Crunsher
Copy link
Contributor

Crunsher commented Nov 6, 2017

Updating the flapping status with every check makes sense imo. It's not a big operation and makes an object not require 20 check results to compute accurate results.

About Flapping Notifications: I could not find any issue with it. All notifications were sent when they ought to and were not send in Downtimes. Is it possible you ran into the bug caused by endianness?

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 6, 2017

I'm thinking as a user here - if I explicitly disable flapping, I don't expect Icinga to calculate some flapping stats (which is kind of an overhead for each check result result).

About flapping notifications - this seems to be a long running bug (not checking it at all), and we should fix it. I remember seeing that in a different issue, but flapping has explicitly enabled over there.

Previously it was somehow hidden, since send_notification was not set all the time. Right now, each time flapping is detected, OnNotifyFlapping() is called.

Try it yourself with a local random check in a frequent interval. Or you'll do a vagrant up inside the icinga2x-graylog box which runs into this bug.

@Crunsher
Copy link
Contributor

Crunsher commented Nov 6, 2017

The overhead should be negligible and this circumvents problems with re-enabling flapping, otherwise we'd have to clear out flapping values and state changes when flapping is toggled.

I don't receive any Flapping Notifications when flapping is disabled for a Host...

Edit: I see now the ApiEvent is probably being triggered

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 6, 2017

It also triggers DbEvents and flapping history. I guess I was looking at the wrong type in Icinga Web 2's history tab, sorry.

Crunsher added a commit that referenced this issue Nov 7, 2017
@dnsmichi dnsmichi reopened this Nov 7, 2017
@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

The bug I'm seeing relates to #5086 where enable_flapping=true. In my case I do hit it without enabling flapping. I'll investigate further while waiting for the other patch :)

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

object Host "flapping" {
  check_interval = 1s
  retry_interval = 1s

  check_command = "random"
}

object NotificationCommand "flapping" {
  command = [ "/bin/true" ]
}

apply Notification "flapping-notification" to Host {

  command = "flapping"
  users = [ "icingaadmin" ]

  assign where host.name == "flapping"
}

results in this Icinga Web 2 history. This is not about notifications. It causes entries in the flappinghistory DB IDO table.

screen shot 2017-11-07 at 14 19 18

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

I think the problem is IsFlapping() which returns false in case Flapping is disabled (my default).

I cannot confirm it from the logs though (un-commented the debug line).

[2017-11-07 14:32:30  0100] debug/CheckerComponent: Executing check for 'flapping'
[2017-11-07 14:32:30  0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling execute query task, type 2, table 'hoststatus'.
[2017-11-07 14:32:30  0100] debug/DbEvents: add log entry history for 'flapping'
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling execute query task, type 1, table 'logentries'.
[2017-11-07 14:32:30  0100] debug/DbEvents: add flapping history for 'flapping'
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling execute query task, type 1, table 'flappinghistory'.
[2017-11-07 14:32:30  0100] information/Checkable: Flapping: Checkable flapping was: 0 is: 0 threshold low: 25 threshold high: 30% current: 78.7%.
[2017-11-07 14:32:30  0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-11-07 14:32:30  0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2017-11-07 14:32:30  0100] debug/DbEvents: add log entry history for 'flapping'
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling execute query task, type 1, table 'logentries'.
[2017-11-07 14:32:30  0100] debug/DbEvents: add checkable check history for 'flapping'
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling execute query task, type 1, table 'hostchecks'.
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling execute query task, type 3, table 'hoststatus'.
[2017-11-07 14:32:30  0100] debug/DbEvents: add state change history for 'flapping'
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling execute query task, type 1, table 'statehistory'.
[2017-11-07 14:32:30  0100] notice/Checkable: State Change: Checkable 'flapping' soft state change from DOWN to DOWN detected.
[2017-11-07 14:32:30  0100] debug/CheckerComponent: Check finished for object 'flapping'
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Scheduling new transaction and finishing async queries.
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET is_flapping = '0',  percent_state_change = '78.700000' WHERE host_object_id = 543 AND instance_id = 1
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_flappinghistory (endpoint_object_id, event_time, event_time_usec, event_type, flapping_type, high_threshold, instance_id, low_threshold, object_id, percent_state_change, reason_type) VALUES (214, FROM_UNIXTIME(1510061550), '599776', '1001', '0', '30', 1, '25', 543, '78.700000', '1')
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = 'random',  check_source = 'mbmif.int.netways.de',  check_type = '0',  current_check_attempt = '2',  current_notification_number = '0',  current_state = '1',  endpoint_object_id = 214,  event_handler = '',  event_handler_enabled = '1',  execution_time = '0.000243',  flap_detection_enabled = '0',  has_been_checked = '1',  host_object_id = 543,  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = FROM_UNIXTIME(1510061550),  last_hard_state = '0',  last_hard_state_change = FROM_UNIXTIME(1510061547),  last_notification = FROM_UNIXTIME(1510061396),  last_state_change = FROM_UNIXTIME(1510061549),  last_time_down = FROM_UNIXTIME(1510061550),  last_time_up = FROM_UNIXTIME(1510061548),  latency = '0.001383',  long_output = '',  max_check_attempts = '3',  next_check = FROM_UNIXTIME(1510061551),  next_notification = FROM_UNIXTIME(1510063193),  normal_check_interval = '0.016667',  notifications_enabled = '1',  original_attributes = 'null',  output = 'Hello from mbmif.int.netways.de',  passive_checks_enabled = '1',  percent_state_change = '78.700000',  perfdata = 'time=1510061550.599266',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '0.016667',  scheduled_downtime_depth = '0',  should_be_scheduled = '1',  state_type = '0',  status_update_time = FROM_UNIXTIME(1510061550) WHERE host_object_id = 543
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Query: INSERT INTO icinga_statehistory (check_source, current_check_attempt, endpoint_object_id, instance_id, last_hard_state, last_state, long_output, max_check_attempts, object_id, output, state, state_change, state_time, state_time_usec, state_type) VALUES ('mbmif.int.netways.de', '2', 214, 1, '0', '1', '', '3', 543, 'Hello from mbmif.int.netways.de', '1', '1', FROM_UNIXTIME(1510061550), '599344', '0')
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Query: COMMIT
[2017-11-07 14:32:30  0100] debug/IdoMysqlConnection: Query: BEGIN

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

I do have the feeling that either

  • the entries in the IDO database are not flapping stopped (but started)
  • Icinga Web 2 displays the wrong details

I cannot hit the second condition which would call OnNotifyFlapping() in } else if (!in_downtime && was_flapping && !is_flapping) { in lldb.

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

Ok, I see it. This is unrelated from the flapping fixes, but something which sits in there long time.

Icinga Web 2 determines FlappingStart with event_type=1000 and FlappingEnd with event_type=1001.

dbevents.cpp populates that correctly.

        if (checkable->IsFlapping())
                fields1->Set("event_type", 1000);
        else {
                fields1->Set("event_type", 1001);
                fields1->Set("reason_type", 1);
        }

Guess what IsFlapping() returns, if flapping is disabled on the checkable object (or globally).... right, it will always set 1001 which means you get fancy "Flapping Stop" entries, no matter what.

Still, this doesn't explain why NotifyFlapping as event is triggered.

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

OnFlappingChanged() is not called from within checkable-check.cpp.

(lldb)
frame #15: 0x00000001040f688d libicinga.2.7.1.dylib`icinga::ObjectImpl<icinga::Checkable>::NotifyFlapping(this=0x000000010300fe00, cookie=0x0000000101303498) at checkable.tcpp:2316
   2313	{
   2314		ConfigObject *dobj = dynamic_cast<ConfigObject *>(this);
   2315		if (!dobj || dobj->IsActive())
-> 2316			OnFlappingChanged(static_cast<Checkable *>(this), cookie);
   2317	}
   2318
   2319	void ObjectImpl<Checkable>::NotifyEnableActiveChecks(const Value& cookie)

instead, this is an event coming from the class variable registered in checkable.ti for flapping.

Meaning to say, every time you call SetFlapping() this event triggers. Inside IDO database, API event streams, etc.

(lldb) up
frame #16: 0x00000001040f39b6 libicinga.2.7.1.dylib`icinga::ObjectImpl<icinga::Checkable>::SetFlapping(this=0x000000010300fe00, value=false, suppress_events=false, cookie=0x0000000101303498) at checkable.tcpp:1914
   1911	{
   1912		m_Flapping = value;
   1913		if (!suppress_events)
-> 1914			NotifyFlapping(cookie);
   1915	}
   1916
   1917	void ObjectImpl<Checkable>::SetEnableActiveChecks(bool value, bool suppress_events, const Value& cookie)
(lldb) up
frame #17: 0x0000000104128df0 libicinga.2.7.1.dylib`icinga::Checkable::UpdateFlappingStatus(this=0x000000010300fe00, stateChange=false) at checkable-flapping.cpp:56
   53
   54  		SetFlappingBuffer((stateChangeBuf.to_ulong() | (oldestIndex << 20)));
   55  		SetFlappingCurrent(flappingValue);
-> 56  		SetFlapping(flapping);
   57  	}
   58
   59  	bool Checkable::IsFlapping(void) const
(lldb) up
frame #18: 0x00000001043b856a libicinga.2.7.1.dylib`icinga::Checkable::ProcessCheckResult(this=0x000000010300fe00, cr=0x0000700007d6f050, origin=0x0000700007d6e078) at checkable-check.cpp:320
   317
   318 		bool was_flapping = IsFlapping();
   319
-> 320 		UpdateFlappingStatus(old_state != cr->GetState());
   321
   322 		bool is_flapping = IsFlapping();
   323

Which makes me believe - if I do not have flapping detection enabled, those events must not be triggered, nor flapping calculation.

@Crunsher
Copy link
Contributor

Crunsher commented Nov 7, 2017

You are currently looking at an old state. The fix was reset and will follow when I get around to doing the documentation. This branch is out of date.

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

Note: I'm using git master without any fixes from yours to analyze the problem. I'll wait for your branch and PR then. I do have some slight log message patches in my stash.

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 7, 2017

Thought about on my way home - the problem is the triggered signal OnFlappingChanged (and NotifyFlapping). This fits perfectly for ido updates, cluster events and api streams.

It has two problems

My proposal - use a different signal for checkable, and let ido/cluster/apisubscribe to that, similar to all other signals we have.

Crunsher added a commit that referenced this issue Nov 8, 2017
@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 8, 2017

Review with @gunnarbeutner

  • UpdateFlappingStatus() must be called every time to ensure that flapping detection works when enabled later on.
  • SetFlapping() now suppresses events which fixes the IDO flappinghistory problem

TODOs:

  • SupressEvents() should be removed from header
  • log messages in ProcessCheckResult()

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 8, 2017

All set.

@Crunsher
Copy link
Contributor

Crunsher commented Nov 8, 2017

Anything left to do, or can we merge this?

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 8, 2017

There's a note on the comment in the PR which is unclear.

@dnsmichi
Copy link
Contributor Author

dnsmichi commented Nov 8, 2017

Discussed it offline, a misunderstanding with private/public. Everything's fine.

Crunsher added a commit that referenced this issue Nov 8, 2017
Fix flapping calculation and events

fixes #5720
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks Check execution and results blocker Blocks a release or needs immediate attention bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants