#766 new defect

When SNMP plugins disappear the hosts disappear from the html pages.

Reported by: janl
Owned by: janl
Priority: highest
Milestone: Munin 3.0.0
Component: master
Version: 1.4.0
Severity: major
Keywords:
Cc:

Description

If the SNMP plugins disappear from a node, the hosts also disappear from the html pages. This needs to be handled as an error so that the old data from the "datafile" is reused, the same way as when a whole node goes down.

It is too late to fix this problem in 1.4.0 for stability reasons, but I think 1.4.1 is just right.
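
To make the requested behaviour concrete, here is a minimal sketch in Python (not Munin's Perl, and not the actual master code). It assumes the master holds a per-host service config loaded from the previous run's "datafile" and compares it with what the node reports now. The names `merge_host_config`, `fresh`, and `old` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical shape: service name -> service attributes.
HostConfig = Dict[str, dict]


def merge_host_config(
    host: str,
    fresh: Optional[HostConfig],
    old: Dict[str, HostConfig],
) -> Optional[HostConfig]:
    """Return the config that should be rendered for `host`.

    fresh: what the node reported in this run; None or {} models a node
           that is up but no longer lists any plugins for the host.
    old:   per-host configs from the previous run's datafile (assumed).
    """
    if fresh:
        # The node reported at least one service: use the new data.
        return fresh
    # No services reported: treat it like a failed update and reuse the
    # old data, so the host stays on the html pages. Returns None only
    # when there is no old data for the host either.
    return old.get(host)
```

With this rule a host only drops out of the page hierarchy when it has never been seen before, which is the same outcome the description asks for when a whole node goes down.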

Change History (8)

comment:1 Changed at 2009-11-26T14:21:30+01:00 by janl

  • Component changed from plugins to master

comment:2 Changed at 2009-11-30T23:53:23+01:00 by janl

  • Version changed from 1.4.0-beta to 1.4.0

comment:3 Changed at 2009-12-04T13:25:49+01:00 by janl

This involves changes in error handling so it's deemed a bit complex. Hold off for a bit.

comment:4 Changed at 2010-03-08T13:16:01+01:00 by janl

  • Milestone changed from Munin 1.4.4 to Munin 1.4.5

comment:5 Changed at 2010-03-12T09:18:51+01:00 by janl

The failure mode this ticket refers to is primarily where the node is up but denies all knowledge of the host.

There seems to be a secondary failure mode, possibly a cascade reaction after a protocol timeout. The host alvine ended up gone from the html page after going deep into swap and thrashing for an hour. It started like this:

2010/03/12 07:00:10 Missing required attribute 'label' for data source 'ordertotal' in service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949
2010/03/12 07:00:10 [WARNING] Service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label ordertotal
2010/03/12 07:00:11 [WARNING] Service hddtemp_smartctl on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label hda
2010/03/12 07:00:12 [WARNING] Service yum on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label pending
2010/03/12 07:00:12 [INFO] Reaping Munin::Master::UpdateWorker<nifs;alvine.api.kunder.linpro.no>.  Exit value/signal: 0/0
2010/03/12 07:05:32 [WARNING] Call to accept timed out.  Remaining workers: prod;ektorp.api.kunder.linpro.no, nifs;alvine.api.kunder.linpro.no, prod;fixhult.api.kunder.linpro.no
2010/03/12 07:05:44 Missing required attribute 'label' for data source 'ordertotal' in service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949
2010/03/12 07:05:44 [WARNING] Service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label ordertotal
2010/03/12 07:05:46 [WARNING] Service hddtemp_smartctl on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label hda
2010/03/12 07:05:47 [WARNING] Service yum on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label pending
2010/03/12 07:05:47 [INFO] Reaping Munin::Master::UpdateWorker<nifs;alvine.api.kunder.linpro.no>.  Exit value/signal: 0/0
2010/03/12 07:10:30 Missing required attribute 'label' for data source 'ordertotal' in service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949
2010/03/12 07:10:30 [WARNING] Service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label ordertotal
2010/03/12 07:10:31 [WARNING] Service hddtemp_smartctl on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label hda
2010/03/12 07:10:32 [WARNING] Call to accept timed out.  Remaining workers: prod;ektorp.api.kunder.linpro.no, nifs;alvine.api.kunder.linpro.no, prod;fixhult.api.kunder.linpro.no
2010/03/12 07:10:33 [WARNING] Service yum on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label pending
2010/03/12 07:10:34 [INFO] Reaping Munin::Master::UpdateWorker<nifs;alvine.api.kunder.linpro.no>.  Exit value/signal: 0/0
2010/03/12 07:15:31 [WARNING] Call to accept timed out.  Remaining workers: prod;ektorp.api.kunder.linpro.no, nifs;alvine.api.kunder.linpro.no, prod;fixhult.api.kunder.linpro.no
2010/03/12 07:15:34 Missing required attribute 'label' for data source 'ordertotal' in service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949
2010/03/12 07:15:35 [WARNING] Service memcached_size on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label ordertotal
2010/03/12 07:15:35 [WARNING] Service hddtemp_smartctl on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label hda
2010/03/12 07:15:36 [WARNING] Service yum on alvine.api.kunder.linpro.no/alvine.api.kunder.linpro.no:4949 returned no data for label pending
2010/03/12 07:15:41 [INFO] Reaping Munin::Master::UpdateWorker<nifs;alvine.api.kunder.linpro.no>.  Exit value/signal: 0/0
2010/03/12 07:20:32 [WARNING] Call to accept timed out.  Remaining workers: prod;ektorp.api.kunder.linpro.no, nifs;alvine.api.kunder.linpro.no, prod;fixhult.api.kunder.linpro.no
2010/03/12 07:20:51 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:21:01 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:21:11 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:21:21 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:21:31 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:21:41 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:21:51 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:22:01 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:22:02 [FATAL] Socket read timed out to alvine.api.kunder.linpro.no.  Terminating process. at /usr/lib/perl5/vendor_perl/5.8.8/Munin/Master/UpdateWorker.pm line 139
2010/03/12 07:22:02 [ERROR] Munin::Master::UpdateWorker<nifs;alvine.api.kunder.linpro.no> died with '[FATAL] Socket read timed out to alvine.api.kunder.linpro.no.  Terminating process. at /usr/lib/perl5/vendor_perl/5.8.8/Munin/Master/UpdateWorker.pm line 139
2010/03/12 07:22:11 [WARNING] Call to accept timed out.  Remaining workers: nifs;alvine.api.kunder.linpro.no
2010/03/12 07:22:11 [INFO] Reaping Munin::Master::UpdateWorker<nifs;alvine.api.kunder.linpro.no>.  Exit value/signal: 18/0
2010/03/12 07:22:11 [INFO] No old data available for failed worker nifs;alvine.api.kunder.linpro.no.  This node will disappear from the html web page hierarchy
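
For anyone trying to spot this failure in their own logs, the following is a rough Python sketch that scans update-log lines for the final message in the excerpt above. The pattern is derived only from the wording shown in this ticket; other Munin versions may phrase the message differently, and the function name `find_dropped_workers` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Pattern based on the log line quoted in this ticket (an assumption for
# any other Munin version): timestamp, [INFO], then the worker name
# followed by a period and the "This node will disappear" text.
DROPPED = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[INFO\] "
    r"No old data available for failed worker (\S+?)\.\s+"
    r"This node will disappear"
)


def find_dropped_workers(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, worker) pairs for workers dropped from the html pages."""
    hits = []
    for line in lines:
        match = DROPPED.match(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits
```

Passing an open file object for the master's update log to `find_dropped_workers` should list every worker that hit this condition, along with the time it happened.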

comment:6 Changed at 2010-03-12T11:38:45+01:00 by nicholas

Here's the graph we wanted to see while debugging alvine ;-)
http://nicholas.users.linpro.no/tmp/alvine-memory-day.png

comment:7 Changed at 2011-12-06T00:22:11+01:00 by chosenwonton

Has this issue ever been resolved? The same exact thing happens on Munin 1.4.5:

2011/12/05 09:35:48 [INFO] No old data available for failed worker <worker name>. This node will disappear from the html web page hierarchy

Servers that are down should still display stale data, not disappear. I see no link to a fix for this issue on this ticket. Can you let me know if there is one, or whether this behavior can be changed?

Thanks,

Vaughn

comment:8 Changed at 2012-02-02T16:24:55+01:00 by snide

  • Milestone changed from Munin 1.4.7 to Munin 3.0

This issue is quite painful to solve without invasive changes. Moving it to 3.0.

Sorry.
