#792 closed defect (fixed)

Socket read timeout confuses munin-update

Reported by: gombasg
Owned by: nobody
Priority: normal
Milestone: Munin 2.0.0
Component: master
Version: 1.4.1
Severity: normal
Keywords:



Sometimes configuring or fetching a plugin takes a long time, and that seems to confuse munin-update.

Background: querying LVM information can take a long time, so I have set the timeout to 60 seconds in munin-node.conf. That works, since munin-node does not report any timeouts. Both the client and the master run munin 1.4.1-2 from Debian Sid.
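For context, the change amounts to a single directive in munin-node.conf; a minimal sketch (the host/port lines are just typical defaults, shown only for orientation):

  # /etc/munin/munin-node.conf
  host *
  port 4949
  # give slow plugins (such as the LVM ones) up to 60 seconds to run
  timeout 60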

Sometimes I get these messages in munin-update.log:

[WARNING] Socket read timed out to foo.bar: alarm
[WARNING] 12 lines had errors while 0 lines were correct in data from 'fetch lvm_systemvg' on foo.bar/foo.bar:4949
[WARNING] Socket read timed out to foo.bar: alarm
[WARNING] 364 lines had errors while 0 lines were correct in data from 'fetch if_err_bond0_101' on foo.bar/foo.bar:4949
[WARNING] Service if_err_bond0_101 on foo.bar/foo.bar:4949 returned no data for label var
[... many similar messages ...]

The problem is that the label var belongs to the configuration of the lvm_systemvg plugin, not to if_err_bond0_101.

I captured the communication between munin-node and munin-update, and the interesting part looks like this ('#' marks my comments):

-> config lvm_systemvg                       # this seems to have caused the first "Socket read timed out"
-> fetch lvm_systemvg
<- graph_title Logical Volume usage          # oops, this is the reply to "config lvm_systemvg", not "fetch lvm_systemvg"
<- graph_args --base 1000 -l 0
<- graph_category disk
<- graph_info This graph shows disk usage on the machine.
<- free.label free
<- free.draw AREA
<- tmp.label tmp
<- tmp.draw STACK
<- usr.label usr
<- usr.draw STACK
<- var.label var
<- var.draw STACK
<- .
-> config lvm_datavg                         # another "Socket read timed out"
-> fetch lvm_datavg
-> config if_err_bond0.101
<- free.value 233001975808                   # this is the reply to "fetch lvm_systemvg"
<- tmp.value 998244352
<- usr.value 1996488704
<- var.value 3997171712
<- .
-> fetch if_err_bond0.101
<- graph_title Logical Volume usage          # ... and the reply to "config lvm_datavg"
<- graph_args --base 1000 -l 0
<- graph_category disk
<- graph_info This graph shows disk usage on the machine.
<- free.label free
<- free.draw AREA

That is, munin-update appears to time out and move on to query the next plugin, while munin-node knows nothing about the timeout and happily sends the reply to the previous request.
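To make the failure mode concrete, here is a minimal, self-contained Perl sketch of an alarm-based line read against the node. This is not Munin's actual code; the host name, timeout value and helper name are placeholders, but the "alarm" string matches the message in munin-update.log:

  use strict;
  use warnings;
  use IO::Socket::INET;

  # Hypothetical connection to the node; host/port taken from the log above.
  my $sock = IO::Socket::INET->new(
      PeerAddr => 'foo.bar',
      PeerPort => 4949,
      Proto    => 'tcp',
  ) or die "connect: $!";
  <$sock>;    # discard the node's greeting banner

  # Read one line, giving up after $seconds; on a timeout the node's reply
  # stays queued on the socket instead of being consumed.
  sub read_line_with_timeout {
      my ($sock, $seconds) = @_;
      my $line;
      eval {
          local $SIG{ALRM} = sub { die "alarm\n" };  # same "alarm" as in the log
          alarm($seconds);
          $line = <$sock>;
          alarm(0);
      };
      alarm(0);
      return undef if $@;    # timed out
      return $line;
  }

  print {$sock} "config lvm_systemvg\n";
  my $line = read_line_with_timeout($sock, 5);
  # If this timed out and we now send "fetch lvm_systemvg", the following
  # reads will return the *config* output that the node sends once the
  # plugin finally finishes -- exactly the desynchronization captured above.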

IMHO this is the reason why, on the next munin-update run, I get errors like:

[ERROR] Error occured in under [] in the configuration.
[Warning] Could not parse datafile /var/lib/munin/datafile: [ERROR] Error occured in under [] in the configuration.  Please
 refer to the log if details are missing here.

IMHO this is also the reason why graphs sometimes get mixed up (e.g. the graph of one of the LVM plugins is shown in place of the weekly graph of one of the network interface plugins).

A possible fix would be for munin-update, after a "Socket read timed out", to somehow tell munin-node to reset the communication state; munin-update should then send no further commands and ignore any data it receives until munin-node acknowledges the reset. That would ensure that both parties are in sync again. Unfortunately I do not know the protocol in detail, so I am not sure how this could be implemented. If such a resynchronization cannot be done, then simply closing the connection would still be better than processing and storing invalid definitions or data that were originally produced by a different plugin.

Change History (5)

comment:1 Changed at 2009-12-10T07:29:06+01:00 by gombasg

FYI, I'm seeing the same issue on a second master host now as well. I tried setting io_timeout to 60 in Munin::Master::Node::new() and that seems to help. However, this change will be blown away by package upgrades, so it would be nice to be able to set the socket read timeout from munin.conf.

comment:2 Changed at 2009-12-10T08:09:06+01:00 by gombasg

Ok, forget the above. Raising io_timeout does not solve the problem; it just makes it occur less frequently.

comment:3 Changed at 2009-12-10T12:39:17+01:00 by janl

AHA. Thanks for your analysis. The problem has not been seen among my testers or me for a long while now, and we have been unable to reproduce it.

comment:4 Changed at 2009-12-10T12:42:26+01:00 by janl

An I/O read timeout in munin-update should result in a disconnect and perhaps a reconnect. This means that we should use a rather high timeout.
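To illustrate that idea, a minimal sketch of disconnect-on-timeout, reusing the hypothetical read_line_with_timeout() helper from the sketch further up; this is illustrative only, not the actual change that went into r3214:

  # Sketch: on a read timeout, drop the connection so that no stale reply
  # can be mistaken for the answer to a later command; the caller may then
  # reconnect and retry.
  sub fetch_service {
      my ($host, $port, $service, $timeout) = @_;
      my $sock = IO::Socket::INET->new(
          PeerAddr => $host, PeerPort => $port, Proto => 'tcp',
      ) or return;
      <$sock>;                              # discard the node's greeting banner
      print {$sock} "fetch $service\n";
      my @lines;
      while (1) {
          my $line = read_line_with_timeout($sock, $timeout);
          if (!defined $line) {             # timed out: this connection is now suspect
              close $sock;                  # drop it so the next command starts clean
              return;                       # discard the partial reply
          }
          last if $line =~ /^\.\s*$/;       # a lone "." terminates the reply
          push @lines, $line;
      }
      close $sock;
      return @lines;
  }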

comment:5 Changed at 2009-12-10T14:55:14+01:00 by janl

  • Resolution set to fixed
  • Status changed from new to closed

This should be fixed in r3214. Will be in 1.4.2. Many thanks for your excellent report!

