#732 closed defect (fixed)

munin-limits hangs when more than one contact is used

Reported by: feiner.tom
Owned by: jo
Priority: normal
Milestone: Munin 1.4.7
Component: deb package
Version: 1.2.6
Severity: normal
Keywords:
Cc:

Description

Forwarded from: http://bugs.debian.org/553528

Package: munin
Version: 1.2.6-10~lenny1
Severity: normal

Hi!
If Munin is configured with more than one contact, for example:

contact.foo.command mail -s "Meh" foo@example.com
contact.bar.command mail -s "Meh" bar@example.com

then when munin-limits has to send messages it hangs.

This is a hung run of munin-limits with the above contacts:

28350  \_ /bin/sh /usr/bin/munin-cron
28636      \_ /usr/bin/perl /usr/share/munin/munin-limits
28637          \_ mail -s Meh bar@example.com
28638          |   \_ /usr/bin/perl /usr/share/munin/munin-limits
28639          |   \_ /usr/bin/perl /usr/share/munin/munin-limits
28640          \_ mail -s Meh foo@example.com
28641              \_ /usr/bin/perl /usr/share/munin/munin-limits
28642              \_ /usr/bin/perl /usr/share/munin/munin-limits

If you inspect the opened file descriptors you will find:

/proc/28636/fd/ # munin-limits, child of munin-cron
0 -> pipe:[40611921]
1 -> pipe:[40611922]
2 -> /var/log/munin/munin-limits.log
5 -> pipe:[40612630]

/proc/28637/fd/ # mail bar@example.com
0 -> pipe:[40612627]
1 -> pipe:[40612628]
2 -> pipe:[40612629]
3 -> /tmp/mail.RsXXXXnAmEOR (deleted)

/proc/28638/fd/ # mail bar@example.com stdout child
0 -> pipe:[40612628]
1 -> pipe:[40611922]
2 -> /var/log/munin/munin-limits.log
3 -> /var/log/munin/munin-limits.log

/proc/28639/fd/ # mail bar@example.com stderr child
0 -> pipe:[40612629]
1 -> pipe:[40611922]
2 -> /var/log/munin/munin-limits.log
3 -> /var/log/munin/munin-limits.log
4 -> pipe:[40612628]

/proc/28640/fd/ # mail foo@example.com
0 -> pipe:[40612630]
1 -> pipe:[40612631]
2 -> pipe:[40612632]
3 -> /tmp/mail.RsXXXX6laUNR (deleted)

/proc/28641/fd/ # mail foo@example.com stdout child
0 -> pipe:[40612631]
1 -> pipe:[40611922]
2 -> /var/log/munin/munin-limits.log
3 -> /var/log/munin/munin-limits.log
4 -> pipe:[40612627]

/proc/28642/fd/ # mail foo@example.com stderr child
0 -> pipe:[40612632]
1 -> pipe:[40611922]
2 -> /var/log/munin/munin-limits.log
3 -> /var/log/munin/munin-limits.log
4 -> pipe:[40612627]
5 -> pipe:[40612631]

Why do the children of "mail foo@…" hold the stdin of "mail bar@…" on
descriptor 4? It seems to me that children created with "open()" don't
close the file descriptors they inherit from their parent.
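
For illustration only (Python, not Munin's Perl): a reader on a pipe sees end-of-file only once every copy of the write end is closed, so a descriptor left open in an unrelated child can keep a command such as "mail" waiting on its stdin. In this sketch os.dup() stands in for the copy a forked child would inherit, and the names eof_seen and demo are hypothetical.

```python
import os

def eof_seen(read_fd):
    """Return True if a non-blocking read reports end-of-file on the pipe."""
    os.set_blocking(read_fd, False)
    try:
        return os.read(read_fd, 1) == b""  # b"" means every write end is closed
    except BlockingIOError:
        return False                       # pipe is empty but still open for writing

def demo():
    read_fd, write_fd = os.pipe()
    # Assumption: os.dup() models the copy a forked child inherits from its parent.
    inherited_copy = os.dup(write_fd)

    os.close(write_fd)              # the "parent" closes its own write end
    before = eof_seen(read_fd)      # the copy is still open, so no EOF yet

    os.close(inherited_copy)        # the "child" finally closes its copy
    after = eof_seen(read_fd)       # now the reader sees EOF

    os.close(read_fd)
    return before, after
```

A reader that blocks instead of polling would simply wait until the last copy is closed, which is consistent with the hang described above.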

Why is the logfile opened more than once in some processes, and why is
stdout left open even though they log to a file?

This is slightly unrelated, but why are three processes needed for each
contact?

Thank you,

Luca

Change History (10)

comment:1 Changed at 2010-02-22T17:01:30+01:00 by schamane

On March 16, 2009, I offered the following fix on munin-users:

--- munin-limits.o      2009-03-16 09:15:50.000000000 +0100
+++ /usr/share/munin/munin-limits       2009-03-16 09:28:25.000000000 +0100
@@ -486,6 +486,9 @@
            logger ("sending message: \"$txt\"") if ($DEBUG);
            print $pipe $txt, "\n" if (defined $pipe);
            $config->{'contact'}->{$c}->{'num_messages'}++;
+               # added by (as) 2009-03-15 00:56
+               close ($pipe) if (defined $pipe);
+               logger ("hack by (as): pipe closed") if ($DEBUG);
        }
     }
 }

I have used this without problems ever since.

comment:2 Changed at 2010-05-16T17:56:20+02:00 by snide

  • Milestone set to Munin 1.5
  • Resolution set to fixed
  • Status changed from new to closed

Applied in trunk/ in r3598.

holger: do you want to apply it to 1.4 (and possibly 1.2)?

comment:3 Changed at 2010-12-29T13:22:02+01:00 by janl

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:4 Changed at 2010-12-29T13:22:31+01:00 by janl

  • Milestone changed from Munin 2.0 to Munin 1.4
  • Owner changed from holger to jo
  • Status changed from reopened to new

comment:5 Changed at 2010-12-29T13:23:25+01:00 by janl

Jimmy may be preparing a new 1.4 release. Assigning to him.

comment:6 Changed at 2011-01-04T15:48:03+01:00 by jo

  • Resolution set to fixed
  • Status changed from new to closed

The fix from trunk has now been backported to the 1.4 branch (r4073), and will be in the next 1.4 release.

comment:8 Changed at 2011-03-18T04:37:26+01:00 by snide

  • Milestone changed from Munin 1.4 to Munin 1.4.6

@mfoster: done

comment:9 Changed at 2011-09-07T21:29:00+02:00 by bldewolf

  • Resolution fixed deleted
  • Status changed from closed to reopened

Unfortunately, the change committed to resolve this bug causes more issues than it resolves. I assume the problem doesn't show up when max_messages is 1, but otherwise the log fills with:

2011/09/07 12:14:52 [PERL WARNING] print() on closed filehandle $pipe at /usr/share/perl5/vendor_perl/Munin/Master/LimitsOld.pm line 708.

N-1 times, where N is the number of alerts. This means that, in 1.4.6, all users receiving alerts will now only receive the first alert and the others will be lost.
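
The warning is what one would expect when a pipe is closed after the first message while later messages are still written to it. A minimal Python sketch of that pattern follows; it is not Munin's Perl code, "cat" stands in for the contact command, and send_alerts is a hypothetical helper. Note one difference: Perl only warns on a print to a closed filehandle, whereas Python raises ValueError, which the sketch counts as a lost message.

```python
import subprocess

def send_alerts(messages, close_inside_loop):
    """Pipe each message to one long-lived command; return (delivered, failed)."""
    # Assumption: "cat" stands in for the contact command; its output is discarded.
    proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, text=True)
    delivered, failed = 0, 0
    for msg in messages:
        try:
            proc.stdin.write(msg + "\n")
            delivered += 1
        except ValueError:              # write to an already-closed pipe
            failed += 1
        if close_inside_loop and not proc.stdin.closed:
            proc.stdin.close()          # mirrors closing the pipe after each message
    if not proc.stdin.closed:
        proc.stdin.close()              # close once, after the last message
    proc.wait()
    return delivered, failed
```

With three messages, closing inside the loop delivers one and loses two (N-1), while closing once after the loop delivers all three.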

Also, to respond to a question in the original ticket: Munin opens a handful of processes per contact to catch and log stdout/stderr. It doesn't seem particularly effective or necessary, but that's why. It's also why the processes have intermingled pipes between them.

Anyway, I've set up a test environment with two mail commands and I can't get it to trip up the same way. The one caveat is that munin-limits makes no attempt to put time limits on the commands it opens. So if your mail commands are stalling for some reason, then munin-limits is going to stall until they come back.
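
As a sketch of what a time limit on a contact command could look like (Python for illustration, not what munin-limits does; run_contact_command and its parameters are hypothetical names):

```python
import subprocess

def run_contact_command(argv, message, timeout_seconds):
    """Feed message to argv on stdin; return its exit code, or None on timeout."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL, text=True)
    try:
        proc.communicate(input=message, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()            # give up on a stalled command
        proc.communicate()     # reap the killed process
        return None
    return proc.returncode
```

A caller could treat None as "command stalled", log it, and move on to the next contact instead of waiting indefinitely.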

I'd love to get more information on how to reproduce this properly so I can fix it, because the current change leaves 1.4.6 in a really broken state.

comment:10 Changed at 2011-09-29T02:50:36+02:00 by bldewolf

  • Resolution set to fixed
  • Status changed from reopened to closed

I was finally able to reproduce this in a test environment: some interaction between open with the "" mode and having lots of open processes was causing deadlocks. Since trunk had already moved towards removing the extra processes, I pushed it further with r4400 in both trunk and 1.4-stable, so that munin-limits won't trip over itself while dealing with contact commands and also won't get stuck waiting for them when finishing up. Hopefully this puts the issues with calling out to commands in munin-limits to rest for good (it did in my test environment, anyway!).
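
One general way to capture a command's output without helper processes is to point its stdout and stderr straight at the log file. The Python sketch below illustrates that idea only; it is not necessarily what r4400 does, and run_with_log is a hypothetical name.

```python
import subprocess

def run_with_log(argv, message, log_path):
    """Run argv with message on stdin; append its stdout and stderr to log_path."""
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                                stdout=log,                  # no reader process needed
                                stderr=subprocess.STDOUT,    # stderr goes to the same log
                                close_fds=True)              # do not leak unrelated descriptors
        proc.communicate(input=message.encode())             # write, close stdin, wait
    return proc.returncode
```

Because the child writes to the log directly, there is one process per contact and no pipes shared between contacts.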