Opened 5 years ago

Closed 17 months ago

#846 closed defect (wontfix)

Fetching plugin data or config via telnet yields no output

Reported by: holle Owned by: nobody
Priority: normal Milestone: Munin 2.0.13
Component: plugins Version: 1.4.3
Severity: blocker Keywords: plugins network server
Cc:

Description

We upgraded from Munin 1.2.5/6 to Munin 1.4.3 and used the Packman RPMs available under http://packman.links2linux.de/package/munin to rebuild a RPM for SLES10.

Upgrading went fine but neither the master's munin-update ran (I will report this bug separately) nor the clients delivered any output to the munin-master.

I debugged for about 3 hours on a "clean machine" now and can definitely say the following:

  • All paths are correct
  • Userid:groupid gets changed to nobody:munin
  • All configured plugins are found and listed in the logfile when munin-node is ran with --debug
  • Running munin-run <plugin> yields the expected result
  • Telnetting either from localhost or from another (allowed) host and calling fetch <plugin> yields nothing except a "."
  • Enabling the print statements in Munin::Node::OS::read_from_child show that no bytes are read when the plugin is run in "TCP/IP" mode
  • I tried 1.4.3 as well as trunk-r3327, both "behave" equally

e.g. cpu:

startup

Configuring cpu
Atempting to read from plugins stderr
Read 58 bytes from plugin stderr
Atempting to read from plugins stderr
Read 25 bytes from plugin stderr
Atempting to read from plugins stderr
Read 47 bytes from plugin stderr
Atempting to read from plugins stdout
Read 165 bytes from plugin stdout
Atempting to read from plugins stdout
Read 93 bytes from plugin stdout
Atempting to read from plugins stdout
Read 147 bytes from plugin stdout
Atempting to read from plugins stdout
Read 116 bytes from plugin stdout
Atempting to read from plugins stdout
Read 44 bytes from plugin stdout
Atempting to read from plugins stdout
Read 130 bytes from plugin stdout
Atempting to read from plugins stdout
Read 207 bytes from plugin stdout
Atempting to read from plugins stdout
Read 193 bytes from plugin stdout
Atempting to read from plugins stdout
Read 167 bytes from plugin stdout
Atempting to read from plugins stdout
Read 0 bytes from plugin stdout
2010/01/25-15:20:12 [541] Error output from cpu:
2010/01/25-15:20:12 [541]       # Set /rgid/ruid/egid/euid/ to /109/65534/109 109 /65534/
2010/01/25-15:20:12 [541]       # Setting up environment
2010/01/25-15:20:12 [541]       # About to run '/etc/munin/plugins/cpu config'
        Adding to node cat-serv-vm4.catworkx.de

startup

---

telnet

2010/01/25-15:21:56 [1039] DEBUG: < config cpu
2010/01/25-15:21:56 [1039] DEBUG: Running command "config cpu^M".
Atempting to read from plugins stderr
Read 58 bytes from plugin stderr
Atempting to read from plugins stderr
Read 25 bytes from plugin stderr
Atempting to read from plugins stderr
Read 47 bytes from plugin stderr
Atempting to read from plugins stdout
Read 0 bytes from plugin stdout
2010/01/25-15:21:57 [1039] Error output from cpu:
2010/01/25-15:21:57 [1039]      # Set /rgid/ruid/egid/euid/ to /109/65534/109 109 /65534/
2010/01/25-15:21:57 [1039]      # Setting up environment
2010/01/25-15:21:57 [1039]      # About to run '/etc/munin/plugins/cpu config'
2010/01/25-15:21:57 [1039] DEBUG: > .

telnet

---

munin-run

# munin-run cpu config
graph_title CPU usage
graph_order system user nice idle iowait irq softirq
graph_args --base 1000 -r --lower-limit 0 --upper-limit 200
graph_vlabel %
graph_scale no
graph_info This graph shows how CPU time is spent.
graph_category system
graph_period second
system.label system
system.draw AREA
system.min 0
system.type DERIVE
system.info CPU time spent by the kernel in system activities
user.label user
user.draw STACK
user.min 0
user.type DERIVE
user.info CPU time spent by normal programs and daemons
nice.label nice
nice.draw STACK
nice.min 0
nice.type DERIVE
nice.info CPU time spent by nice(1)d programs
idle.label idle
idle.draw STACK
idle.min 0
idle.type DERIVE
idle.info Idle CPU time
iowait.label iowait
iowait.draw STACK
iowait.min 0
iowait.type DERIVE
iowait.info CPU time spent waiting for I/O operations to finish when there is nothing else to do.
irq.label irq
irq.draw STACK
irq.min 0
irq.type DERIVE
irq.info CPU time spent handling interrupts
softirq.label softirq
softirq.draw STACK
softirq.min 0
softirq.type DERIVE
softirq.info CPU time spent handling "batched" interrupts
steal.label steal
steal.draw STACK
steal.min 0
steal.type DERIVE
steal.info The time that a virtual CPU had runnable tasks, but the virtual CPU itself was not running

munin-run

Change History (8)

comment:1 Changed 5 years ago by holle

Rlated issue regarding the munin-master componenent as mentioned in the report: #847

comment:2 follow-up: Changed 5 years ago by holle

Ok, I seem to be closing in into the problem: I recompiled the RPM (using the exact same sources as on the SLE10) un my openSUSE 11.1 box, and everything worked.

So, the million bucks prize question is: what is different from a Perl 5.8.8 (on SLE10) to a Perl 5.10.0 (on a openSUSE 11.x/SLE11) that _this_ above behaviour appears.
And: can it be fixed apart from upgrading the whole perl (we might install a /usr/local or /opt perl 5.10.x)

comment:3 in reply to: ↑ 2 Changed 5 years ago by dkrotil

I have exact same problem on cca 50 servers with no option to upgrade Perl, these are SLES10 SP2/ OES2 SP1 servers. Upgrading to SLES10 SP3 change nothing Perl is same 5.8.8.

David

comment:4 Changed 4 years ago by tbiel

I analyzed this a bit further, as it was blocking the rollout of Munin 1.4.x on our infrastructure.
It seems that SLES10's perl swallows (or misassigns) STDOUT when exec()ing an external program - something
I haven't been able to verify via Google or Novell's Knowledge Base.

Using some print statements in Munin/Node?/OS.pm and Munin/Node/Service?.pm and using
Data::Dumper to analyze the result of the structure returned from fork_service I get the following
output in Debug mode:

2010/04/26-14:00:53 [30416] DEBUG: > # munin node at localhost
2010/04/26-14:00:55 [30416] DEBUG: < fetch apache_accesses
2010/04/26-14:00:55 [30416] DEBUG: Running command "fetch apache_accesses^M".
Atempting to read from plugins stderr
Read 200 bytes from plugin stderr
Atempting to read from plugins stdout
Read 0 bytes from plugin stdout
2010/04/26-14:00:55 [30416] DEBUG: In _run_service: res = $VAR1 = {\n          '
timed_out' => '',\n          'stderr' => [\n                        '# Set /rgid
/ruid/egid/euid/ to /10001/10001/10001 10001 /10001/',\n                        
'# Setting up environment',\n                        '# Environment url = http:/
/127.0.0.1:%d/server-status?auto',\n                        '# About to run \'/e
tc/munin/plugins/apache_accesses\''\n                      ],\n          'retval
' => '0',\n          'stdout' => []\n        };
2010/04/26-14:00:55 [30416] Error output from apache_accesses:
2010/04/26-14:00:55 [30416]     # Set /rgid/ruid/egid/euid/ to /10001/10001/1000
1 10001 /10001/
2010/04/26-14:00:55 [30416]     # Setting up environment
2010/04/26-14:00:55 [30416]     # Environment url = http://127.0.0.1:%d/server-s
tatus?auto
2010/04/26-14:00:55 [30416]     # About to run '/etc/munin/plugins/apache_access
es'
2010/04/26-14:00:55 [30416] DEBUG: In _print_service: started
2010/04/26-14:00:55 [30416] DEBUG: > .
2010/04/26-14:00:55 [30416] DEBUG: In _print_service: finished
2010/04/26-14:01:05 [30416] Connection timed out

Obviously, the exec() writes to STDERR, but nothing to STDOUT.

A quick and dirty solution is to not use exec(), which can be done with the following patch:

--- node/lib/Munin/Node/Service.pm.orig 2010-04-26 12:33:56.217889267 +0000
+++ node/lib/Munin/Node/Service.pm      2010-04-26 14:07:09.951259792 +0000
@@ -212,7 +212,14 @@
     print STDERR "# About to run '", join (' ', @command), "'\n" 
        if $config->{DEBUG};
 
-    exec @command;
+    # work around exec bug in SLES10's Perl
+    # exec @command;
+    my $res = `@command`;
+    print STDERR "# '", join (' ', @command), "' returned $res\n"
+        if $config->{DEBUG};
+    print STDOUT $res;
+    exit;
+
 } 

I haven't found a way of properly integrating this into the code such that it is only used on SLES10.

Please execute proper care when using this patch - it is, after all, only a workaround.

Thorsten

comment:5 Changed 4 years ago by tbiel

I just found out that SLES 9 (Update 4) seems to have the same problem, and that the
above patch solves it here, too.

Thorsten

comment:6 Changed 4 years ago by hennings

Also the perl (5.8.8-32.el5) in CentOS 5.5 has the same problem - the patch works fine for us.

comment:7 Changed 3 years ago by cmdrclueless

I just spent most of a day trying to figure out what was going wrong. I spent a lot of time looking at the code, printing out file handle information, and using strace to track the running process. What I finally discovered is that after the initialization phase but before a network client connected, the STDOUT handle had been changed so that it no longer pointed at file descriptor 1. File descriptor 1 on Linux/Un?*x is used as STDOUT on all new processes.

The following patch seems to take care of the problem. I didn't have time to track down which module changed the Perl STDOUT handle from descriptor 1 to another descriptor.

diff -urN munin-1.4.5.orig/node/lib/Munin/Node/OS.pm munin-1.4.5/node/lib/Munin/Node/OS.pm
--- munin-1.4.5.orig/node/lib/Munin/Node/OS.pm  2011-03-30 10:52:37.000000000 -0400
+++ munin-1.4.5/node/lib/Munin/Node/OS.pm       2011-03-30 15:26:21.000000000 -0400
@@ -202,9 +202,14 @@
 
         POSIX::setsid();
 
-        # use the pipes to replace STD{OUT,ERR}
-        open(STDOUT, '>&=', fileno($out_write));
-        open(STDERR, '>&=', fileno($err_write));
+       POSIX::close(0);
+       open(my $stdin, "</dev/null");
+
+       POSIX::close(1);
+       POSIX::dup(fileno($out_write));
+
+       POSIX::close(2);
+       POSIX::dup(fileno($err_write));
 
         exit $code->(@args);
     }

comment:8 Changed 17 months ago by snide

  • Milestone changed from Munin 1.4.4 to Munin 2.0.13
  • Resolution set to wontfix
  • Status changed from new to closed

Closing the ticket as "wontfix". Telnet is practical, but not a valid method of debugging. It adds some control chars, and expects things.

Better use netcat (/bin/nc, /usr/bin/nc, ...) or even socat.

I leave the patch here so they can be used for further reference.

Note: See TracTickets for help on using tickets.