MuninConf 1

It took place on IRC on 12/06/2013, from 2100 CET to 2300 CET.

Here is the log:

<@TheSnide> Hi everyone, thanks for joining #munin-conf !

Please do not ask questions here, but in #munin.

I'll be reading them in parallel, and I'll try to adapt the talk according to them, or just save the reply for the Q/A session.

The conference will be split in 3 major parts, each followed by a Q/A session.

1. State of Munin (past, present & future)

2. How to scale.

3. Rant Fest. A quite informal constructive brainstorm on what issues should be solved

but first, let's hand the "mic" to janl for an intro :)
<+janl> Hello everyone on irc and at NSA of course
Snide asked if I have anything I could say. Since I'm solidly out of munin development I have nothing relevant to say about munin specifically
so I'm going to speak about "Release often, Release early"
which I was not very good at when I maintained Munin some years ago
(so the reason snide asked me is that I was the previous maintainer of munin)
many years ago I was the util-linux maintainer for a year or two
I had patched it up and ... didn't release a new version
  * h01ger waves with his debian munin maintainer hat
<+janl> I got called out for having release-nerves
I did have that
and then I was told to release
and I did
for a while I released new util-linux versions quite regularly
fast forward many years and I became the munin maintainer
please forgive me for I was not very good at it
there were so many tickets in the trac
and I wanted to fix a lot of them before making releases
so I rationalized and postponed
and the munin releases were few and far between
we wrote some docs, that was good
< dipohl> :-)
<+janl> we fixed some more bugs than before, that was good
but I was otherwise too passive
In retrospect I broke the law of Release often, Release early
it would have been better if I followed it
Snide is much better at this, for this we can be thankful
but the law also applies to plugin authors
  * TheSnide blushes
<+janl> I know for a fact that there are hundreds of munin plugins hidden away
supporting many interesting things
netapps, oracle rac, and many other cool or enterprise or otherwise interesting things
people think they are not good enough to show to others
but that isn't the point
if you have a plugin for monitoring "gazonk" that works for you then release it
even if it sucks
because the next person that needs to monitor "gazonk" will enhance it
and here is the second obligation of the open source developer, send in your patches
thankfully GIT rocks at this
so, Release early, Release often, and send in your patches.
< dipohl> thank you :)
<+janl> over to TheSnide
<@TheSnide> Clap clap clap ... ! :)
  * ssm cheers :D
< h01ger> :)
< dipohl> Concerning the plugins: I am still interested in the _vetted_ plugin collection :-=
<@TheSnide> janl: thx. i know you have a heavy schedule, so big kudos for the very nice talk

dipohl: more on that later #teaser

as janl did ask for an early leave perm, we can have a short Q/A session
< dipohl> I am confused about what is the right place for documentation nowadays
<@TheSnide> oh, he left already :)
< dipohl> shall we open a wiki board for Q/A Session?
<@TheSnide> dipohl: i'll parse the logs from here, and put them online on our wiki
< dipohl> then you can select and order them
I would like to have a "summary" view, not only the logs
<@TheSnide> for q/a, just ask here.

dipohl: yeah, i'll postprocess them :)
< dipohl> :-)
<@TheSnide> about the doc, _that_ a real issue.
  * ssm agrees
<@TheSnide> ssm initiated a wonderful doc/ subdir, which compiles to a nice website/book.

and i agree that it competes with the wiki.
  * dipohl saw it and liked it :-)
<@TheSnide> i'll say more on that later
< dipohl> but we have to find rules
<@TheSnide> so, let's continue the lectures if you agree

... ... ...

starting with the 1st part : "State of Munin"

Genesis of Munin 2.0

I discovered Munin in 2007, with version 1.2.

janl was the current maintainer (as he said)

I was using Cacti at the time, but fell in love with the ease of writing custom plugins in Munin.

The fact that it was in Perl was also a good point to me, as I was much more fluent in Perl than in PHP.

The fact that it only uses text files rather than a full-fledged MySQL also helped a lot.

But I quickly grew frustrated with the default views as they didn't offer any details, so I started to write a zooming extension.

It was quite limited at first: only the IMG size was configurable.

Then slowly I designed the "pinpoint" to be able to have any start and stop epochs.

I based my hack on a very rough version of CGI (remember, it was in the 1.2 days...).

After a while, the 1.4 came out, and it mostly changed everything I based my hack on.

(yeah, janl does have an enormous coding power)


So, i rewrote the whole hack, and decided to release it.

(i didn't want to be alone maintaining it)

It was accepted for merging into the 1.5 version.

That was in the beginning of 2010 !

(... 3 years ! )

After a *long* time, janl had to step down from his leadership and I did a mostly peaceful takeover.

well.. *mostly* :)

on an unrelated note, we had some issues with our exchange website, which we finally moved to github.

More on that in a later talk, if there's some interest.
  * dipohl has
<@TheSnide> Then (too) many features crept in, and h01ger finally managed to convince me to release 2.0, in 2012...

(... another 2 years ! )
< dipohl> so 5 together ;)
<@TheSnide> Well, it was rather a bumpy ride, since testing on the alpha and beta releases was obviously not enough.

Yet, I still think he was right.

As we only had a very small testing power on our hands, we had to rely on unwary users. Sorry for that :)

(that's where i rejoin janl)

So, after that, well, the move from svn to git was quite uneventful.

It made it *much* easier to federate the contributions, and increased the rate of user-contributed bugfixes. Thanks to you !

Current state of Munin 2.0

So, looking at the current state, I'm quite proud. Most bugs are ironed out, munin2 is now in almost every major distribution.

Even Debian stable !!

The relationship with distrib maintainers has improved a lot,

(not that we had bad relations)
< dipohl> b.t.w. I would like to talk about the Epel Distro (later)
<@TheSnide> just that it was quite tenuous.

I think it's mostly due to my presence on IRC that acted like a catalyst on the whole team.

The only grey area is the Ubuntu distrib that doesn't have a "dedicated" maintainer (yet ?).

... and others that are not "major" ones. #unleashtrolls
< dipohl> and packman is a shady landscape..
<@TheSnide> that said, the least tested part is the newest one : async.

It doesn't even have proper doc ! :)

That said some bug reports start to come, as more and more users are testing/suffering.

most are due to the poor state of the documentation, and misunderstandings.

but some real, nasty, bugs are still in here.

so, as I already said before : I won't offer the best product. But I try hard to offer the best support :)
< dipohl> :-)
  * TheSnide has a laggy isp :-/ #not_now !
<@TheSnide> anyway, I think the doc is *the* thing to fix right now.
Thanks to ssm, it's now in the very easy RST format, located in doc/ on the git repository, so you _can_ contribute.
Ideally, if you write a little piece of doc anytime you get an answer, as a "thank you" gift, everyone would benefit :-D
< dipohl> we need some rules which part of docs to place there and which in the wiki
<@TheSnide> "* ssm would like to use the man pages generated out of the /doc directory, instead of the pod-generated ones." <-- that would be nice
< h01ger> asciidoc makes this trivial
<@TheSnide> so.. more on the docs later (5 min, promise)
  * h01ger would suggest to write the docs in some git repo, probably just the main munin.git repo. wikis lead to lots of different docs + infos
<@TheSnide> The SSH transport has also received a lot of attention these days, as it's the standard transport for async.
< h01ger> and before i forget: /me would like to see unin.png go down a lot in the next year. thats a lot of legacy we didnt want to touch as wheezy was frozen the last half year, but now its time to tackle those 60 bugs in debian and bring that down to 20 or less. /me thinks out of those 60, probably 40 are upstream issues. plus i think it would be good to split the plugins into many many binary packages (coming from one source package) and to package the munin-contrib plugins. as i have tons of other things to do, __help__ __is__ __definitely__ __most__ __welcome__! :)
<@TheSnide> It also integrates itself nicely in a secured network.
< ssm:#munin> And even a question is helpful, since it indicates that the answer is hard to find
<+ssm> h01ger: the docs in n/tree/devel/doc ends up as, and also makes man pages, which are not the set used for the moment
< h01ger> ssm, thats munin.git/doc ?
<+ssm> yes
<@TheSnide> yes, that's why i try to add a section each time i get a question :)
< h01ger> ssm, nice
<@TheSnide> .... Bah, let's start the Q/A session... as h01ger already started :p
<+ssm> :)
< h01ger> :)
< neteng> How long before proper docs come out for async?
<@TheSnide> for docs, i think the wiki was a great tool. but nowadays with the advent of git, github & readthedocs, doc/ is better.
neteng: depends on *you* :-)
< chteuchteu> Indeed, time = work / contributors ;)
<+ssm> neteng: I'm deploying munin-async to about 1k nodes very soon, I expect to write useful docs as part of that.
< neteng> nice

I just did it to about 14 nodes and small issues here and there cropped up
<+ssm> There is some doc at ode/async.html
< ze-> Any code-oriented documentation anywhere, or plans to have such ? :)
<@TheSnide> ssm: yup. i still have to complete it with several things lying around here
ze-: meaning ?
the code needs no doc. it _is_ the doc :)
< dipohl> e.g. doxygen
< ze-> yeah. well, some doc would help prevent having to read it all.
something generated from comments (like doxygen i guess, or another) would help.
<@TheSnide> ze-: such as ?
< ze-> Would just need to check the code, and add those comments. :)
<+ssm> ze-: comments are not guaranteed to be in sync with the code :)
  * TheSnide tries hard to write readable code.
< ze-> ssm: they still would be easier to update with the code, than a separate documentation about the code.
<@TheSnide> but, as munin is more than 11 years old... some areas feel like geology :)
<+ssm> true
<@TheSnide> 1st commit ~ 2002
1st commit in svn ~ 2004
at the time the project was called lrrd IIRC
and, the Joel rule of software was mostly true. It takes 10y to have a good soft :)
munin2 was released around the 10th birthday :)
for the perf-interested souls, it'll begin at 2200
... or it can begin earlier if no more questions
< dipohl> It would be good to have a comments function for *all world* in the docs

like mysql does

often you find good recipes there :-)

and the threshold is high, if you need a github account to contribute to the docs
< ze-> mysql has pages that don't change (much). Munin tends to move fast... doubt it would be easy to integrate for now.
<@TheSnide> dipohl & ze-: i'm lost
< dipohl> you talk too much on the other channel ;)
  * h01ger is also lost with these two channels
<@TheSnide> dipohl: "comments function for *all world*"
< dipohl> yes
<@TheSnide> dipohl: what do you mean ?
< dipohl> you want to give-up the wiki

if I understand it right?
< ze-> check elect.html
< zembutsu> ri
< ze-> at the end, you have Posts by users.
<@TheSnide> well, i think it's the right move
< ze-> dipohl: that's what you had in mind, wasn't it ?
< dipohl> yes

comment function for anyone
<@TheSnide> ohh. like the "annotated" pgsql doc ?
< dipohl> I suppose
< ze-> For mysql, the documentation doesn't change, so pages last for long, and users add comments/help/...
  * TheSnide tends to pirate everything that pgsql does :)
<+ssm> I'd say that "github" also does things easier. If you're logged in, you can hit n/blob/devel/doc/node/async.rst, and press the "edit" button. A pull request will be created if you save.
<@TheSnide> so, i have 2202 here. let's proceed to the next step

... ... ...

The 2nd part is more about "how to scale".

I noticed that munin does scale quite nicely. Especially compared to 1.4.x.

yet... i also noticed some anti-patterns in IRC :)

Some obvious pieces of advice would be :

(to scale)

#1 use a CGI only setup

#2 even FastCGI only.

#3 use RRDCACHED. Even on SSD.

(more details on that by ssm just after)

#3a do NOT read from RRD yourself. EVER.

(just added it 3 hours ago from tabakhase)


#4 NFS for RRD works, but beware about any shared HW. Munin has the tendency to annihilate any HW it runs on.

(special for ze-)

#7 have RAM, and abuse it via huge RRDCACHE buffers.

#5 do NOT swap. prefer to use less FastCGI or munin-update workers.

(oops, mixed the ordering :/)
< dipohl> #6 still missing ;)
<@TheSnide> and yes, some ppl did want to parallelize more, but didn't add enough RAM

#6 use async, in order to limit the number of m-u needed, and to avoid losing some history.


#7 In case you want more precision, always retain the same granularity in the RRA. It'll scale well, until disk full :)

same granularity == have an RRA for each pixel in each graph you'll show.
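As a rough back-of-the-envelope illustration of #7 (not part of the talk): keeping the finest granularity for the whole retention period means one RRA row per update step. The sketch below assumes 8 bytes per stored value and three consolidation functions per data source; both numbers, and the function names, are assumptions made for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_VALUE = 8  # assumption: one double per stored value


def rows_needed(step_seconds, retention_days):
    """Rows an RRA needs to keep one value per step for the whole period."""
    return retention_days * SECONDS_PER_DAY // step_seconds


def approx_bytes(step_seconds, retention_days, consolidations=3):
    """Rough size for one data source; headers and metadata are ignored."""
    rows = rows_needed(step_seconds, retention_days)
    return rows * BYTES_PER_VALUE * consolidations


if __name__ == "__main__":
    # Compare the classic 5 min step with a 10 s step, both kept for a year.
    for step in (300, 10):
        rows = rows_needed(step, 365)
        size_mib = approx_bytes(step, 365) / 2**20
        print(f"step={step}s: {rows} rows/year, ~{size_mib:.1f} MiB per data source")
```

Multiply the result by the number of data sources on the master to see how quickly "until disk full" arrives.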
< dipohl> how about multi-graphing plugins?
do you spare disk space also with them?
<@TheSnide> #8 use fast plugins.

that would be the multigraph plugins.

as, they manage to minimize the roundtrips and exec for many graphs

and, no, no diskspace spared with them. as they are just stored the same way as normal plugins
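To make the multigraph point concrete, here is a minimal sketch of a plugin that serves two graphs from a single exec, separating them with "multigraph" lines. The graph names, fields and values are hypothetical placeholders; a real plugin would collect its values once and reuse them for every graph.

```python
import sys

# Hypothetical graphs and fields, for illustration only.
GRAPHS = {
    "gazonk_requests": {"title": "Gazonk requests", "fields": {"ok": 120, "failed": 3}},
    "gazonk_latency": {"title": "Gazonk latency", "fields": {"avg": 0.25}},
}


def config_lines():
    """Config output: one 'multigraph' section per graph."""
    lines = []
    for graph, spec in GRAPHS.items():
        lines.append(f"multigraph {graph}")
        lines.append(f"graph_title {spec['title']}")
        for field in spec["fields"]:
            lines.append(f"{field}.label {field}")
    return lines


def fetch_lines():
    """Value output: all graphs answered in the same run."""
    lines = []
    for graph, spec in GRAPHS.items():
        lines.append(f"multigraph {graph}")
        for field, value in spec["fields"].items():
            lines.append(f"{field}.value {value}")
    return lines


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "fetch"
    print("\n".join(config_lines() if mode == "config" else fetch_lines()))
```

Each graph still gets its own RRD files, which is why no disk space is saved; the gain is one roundtrip and one exec instead of two.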

I'll let ssm speak about #3 and #3a in more detail.
  * TheSnide hands the mic
<@TheSnide> [ the mic fell on the floor :) ]
<+ssm> *pok* *pok* *pok* this thing on? :)
  * ssm is going to say something about scaling with rrdcached…
<+ssm> Running a large munin master is all about herding bottlenecks. Memory, CPU and storage IO are candidates for "bottleneck of the day"
(and by "large", I mean "between 200 and 600 nodes", we hope to scale better as time goes on)
Munin used to be cron driven.
Then we got CGI.
Which then became "FastCGI". (quotation marks intended). FastCGI is great, but it was not really fast, nor particularly stable, until munin 2.x. (The FastCGI container would restart FastCGI scripts which died, leaving us with a missing image now and then)
A bit about the pieces of munin that need attention:
munin-update: Connects to all the nodes, in parallel, and reads plugin config and values.
munin-update: Writes to RRD files as fast as it can. A great candidate for IO bottlenecks.
munin-html: When in cron mode, writes all static pages to a directory. This is really a disk killer. This is a single threaded process, which made it take a lot of time as well.
munin-html: When in cgi mode, writes a storable file for munin-cgi-html.
munin-graph: When in cron mode, writes all the daily graph images to disk, every 5 minutes. Weekly, monthly and yearly graphs are graphed less often.
When you switch to "cgi" mode, you save a lot of continuous writing to disk. For shared storage, this counts big. :)
munin-cgi-html: Generates web pages, on demand.
Just an aside: munin-cgi-html does not read the configuration in /etc/munin, at all. It reads the data structures left by "munin-html", which runs every 5 minutes. That means:
If you change something, you need to run "munin-html". This will write a new storable, and munin-cgi-html will pick this up.
The next page load, after munin-html has run, will be slower, since the data structure in the storable will be loaded into the fastcgi process.
  * ssm had a multi-gigabyte storable file. Acceptable for FastCGI, but _not_ in CGI mode :)
<+ssm> On a master with a few hundred nodes, this will be noticeably slower. (As in: "*dammit* *punch reload page* "-slower) Not helping.
munin-cgi-graph: Serves images on demand, but checks if the cache has a recent enough image.
If not: Generate graphing command for rrd, which should hopefully result in some sort of image, describing exactly what our original problem is, so we can solve it.
CGI and FastCGI are a tradeoff. Munin does not spend time graphing everything, and we wait a bit more for each graph to be generated.
We lose simplicity (serve static files from this directory).
We lose web page serving speed (images and pages are generated).
We gain capacity. (store raw data only. Do not read all RRD files, and do not write all images and pages every 5 minutes)
This enables us to scale to a larger amount of nodes per master.
then to the next item on the "how-to-scale" list:
Scaling with rrdcached
The RRD cache daemon helps with the following:
Writes: All the writes from munin-update are spooled, and RRD files are written in the background.
Reads: For any RRD files read, the RRD cache daemon will write any spooled data for them, and then say "OK, you can read that file".
(or serves the data, /me is a bit unclear about that, but it does not matter :)
The effect is dramatic: The image at describes what happened when I added rrdcached to a busy munin master.
That is storage on a pair of mirrored SSDs.
Note that this is a logarithmic graph.
Now, setting up rrdcached is not hard, but just installing it out-of-the-box is not enough.
What is needed?
An instance of rrdcached with the ability to read and write to the munin directory.
A line of configuration for munin, telling it to use rrdcached.
  * ssm suggests you get the rrdcached going before you configure munin :)
<+ssm> ---
now: Supervise this process
If rrdcached stops, munin will stop. Run this process supervised. That means:
If you've got Ubuntu, write an upstart config for it. If you've got Debian, you've lots of choices, including upstart, systemd, monit, runit.
if you got a RedHat based system, you also have a lot of choices.
if you're out of choices, write a cron job. Better than nothing :)
< TheSnide: good part is, if rrdcached dies, write will get slow, but still here. >
There are a few important flags to be used for rrdcached to know where to read and write:
you need to tell it about the munin directories, with "-B -b /var/lib/munin/ -j /var/lib/munin/rrdcached-journal/" (restrict it to the basedir, and write a journal to a convenient subdir)
A communications socket, for munin to connect to: "-m 0660 -l unix:/run/munin/rrdcached.sock". For FastCGI, the "www-data" user, or another user, needs access as well.
then tell it to always flush data on shutdown with "-F". This means that on reboot, the disk will be busy. Don't keep a large backlog and expect fast boot. :)
the "-w $seconds" flag tells rrdcached to keep data for a certain amount of time before writing, and "-z $seconds" to add a random factor. Finally "-f $seconds" to flush all data periodically.
Using "rrdcached" to scale munin is documented at master/rrdcached.html -- Any contributions would be welcome. :)
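The flags above can be assembled into one command line. The sketch below only collects the options quoted in the talk, in the same order ("-m" before "-l"); the helper name is made up, the timing defaults are the "-w 1800 -z 1800 -f 3600" values ssm quotes later, and anything a supervisor needs on top of this is left out.

```python
import shlex


def rrdcached_argv(base_dir="/var/lib/munin/",
                   journal_dir="/var/lib/munin/rrdcached-journal/",
                   socket_path="/run/munin/rrdcached.sock",
                   write_timeout=1800, write_jitter=1800, flush_interval=3600):
    """Build an rrdcached command line from the flags named in the talk."""
    return [
        "rrdcached",
        "-B",                         # restrict access to the base directory
        "-b", base_dir,               # the munin directory
        "-j", journal_dir,            # journal in a convenient subdir
        "-m", "0660",                 # socket permissions, given before -l
        "-l", f"unix:{socket_path}",  # socket munin connects to
        "-F",                         # always flush data on shutdown
        "-w", str(write_timeout),     # keep data this long before writing
        "-z", str(write_jitter),      # random factor added to the delay
        "-f", str(flush_interval),    # periodic flush of all data
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into a supervisor config.
    print(shlex.join(rrdcached_argv()))
```

Printing the command instead of executing it keeps the sketch free of side effects; the paths are the ones from the talk and may differ on your distribution.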
<@TheSnide> clap clap clap !
< chteuchteu> * clap clap clap :) *
<+ssm> ( The draft of this talk is available at )

Q: < hron84 > ssm: how rrdcached tunes up graph creating? It can solve 'too much rrd' problem?

rrdcached takes the large "random write" load off the disks, so reads happen much faster.
<@TheSnide> ssm: also, as rrdcached only spools the string "UPDATE /bla/bla/bla.rrd" it doesn't need to *read* the header just yet
<+ssm> Q: < AndreiStaicu > you said something about huge rrdcache buffers and lots ram. how is that achieved? just by setting the -w high?
  * ssm will check rrdcached usage on a busy host before answering :)
<@TheSnide> AndreiStaicu: i used -w 3600 -z 3600, and that bumped the RAM usage a lot
<+ssm> on a 400 node master, rrdcached uses 81MB resident ram size with the settings at -- The rrdcached-journal _disk_ usage is 250MB.
<@TheSnide> while making the IO go almost nil.
<+ssm> …with "-w 1800 -z 1800 -f 3600"
<@TheSnide> yes, it all depends on the number of RRD files.

therefore on the ratio rrd/node
<+ssm> TheSnide: is this with async nodes, with higher density than 5 minutes?
<@TheSnide> ssm: yup, 10s
  * ssm would think this increases rrd write pressure a bit :)
<+ssm> and rrdcached should take _well_ care of that

hron84 spotted a typo in the docs. Thanks :)
  * ssm put that typo in production on a friday afternoon last week before leaving in a hurry. _big_ hole in the graphs.
<@TheSnide> so, about the #3a, since ssm didn't mention it

#3a do NOT read from RRD yourself. EVER.
< ze-> what do you mean by read from rrd ? :)
<@TheSnide> munin takes *extra* care not to read needlessly from the rrd files

munin-limits was historically done with a call to "rrd last" which got the last updated value

that was horrible for rrdcached. as each rrd "read" operation connects to the socket, and asks for a flush in the said rrd file.

[ that's why the socket perms are *write* for any user that needs to read the rrd files ]

so, on the old behavior, each munin-cron run, munin-limits was effectively reading from the whole rrd files, asking to flush them *right now*

even when using a full cgi stack.

so, each rrd was only in flight for less than 5 min. negating most of the benefit of rrdcached
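A toy model shows why this hurts. The sketch below is not rrdcached's actual implementation, just a simplified write-back cache: one update per file every 5 minutes, written to disk either when the oldest spooled update reaches the write timeout or when a reader forces a flush. All names and numbers are assumptions for illustration.

```python
def disk_writes(n_files, hours, write_timeout, flush_all_every=None, step=300):
    """Count RRD file writes in a toy write-back cache."""
    writes = 0
    oldest = [None] * n_files  # time of the oldest spooled update, per file
    for now in range(0, hours * 3600, step):
        # The updater spools one update per file on every run.
        for i in range(n_files):
            if oldest[i] is None:
                oldest[i] = now
        # A reader touching every file forces a flush of every file.
        forced = flush_all_every is not None and now % flush_all_every == 0
        for i in range(n_files):
            expired = now - oldest[i] >= write_timeout
            if forced or expired:
                writes += 1
                oldest[i] = None
    return writes


if __name__ == "__main__":
    quiet = disk_writes(100, 24, write_timeout=3600)
    noisy = disk_writes(100, 24, write_timeout=3600, flush_all_every=300)
    print(f"cache left alone: {quiet} writes/day, reader every 5 min: {noisy} writes/day")
```

With a reader hitting every file on each run, the cache degenerates to one disk write per update, which is the old munin-limits behaviour described above.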

Q: < AndreiStaicu:#munin> is munin-limits really necessary, if i don't want alerts?

A: well, it also computed the nice colors and warning/critical sub views.

so, the munin-limits in munin2 has been refactored to make use of a new file "state-*.storable" that stores the last value of each plugin.

that one is updated by munin-update, and so munin-limits doesn't need to hit the rrd anymore.

so, there's a nagios plugin lying around that directly interrogates the rrds.

it dates from the 1.4.x era.

it works well on a non rrdcached setup. but 2 things can happen on rrdcached :

#1 it isn't rrdcached aware, and has therefore only old data as it doesn't ask the cache to be flushed

#2 it is, and then has the side effect of flushing *every* cache
  * ze- like the #1 state :)
<@TheSnide> :)

that would end the #3a.

for the #3b ... [ reading backlog ]

it's the #1 i just mentioned.

you'll have old data. especially if you are caching aggressively. (high -w values)

[ for the record, #3b was "do *not* bypass rrdcached" ]

just a note about the node perfs:

On the node side, you have to take extra care not to have slow plugins.

The stock vmstat is a pure example of what not to do : sleep() !

as, you'll hit another issue of munin scaling : the number of concurrent munin-update

So 2301 here, let's move to the last part of the conf

3. Rant Fest. A quite informal constructive brainstorm on what issues should be solved


for that it seems it already started bigtime in #munin :)

I'll write just several words on my plans for munin's future if there's any interest.
< chteuchteu> There is! :)
  * ssm agrees
<@TheSnide> first is to improve the docs.
i deeply agree that janl did a much better job than me on that part.
the good thing is that *you* can help. no coding skill needed :)
i'm seriously thinking about moving to a git+rst only doc, and progressively doing some 301 to the updated versions
(301 from the wiki, i mean)
as, the wiki is slowly showing its age, and mostly relevant to 1.4
it's also way easier to contribute to the rst now than to the SVN before. Mostly thanks to the nice UI from github.
second is to improve our bug triage.
the bug list is in a deeply sorry state. [ hint, don't go there to check. you'll despair ]
most bugs are either irrelevant now, or fixed.
many bugs are spam.
and a significant part of the bugs are real bugs.
< fenris02> TheSnide, wasnt readthedocs more or less a rewrite / update already?

or was it more of a copy/paste from old to new home?
<@TheSnide> fenris02: the rst was mostly ssm's work.
he rewrote many things, and copy/pasted some others.
< dipohl> As long as there are munin instances of 1.4 in the wild
<@TheSnide> thing is, rst is geared toward 2.0. wiki is geared toward 1.4
< dipohl> I wouldn't close the wiki

set a warning, that this is old stuff
<@TheSnide> dipohl: idea is to redirect the wiki to rst. with a page that details the 2.0 stuff, and hints on what it was in 1.4
< dipohl> so hold the old pages?
<@TheSnide> old stuff warning is just a sign of "this project is dead"
< dipohl> in archive mode?
<@TheSnide> nah, copy/paste the page into rst, in a "1.4" section if different
< dipohl> of course you can say: "We have moved. Look for the current docs here"

eh there ;)
< chteuchteu> Maybe redirect wiki to rst, and move wiki content to rst ?
Ow, too slow :p
<@TheSnide> well, it's more a roadmap. details will be ironed out

third is more a feature

--> node-push.
< dipohl> concerning the wiki - will it still be the munin homepage, or will that change as well?
<@TheSnide> the url won't move :)

good urls *never* move :)
< dipohl> I am not talking about the domain
but the application (Trac)
<@TheSnide> i know. dunno yet.
< dipohl> as entry point for Munin's homepage
<@TheSnide> about node-push, i'm (slowly) working on a new feat, that enables nodes to "push" their payload at random times.

it has several steps :

#1 is async based.

#2 will be merged with munin-update to be event-based.

#1 proved to be quite cumbersome.

idea was to have a CGI that got the data from the HTTP POST, and wrote in a spooldir.

normal munin-async would read it at regular times.

code was simple to write, but it was quite difficult to secure correctly, and to have an easy configuration, well...
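The "simple to write" half of step #1 could look roughly like the sketch below: take a pushed payload and drop it into a spool directory without ever exposing a half-written file. This is a generic illustration, not the code TheSnide wrote; the function name, the file naming scheme and the node-name check are all hypothetical, and the HTTP handling, authentication and configuration (the hard part) are left out.

```python
import os
import tempfile
import time


def spool_payload(spool_dir, node_name, payload):
    """Write one pushed payload into spool_dir, visible only once complete."""
    # Refuse names that could escape the spool directory.
    if not node_name or node_name.startswith(".") or any(c in node_name for c in "/\\\0"):
        raise ValueError("suspicious node name")
    # Write to a hidden temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(prefix=".incoming-", dir=spool_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        # Hypothetical naming scheme: node name plus a nanosecond timestamp.
        final_path = os.path.join(spool_dir, f"{node_name}.{time.time_ns()}.push")
        os.replace(tmp_path, final_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A reader polling the directory at regular times then only ever sees complete files.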

so, i investigated the 2nd one.

it needs to "upgrade" the storable-based files to an SQL-based one. DBI-based one to be exact.

that would mean sqlite for small setups, and pgsql for bigger ones.

and it also enables random updates. no more 5 min dictator.

regular nodes would still be polled at the 5 min rate, but by a munin-poller that would also push updates to munin-update

node & plugins will also get a "streaming" mode. that will enable sub 5min realtime feed back.

Q: < ssm:#munin> TheSnide: will munin-async be integrated into munin-node eventually?

A: yes.

plugins will accept a new cmd "stream", which means they will continuously deliver values, only exiting upon "config" changes
< ze-> that will make various delay plugins easier to make
<@TheSnide> yup
< ze-> as they won't have to store all data themselves :)
<@TheSnide> exactly.

and 1s plugins will become quite lightweight

4th part is UI.
< ze-> so, the munin-node will potentially have 1 connection per running plugin at once, plus one to the master(s) ?
<@TheSnide> it does suck. not as much as doc. but still :)

ze-: yes

i'd like to have an HTML5-only UI. all dynamic parts in JS.
< ze-> do you keep in mind having multiple masters for a single node ?
<@TheSnide> and dump the whole HTML CGI + templating
< ze-> I mean as a possibility
<@TheSnide> multiple masters for 1 node will be quite difficult in the node-push way of things. it will be addressed by the node-async merge

as i'd like to limit the complexity of munin. As one of our great strengths is "ease of install"

as more moving parts means more bugs. and the issue with bugs is not really fixing them, but finding them.

Last modified at 2013-07-16T18:12:47+02:00