Nagios for large scale installations

We have installed Nagios at larger companies, with 1000+ devices/hosts and 10000+ service checks. Most of the tuning information on the internet is focused on smaller installations, with comments like “this will never be a problem, unless you have several hundred hosts.” Well, lots of people do. We also run lots of performance graphing using pnp4nagios, plus mrtg for all network devices, and smokeping.

Here are a few key points to think about when planning your big installation:

  • Buy a new server. You can run Linux on old hardware, but in a large environment you will quickly hit the wall with an old server.
  • Use vanilla Nagios, not a derivative with extended features like Opsview. Opsview is driven by a MySQL database and does not scale as well as a plain install. It is pretty though, and the web management is a very nice feature. But everything has a cost… This one costs performance.
  • Buy CPUs with high clock frequency rather than many cores. Buy two CPUs; more cores are also nice, but not at the cost of frequency.
  • RAM is cheap; 12 or 24 GB or so should be plenty. (If you get a new processor, note that it will take memory in multiples of 3, not 2 as before, so 6, 12, 18, 24 GB etc.)
  • Disk IO. This is your most important consideration. You do not need much disk space, but you need fast disks. Either get several 15K SAS drives in RAID 10, or just go for two SSD drives in RAID 1. 2*SSD might be cheaper than 10*SAS, and you do not need the space anyway.
  • Multiple Gigabit network cards using bonding (network IO and IRQs will be a problem at some point). At least 2*Gigabit, maybe more.
  • Read this Nagios Performance Tuning
  • Ignore the last point about hardware not being an issue :)
  • With the tips from the article above and a reasonably new box, you should have no problems monitoring at least 1000-2000 hosts and 10000-20000 services. If you run higher numbers, let me know what hardware you are using!
  • pnp4nagios, mrtg and smokeping will also generate some IO, but for the numbers above it should work OK. If you still have problems, you might need to think about moving some of the checks to a separate box and delivering the results to the main box via passive checks. It can also be smart to move mrtg/smokeping to another box.
  • A few smaller things to remember:
    • Make sure your plugins have as little overhead as possible. This means using compiled code when you can; if that is not possible, use Perl, and make sure the plugins support the Nagios embedded Perl interpreter. Try not to use shell/bash scripts if at all possible.
    • Turn down the frequency of tests that do not have to run often, e.g. version checks, to every 12 hours or similar (check_interval   xxx, with xxx in minutes under the default interval_length of 60, so 720 for 12 hours).
    • Turn off performance data processing on tests that do not produce performance output (process_perf_data   0).
    • Did I forget anything? Please leave a comment with suggestions!
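
The passive-check approach mentioned above (running some checks on a separate box and delivering the results to the main box) ends with Nagios receiving a PROCESS_SERVICE_CHECK_RESULT external command. Here is a minimal sketch in Python of building and submitting such a line. It assumes you write straight to the Nagios command file on the main box; the function names, the host and service names, and the command file path are all hypothetical, and the transport between the two boxes (NSCA or similar) is left out.

```python
import time

# Plugin return codes Nagios understands: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
VALID_RETURN_CODES = (0, 1, 2, 3)

# Assumption: command file location of a default source install; yours may differ.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in VALID_RETURN_CODES:
        raise ValueError("return_code must be 0, 1, 2 or 3")
    timestamp = int(time.time() if now is None else now)
    # An external command is a single line, so collapse multi-line output.
    text = " ".join(output.splitlines())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, text)


def submit_result(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Hypothetical host/service; only prints the line, does not submit it.
    print(format_passive_result("web01", "HTTP", 0, "OK - 200 in 0.12s"), end="")
```

The host and service names must match the ones in the Nagios configuration on the main box, and the service there needs passive checks enabled, otherwise the result is discarded.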
