
Monitoring Hadoop Clusters using free tools

Monitoring a Hadoop cluster properly can be a lot of work. Luckily there are lots of free tools with good documentation to get you started.
Here is a short list of what I use:

Performance Monitoring

  • Cluster performance: Ganglia (Ganglia webpage)
  • Graphing of individual servers: Munin (Munin webpage)
  • Graphing of Nagios checks: pnp4Nagios, which graphs every Nagios check that supports performance output (pnp4Nagios)
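
To illustrate what "performance output" means here: a Nagios plugin prints one status line, and anything after a pipe character is treated as performance data in the form 'label'=value[UOM];warn;crit;min;max, which is what a grapher such as pnp4Nagios picks up. Below is a minimal Python sketch of building such a line. The helper names (format_perfdata, plugin_line) and the example metric dfs_used are made up for illustration and are not part of any of the tools above.

```python
def format_perfdata(label, value, uom="", warn=None, crit=None,
                    minimum=None, maximum=None):
    """Build one performance-data item: 'label'=value[UOM];warn;crit;min;max."""
    fields = [f"{value}{uom}"]
    for item in (warn, crit, minimum, maximum):
        fields.append("" if item is None else str(item))
    # Trailing empty fields carry no information, so drop them.
    while len(fields) > 1 and fields[-1] == "":
        fields.pop()
    return f"'{label}'=" + ";".join(fields)


def plugin_line(status, message, perf_items):
    """Join the human-readable text and the performance data with a pipe."""
    return f"{status} - {message} | " + " ".join(perf_items)


if __name__ == "__main__":
    # Hypothetical metric: percentage of DFS space used, warn at 80, crit at 90.
    item = format_perfdata("dfs_used", 42, "%", 80, 90, 0, 100)
    print(plugin_line("OK", "DFS used 42%", [item]))
```

Any check that prints its numbers this way should show up as a graph without further work on the plugin side.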

Hardware and operating systems

  • Nagios, with the default OS plugins.
    • I prefer ‘check_by_ssh’ to NRPE: it is easy to set up and works out of the box.
    • Make sure you get hardware plugins for your specific hardware, plus generic checks for disk, CPU, memory, etc.
    • Use check_process to check as many as possible of the processes that you know should be running.
    • Use check_tcp to check every port that should be open.
    • Use check_ntp to make sure the clocks in your cluster are in sync.
    • See my separate page for more info about plugins etc.
    • If your cluster is large, have a look at my Large scale implementation page.
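
If a port check needs custom logic that the stock plugins do not cover, a plugin is just a program that prints one line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Here is a minimal Python sketch of a TCP port check in that style. It is not a replacement for check_tcp; the script name check_port.py and the function names are hypothetical.

```python
import socket
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_port(host, port, timeout=5.0):
    """Return (exit_code, message) depending on whether host:port accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} accepts connections"
    except OSError as exc:  # refused, timed out, unreachable, DNS failure
        return CRITICAL, f"TCP CRITICAL - {host}:{port} unreachable ({exc})"


def main(argv):
    if len(argv) != 3:
        print("UNKNOWN - usage: check_port.py HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print(f"UNKNOWN - port must be a number, got {argv[2]!r}")
        return UNKNOWN
    code, message = check_port(argv[1], port)
    print(message)
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Because the result is carried by the exit code, the same script can be run locally or wrapped in check_by_ssh without changes.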

Hadoop itself

  • Nagios plugins exist for at least some of the Hadoop/HDFS components.
    • Check HDFS. I wrote a quick-and-dirty Perl script that parses the output of the namenode's web management page; it can be found here.
    • This check reports free/used DFS space in the cluster, and also whether any nodes are dead or any blocks are missing or under-replicated. It also outputs performance data for nice graphs using pnp4Nagios.
    • I also use a small, really simple script that runs an fsck / on HDFS; it can be found here. It can, e.g., be run every 15/30/60 minutes, depending on whether you prefer less load or more frequent checking 🙂
    • Check the tasktrackers using this script.
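
For readers who want to roll their own fsck check, here is a rough Python sketch of the idea. It is not the script linked above. It assumes the fsck report contains a line saying the filesystem "is HEALTHY" or "is CORRUPT" and counter lines such as "Under-replicated blocks:", "Missing blocks:" and "Corrupt blocks:"; the exact wording differs between Hadoop versions, so check it against your own output. The command name hdfs fsck / is also an assumption (older releases use hadoop fsck /), and the function names are hypothetical.

```python
import re
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_fsck(report):
    """Turn the text of an fsck report into (exit_code, status_line)."""
    def count(label):
        # Assumed line format: "<label>:<whitespace><number> ..."; 0 if absent.
        match = re.search(rf"{re.escape(label)}:\s*(\d+)", report)
        return int(match.group(1)) if match else 0

    under = count("Under-replicated blocks")
    missing = count("Missing blocks")
    corrupt = count("Corrupt blocks")
    perf = f"under_replicated={under} missing={missing} corrupt={corrupt}"

    if "is CORRUPT" in report or missing or corrupt:
        return CRITICAL, f"HDFS CRITICAL - missing or corrupt blocks | {perf}"
    if "is HEALTHY" not in report:
        return UNKNOWN, "HDFS UNKNOWN - no health status found in fsck output"
    if under:
        return WARNING, f"HDFS WARNING - {under} under-replicated blocks | {perf}"
    return OK, f"HDFS OK - filesystem is healthy | {perf}"


def main():
    try:
        # Assumption: the hdfs command is on PATH for the user running the check.
        result = subprocess.run(["hdfs", "fsck", "/"], capture_output=True,
                                text=True, timeout=600)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"HDFS UNKNOWN - could not run fsck ({exc})")
        return UNKNOWN
    code, message = evaluate_fsck(result.stdout)
    print(message)
    return code


if __name__ == "__main__":
    sys.exit(main())
```

Keeping the parsing in its own function makes it easy to test against a saved fsck report before pointing the check at a live cluster.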

Missing anything? Let me know in a comment 🙂
