Monitoring a Hadoop cluster properly can be a lot of work. Luckily there are lots of free tools with good documentation to get you started.
Here is a short list of what I use:
Performance Monitoring
- Cluster performance: Ganglia (see the Ganglia webpage)
- Per-server graphing: Munin (see the Munin webpage)
- Graphs for Nagios checks: pnp4Nagios, which graphs every Nagios check that supports performance output (see pnp4Nagios)
Hardware and operating systems
- Nagios, with the default OS plugins.
- I prefer ‘check_by_ssh’ over NRPE; it is easy to set up and works out of the box.
- Make sure you get hardware plugins for your specific hardware, plus the generic checks for disk, CPU, memory, etc.
- Use check_process to check as many as possible of the processes that you know should be running.
- Use check_tcp to check all ports that should be open.
- Use check_ntp to make sure your cluster is in sync time-wise.
- See my separate page for more info about plugins etc.
- If your cluster is large, have a look at my Large scale implementation page.
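To illustrate the kind of port check mentioned in the list above, here is a minimal Python sketch that behaves like a Nagios plugin: it tries to open a TCP connection and returns the conventional plugin exit codes (0 for OK, 2 for CRITICAL). It is not the check_tcp plugin itself; the function name `check_port`, the default timeout, and the host and port in the usage comment are assumptions made for illustration only.

```python
import socket
import sys

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_port(host, port, timeout=5.0):
    """Try to open a TCP connection; return (exit_code, status_line)."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} is accepting connections"
    except OSError as exc:
        return CRITICAL, f"TCP CRITICAL - {host}:{port} unreachable ({exc})"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_port.py namenode.example.com 50070
    code, line = check_port(sys.argv[1], int(sys.argv[2]))
    print(line)
    sys.exit(code)
```

Nagios reads the first line of output as the status text and the exit code as the state, so a script of this shape could be run remotely through check_by_ssh in the same way as the stock plugins.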
Hadoop itself
- Nagios plugins exist for at least some of the Hadoop/HDFS components.
- Check HDFS. I wrote a quick-and-dirty Perl script that parses the output of the NameNode's web management page; it can be found here.
- This check reports free/used DFS space in the cluster, and also whether any nodes are dead or any blocks are missing or under-replicated. It also outputs performance data for nice graphs using pnp4Nagios.
- I also use a small script that runs an fsck / on HDFS; it is really simple and can be found here. It can be run, e.g., every 15, 30, or 60 minutes, depending on how you want to trade extra load against more frequent checking 🙂
- Check TaskTracker status. I wrote another Perl script (yet another HTML-parsing script) that reads the TaskTracker admin web page; it can be found here. It checks the status and the number of active machines, and supports performance output.
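As a rough illustration of what an HDFS check with performance output produces, the Python sketch below turns a handful of already-collected numbers into a Nagios status line followed by performance data after the `|` separator. It is not the Perl script mentioned above and does no HTML parsing; the function name `hdfs_status`, the 80/90 percent thresholds, and the choice of which conditions count as WARNING or CRITICAL are all assumptions.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def hdfs_status(used_bytes, capacity_bytes, dead_nodes, missing_blocks,
                under_replicated, warn_pct=80.0, crit_pct=90.0):
    """Return (exit_code, output_line) in Nagios plugin format."""
    used_pct = 100.0 * used_bytes / capacity_bytes
    code = OK
    problems = []

    # Assumed severity mapping: missing blocks or a nearly full DFS are
    # CRITICAL; dead nodes and under-replicated blocks are WARNING.
    if missing_blocks > 0:
        code = CRITICAL
        problems.append(f"{missing_blocks} missing blocks")
    if used_pct >= crit_pct:
        code = CRITICAL
        problems.append(f"DFS {used_pct:.1f}% used")
    elif used_pct >= warn_pct:
        code = max(code, WARNING)
        problems.append(f"DFS {used_pct:.1f}% used")
    if dead_nodes > 0:
        code = max(code, WARNING)
        problems.append(f"{dead_nodes} dead nodes")
    if under_replicated > 0:
        code = max(code, WARNING)
        problems.append(f"{under_replicated} under-replicated blocks")

    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    summary = ", ".join(problems) or f"DFS {used_pct:.1f}% used, all nodes alive"
    # Performance data format: label=value[UOM];warn;crit;min;max
    perfdata = (f"dfs_used_pct={used_pct:.1f}%;{warn_pct};{crit_pct};0;100 "
                f"dead_nodes={dead_nodes} "
                f"missing_blocks={missing_blocks} "
                f"under_replicated={under_replicated}")
    return code, f"HDFS {label} - {summary} | {perfdata}"
```

A real plugin would print the returned line and exit with the returned code; pnp4Nagios then graphs each `label=value` pair that follows the `|`.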
Missing anything? Let me know in a comment 🙂