Hadoop Tuning Notes

This is a quick dump of my notes for hadoop tuning.

General

Hadoop is designed to use multiple cores and disks, so it will be able to take full advantage of more powerful hardware.
Don’t use RAID for HDFS datanodes as redundancy is handled by HDFS. RAID should be used for namenodes as it provides protection against metadata corruption.
Machine running the namenode should run on a 64-bit system to avoid 3GB limit on JVM heap size.
In case the cluster consists of more than one rack, it is recommended to tell Hadoop about the network topology. Rack awareness will help Hadoop in calculating data locality while assigning MapReduce tasks and it will also help HDFS to choose replica location for files more intelligently.
Two processors will be engaged by datanode and tasktracker, and the remaining n-2 processors can have a factor of 1 to 2 extra jobs.
On Master node, each of namenode, jobtracker and secondary namenode takes 1000M of memory. If you have a large number of files than increase the JVM heap size for namenode and secondary namenode.

Configuration Parameters

HDFS

dfs.block.size: The block size used by HDFS which defaults to 64MB. On large clusters this can be increased to 128MB or 256MB to reduce memory requirements on namenode and also to increase the size of data given to map tasks.
dfs.name.dir: should give a list of directories where the namenode persists copies of data. It should be one or two local disks and a remote disk such as nfs mounted directory so that in case of node failure, metadata can be recovered from the remote disk.
dfs.data.dir: specifies the list of directories used by datanodes to store data. It should always be local disks and if there are multiple disks then each directory should be on different disk so as to maximize parallel read and writes.
fs.checkpoint.dir: list of directories where secondary namenode keeps checkpoints. It should use redundant disks for the same reason as dfs.name.dir
dfs.replication: number of copies of data to be maintained. It should be at least 2 more than the number of machines that are expected to fail everyday in the cluster.
dfs.access.time.precision: The precision in msec that access times are maintained. It this value is 0, no access time are maintained resulting in performance boosts. Also, Storage disks should be mounted with noatime which disables last access time updates during file reads resulting in considerable performance gains.
dfs.datanode.handler.count: Number of threads handling block requests. In case on multiple physical disks, the throughput can increase by increasing this number from default value of 3.
dfs.namenode.handler.count: Number of threads on Namenode. This number should be increase from default value of 10 for large clusters.

MapReduce

mapred.local.dir: is the list of directories where intermediate data and working files are store by the tasks. These should be a number of local disks to facilitate parallel IO. Also, these partitions should the same which are used by datanodes to store data (dfs.data.dir)
mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum: specifies the maximum number of map and reduce tasks that can be run at the same time. It should be a multiple of number of cores.
mapred.job.tracker.handler.count: Number of server threads for handling tasktracker request. Default value is 10 and the recommended value is 4% of the tasktracker nodes.
mapred.child.java.opts: increase the JVM heap memory of tasks that require more memory.

Others

core-site.xml::io.file.buffer.size: This is buffer size used by hadoop during IO which default to 4KB. On modern system it can be increased to 64KB or 128KB for performance gains.