Friday, July 26, 2013

How to increase the open files limit in Linux. Fix for the java.io.FileNotFoundException (Too many open files) error

I was configuring a new build on a Linux machine and it kept failing with the error:

java.io.FileNotFoundException: /home/build/jenkins/workspace/OnPrem.AA-OnPrem.6.0.2.1-SP3-P4-HF50/configurations/target/classes/log4j.properties (Too many open files)

This error occurred because our build was trying to use more open file descriptors than the limit allows.
By default on our RHEL 5.8 machine, the open files hard limit was 1024. It can be checked with the command 
# ulimit -a|grep open      
open files (-n) 1024

or
# ulimit -Hn     
1024

# ulimit -Sn    
1024
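
Once that 1024 limit is exhausted, any further attempt to open a file fails with the same exception. Just for illustration, here is a minimal, hypothetical Java sketch (not our actual build code) that reproduces it by opening streams without ever closing them:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical repro: keep opening the same file without closing it.
// Once the process hits its open files limit, the constructor throws
// java.io.FileNotFoundException: ... (Too many open files)
public class TooManyOpenFiles {
    public static void main(String[] args) throws FileNotFoundException {
        List<FileInputStream> leaked = new ArrayList<>();
        for (int i = 0; ; i++) {
            System.out.println("open streams: " + i);
            leaked.add(new FileInputStream("/etc/hosts")); // never closed
        }
    }
}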

Solution: Increase the open files limit.  
Step 1: vi /etc/security/limits.conf  
Step 2: Add the below line at the end of the file              
*                -       nofile          4096              
# End of file              

This increases the open files limit to 4096 for all users; the "-" in the type field applies it to both the soft and hard limits. 

Step 3: Verify it by running the below command in a new shell or session.
  # ulimit -a|grep open 

How to observe that a running process is approaching the open files limit
Step 1: Start the build/command. In my case I started the build from Jenkins.
Step 2: Find the process ID (PID) of the running task and note it down. In my case I found it with            ps -aef|grep java               
Step 3: Then check the number of open file descriptors under the /proc/<PID>/fd directory            
  ls /proc/7455/fd|wc -l            
As the build progressed, I noticed the number of open files kept increasing from 20 to 1042, and the build immediately failed with the above-mentioned error.
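
If you prefer to watch this from code rather than the shell, the same /proc/<PID>/fd directory can be polled from a small Java program. This is just a sketch (it assumes a Linux /proc filesystem and takes the PID as an argument):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical watcher: polls /proc/<PID>/fd and prints how many file
// descriptors the process currently has open, until the process exits.
public class FdWatcher {
    public static void main(String[] args) throws Exception {
        String pid = args[0];                  // e.g. 7455 from ps -aef|grep java
        Path fdDir = Paths.get("/proc", pid, "fd");
        while (Files.isDirectory(fdDir)) {
            try (Stream<Path> fds = Files.list(fdDir)) {
                System.out.println("open file descriptors: " + fds.count());
            }
            Thread.sleep(5000);                // poll every 5 seconds
        }
    }
}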






Friday, July 5, 2013

What is Apache Hadoop?

Newbies can get a clean and simple introduction to Hadoop from the following Pivotal blog posts:

1) Demystifying Apache Hadoop in 5 Pictures
2) Hadoop 101: Programming MapReduce with Native Libraries, Hive, Pig, and Cascading
3) 20+ Examples of Getting Results with Big Data

Highlights

  • Hadoop was developed to assist in big data analysis
  • Hadoop uses distributed computing to process large data sets in a short time frame
  • Hadoop divides and distributes work across a large number of computers, spreading the data-processing logic across tens, hundreds, or even thousands of commodity servers

Hadoop's main components

1) Hadoop Distributed File System (HDFS): helps us split the data, put it on different nodes, replicate it, and manage it.

2) MapReduce: processes the data on each node in parallel and calculates the results of the job (the WordCount sketch after this list shows a concrete map and reduce)
     a) Map: performs computation on the local data set on each node and outputs a list of key-value pairs
     b) Reduce: the output from the map step is sent to other nodes as input for the reduce step. Before reduce runs, the key-value pairs are typically sorted and shuffled. The reduce phase then aggregates (for example, sums) each key's list of values into a single entry

3) Managing the Hadoop jobs: in Hadoop the entire process is called a job.
    a) The job tracker divides the job into tasks and schedules the tasks to run on the nodes. It keeps track of the participating nodes, monitors the processes, orchestrates data flow, and handles failures. 
    b) Task trackers run the tasks and report to the job tracker. 
  With this layer of management automation, Hadoop can automatically distribute jobs across a large number of nodes in parallel and scale when more nodes are added.
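
To make the map and reduce steps concrete, here is the classic WordCount example written against the native Hadoop Java API, as a minimal sketch (the class names are illustrative and the job-submission boilerplate is omitted):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: runs against the local data split on each node and emits (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the sort/shuffle, sums the 1s for each word into a single count
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}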

Hadoop programming

There are 4 coding approaches:
1) Native Hadoop library: achieves the greatest performance and gives the most fine-grained control
2) Pig: similar to SQL, but procedural rather than declarative
3) Hive: started by Facebook; it provides a more SQL-like interface and is considered one of the slower ways to program Hadoop
4) Cascading: a set of .jars that define data-processing APIs, integration APIs, as well as a process planner and scheduler