Monday, July 21, 2014

Pig script for finding max temperature


  Finding the max temperature using Pig script


The input data set can be obtained here :
   https://drive.google.com/file/d/0BwiqVGNpnBVIbDZ6Q1V1RThxYXc/edit?usp=sharing

My Hadoop path is          :  /usr/local/hadoop/
Hadoop user's home dir is  :  /home/hduser



Steps :
  • Start Hadoop
                hduser@kaustuv-studio14:/home/kaustuv$ /usr/local/hadoop/bin/start-all.sh
  • Copy the input file to HDFS (use the -copyFromLocal command) and check that it exists there
                hduser@kaustuv-studio14:/usr/local/hadoop$ bin/hadoop dfs -ls /home/hduser/
                           ( This will list the weather.txt file )
  • Start the Pig grunt shell in MapReduce mode
                hduser@kaustuv-studio14:/home/kaustuv$ pig
  • Enter the following max-temperature Pig script
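-- Load every line of the weather file as a single string
A = load '/home/hduser/weather.txt' AS (f1: chararray);
-- Cut the year (characters 4-7) and the temperature (characters 38-42) out of each line
B = foreach A generate SUBSTRING(f1, 4, 8) AS (year: chararray), SUBSTRING(f1, 38,43) AS (temp: chararray) ;
-- Group the records by year
C = group B by $0;
-- Keep the largest temperature of each group
Max_temp = foreach C generate group, MAX(B.temp);
store Max_temp INTO 'MAX_Temp_Output' ;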
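
A note on the comparison: the script declares temp as chararray, so, as far as I know, MAX compares the values in string order rather than numeric order. For right-aligned, non-negative, fixed-width readings like the ones in this file the two orders agree, but they can diverge once negative values appear. The plain-Java sketch below illustrates the difference; the class and method names are hypothetical and are not part of Pig.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: compares a string-order maximum with a numeric
// maximum for fixed-width fields such as those cut out by SUBSTRING(f1, 38, 43).
class StringVsNumericMax {

    // Maximum by lexicographic (string) order.
    static String maxAsString(List<String> fields) {
        return Collections.max(fields);
    }

    // Maximum by numeric value; Float.parseFloat ignores surrounding whitespace.
    static float maxAsNumber(List<String> fields) {
        float max = Float.NEGATIVE_INFINITY;
        for (String field : fields) {
            max = Math.max(max, Float.parseFloat(field));
        }
        return max;
    }

    public static void main(String[] args) {
        // Made-up five-character fields, right-aligned like the weather file.
        List<String> positive = Arrays.asList(" 93.8", "106.2", " 58.0");
        List<String> withNegative = Arrays.asList("-10.0", " -5.0");

        // Both orders pick 106.2 here.
        System.out.println(maxAsString(positive) + " / " + maxAsNumber(positive));
        // String order picks "-10.0", numeric order picks -5.0.
        System.out.println(maxAsString(withNegative) + " / " + maxAsNumber(withNegative));
    }
}
```

If negative temperatures matter for your data, casting the field to a numeric type before taking the maximum avoids the problem.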


Internally, the Pig script is converted into a MapReduce program. The progress of this MR program can also be checked through the web interfaces of the NameNode and the JobTracker.

 The output will be stored in the MAX_Temp_Output folder inside the user's HDFS home directory, here '/user/hduser'.

  • The output can be verified using the 'cat' command
hduser@kaustuv-studio14:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/MAX_Temp_Output/part-r-00000

This will print :

Warning: $HADOOP_HOME is deprecated.

1941    106.2
1942    183.9
1943    176.7
1944    156.2
1945    130.6
1946    152.3
1947    191.1
1948    175.9
1949    181.1
1950    208.8
1951    168.8
1952    122.6
1953    126.5
1954    232.3
1955    130.2
1956    114.6
1957    187.7
1958    184.5
1959    229.9
1960    204.7
1961    173.8
1962    130.8
1963    187.9
1964    144.3
1965    186.1
1966    155.9
1967    173.8
1968     93.8
1969    146.4
1970    181.4
1971    136.4
1972    128.5
1973    119.9
1974    203.2
1975    132.3
1976    157.8
1977    150.6
1978    140.0
1979    158.9
1980    119.3
1981    217.2
1982    141.2
1983    122.1
1984    154.2
1985    146.0
1986    187.9
1987    219.2
1988    164.0
1989    120.2
1990    118.0
1991    142.4
1992    149.3
1993    190.6
1994    157.4
1995    145.6
1996    107.2
1997    219.0
1998    125.2
1999    143.0
2000    195.0
2001    147.4
2002    180.2
2003    111.4
2004    168.8
2005    194.4
2006    153.8
2007    155.2
2008    140.0
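
The part file holds one year and its maximum per line, separated by whitespace. As a follow-up, the hottest year overall can be picked out of such a listing with a few lines of plain Java. This is only a sketch: the class name HottestYear is hypothetical, and it assumes every data line consists of exactly a year and a number, skipping anything else (such as the deprecation warning).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: scans "year  temperature" lines and returns the year
// with the largest temperature, or null if no usable line was found.
class HottestYear {

    static String hottest(List<String> lines) {
        String bestYear = null;
        float best = Float.NEGATIVE_INFINITY;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // not a "year temperature" line (blank line, warning, ...)
            }
            float temp;
            try {
                temp = Float.parseFloat(parts[1]);
            } catch (NumberFormatException e) {
                continue; // second field is not a number
            }
            if (temp > best) {
                best = temp;
                bestYear = parts[0];
            }
        }
        return bestYear;
    }

    public static void main(String[] args) {
        // A few lines copied from the listing above.
        List<String> lines = Arrays.asList(
                "Warning: $HADOOP_HOME is deprecated.",
                "",
                "1953    126.5",
                "1954    232.3",
                "1955    130.2");
        System.out.println(hottest(lines)); // prints 1954
    }
}
```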

 



 

Sunday, June 29, 2014

Steps to install Pig in Ubuntu


1. Download tar file from this link :
( Check that the Pig version supports the installed Hadoop version. )
e.g. my Ubuntu machine has Hadoop 1.2.1 installed, and I installed Pig 0.12.1 on it.
Hadoop is installed under : /usr/local/hadoop
The home directory of hduser (hadoop group user) is : /home/hduser

2. Open a terminal and log in as hduser
                su hduser

3. Unpack the downloaded tar file using the command
                tar xzf  pig-0.12.1.tar.gz
    When run from /home/hduser, this unpacks Pig into /home/hduser/pig-0.12.1

4. Edit the .bashrc file
               gedit ~/.bashrc

Append the following lines
export PIG_HOME=/home/hduser/pig-0.12.1 
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=/usr/local/hadoop/conf

5. In the terminal type
               pig
This will open grunt (the interactive shell for running Pig commands) in MapReduce mode.
(* Before starting Pig, make sure you have started Hadoop :
 /usr/local/hadoop$  bin/start-all.sh    )



 Learning goes on forever :)
 

Wednesday, June 18, 2014

MapReduce MaxTemperature program execution in standalone mode using Eclipse

 

MapReduce program execution in standalone mode ( Eclipse )  

Finding the maximum temperature of each year


Sample weather data rows are shown below :
Ignoring the leading '+', the first column is the year and the sixth is the temperature (Fahrenheit)
+   1942   1    5.8     2.1    ---    114.0    58.0 
+   1942   5   14.0     6.9    ---    101.1   215.1


  A MapReduce program consists of :
  • Map Class  : Map Function
  • Reduce Class : Reduce Function
  • Driver Class  : Main Method
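
To make this division of labour concrete, the sketch below imitates the three parts in plain Java, without any Hadoop classes: a map step that turns each line into a (year, temperature) pair, a grouping step standing in for the shuffle that Hadoop performs between map and reduce, and a reduce step that keeps the maximum per year. The class LocalMaxTemp and its methods are hypothetical stand-ins, not the Hadoop API, and the 1943 row is made up to have a second key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map -> shuffle -> reduce flow.
class LocalMaxTemp {

    // "Map": one input line becomes one (year, temperature) pair.
    static Map.Entry<String, Float> map(String line) {
        String year = line.substring(4, 8);
        float temp = Float.parseFloat(line.substring(38, 43));
        return new SimpleEntry<String, Float>(year, temp);
    }

    // "Shuffle": collect all values that share a key, sorted by key.
    static Map<String, List<Float>> shuffle(List<Map.Entry<String, Float>> pairs) {
        Map<String, List<Float>> grouped = new TreeMap<String, List<Float>>();
        for (Map.Entry<String, Float> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Float>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce": all temperatures of one year become a single maximum.
    static float reduce(List<Float> values) {
        float max = Float.NEGATIVE_INFINITY;
        for (float value : values) {
            max = Math.max(max, value);
        }
        return max;
    }

    // "Driver": wires the steps together.
    static Map<String, Float> run(List<String> lines) {
        List<Map.Entry<String, Float>> pairs = new ArrayList<Map.Entry<String, Float>>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, Float> result = new TreeMap<String, Float>();
        for (Map.Entry<String, List<Float>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "+   1942   1    5.8     2.1    ---    114.0    58.0",
                "+   1942   5   14.0     6.9    ---    101.1   215.1",
                "+   1943   1    5.8     2.1    ---    120.5    58.0"); // made-up row
        System.out.println(run(lines)); // prints {1942=114.0, 1943=120.5}
    }
}
```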


Map Function  :
Takes a text file as input (TextInputFormat) and,
for each line of the input file, emits the year and temperature as the output key/value pair.

I/P key   --> byte offset ( type : LongWritable )
I/P value --> line ( type : Text )
O/P key   --> year ( type : Text )
O/P value --> temperature ( type : FloatWritable )

 Map Class :
public static class Map extends Mapper<LongWritable, Text, Text, FloatWritable>
    {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String line = value.toString();
            // Characters 4-7 hold the four-digit year (the end index is exclusive).
            String year = line.substring(4, 8);
            // Characters 38-42 hold the temperature; parseFloat ignores surrounding spaces.
            float temp = Float.parseFloat(line.substring(38, 43));

            context.write(new Text(year), new FloatWritable(temp));
        }
    }
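
The substring offsets assume that every line follows the fixed-width layout of the sample rows. A short or malformed line would make substring or parseFloat throw an exception and fail the task. The sketch below shows one defensive way to extract the two fields in plain Java; the class WeatherLine and the choice to return null for unusable lines are assumptions for illustration, not something the program above does.

```java
// Hypothetical value class for one parsed weather record.
class WeatherLine {
    final String year;
    final float temp;

    WeatherLine(String year, float temp) {
        this.year = year;
        this.temp = temp;
    }

    // Returns the parsed record, or null if the line is too short
    // or the temperature field is not a number.
    static WeatherLine parse(String line) {
        if (line == null || line.length() < 43) {
            return null; // the temperature field would be out of range
        }
        String year = line.substring(4, 8).trim();
        try {
            float temp = Float.parseFloat(line.substring(38, 43).trim());
            return new WeatherLine(year, temp);
        } catch (NumberFormatException e) {
            return null; // e.g. a placeholder such as "---" in the field
        }
    }

    public static void main(String[] args) {
        WeatherLine ok = parse("+   1942   1    5.8     2.1    ---    114.0    58.0");
        System.out.println(ok.year + " " + ok.temp); // prints 1942 114.0
        System.out.println(parse("too short"));      // prints null
    }
}
```

Inside a mapper, a null result could simply be skipped instead of written to the context.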


Reduce function  :
Takes the Map output as input and emits the maximum temperature of every year as a year/temperature (key/value) pair.

I/P key   --> year ( type : Text )
I/P value --> temperature list ( type : Iterable of FloatWritable )
O/P key   --> year ( type : Text )
O/P value --> temperature ( type : FloatWritable )



Reduce Class:

public static class Reduce extends Reducer<Text, FloatWritable, Text, FloatWritable>
    {
        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException
        {
            // Start below every possible reading. Float.MIN_VALUE is the smallest
            // positive float, so it would give a wrong result for an all-negative year.
            float maxValue = Float.NEGATIVE_INFINITY;
            for (FloatWritable value : values)
            {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new FloatWritable(maxValue));
        }
    }
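
When seeding a running maximum, note that Float.MIN_VALUE is the smallest positive float (about 1.4e-45), not the most negative one. A seed of Float.NEGATIVE_INFINITY (or -Float.MAX_VALUE) stays correct even when every reading of a year is below zero. The standalone sketch below, with a hypothetical class name and made-up readings, shows the difference.

```java
// Hypothetical demo of how the seed value affects a running maximum.
class MaxSeedDemo {

    // Same loop as in a max-reducer, with the starting value passed in.
    static float maxWithSeed(float seed, float[] values) {
        float max = seed;
        for (float value : values) {
            max = Math.max(max, value);
        }
        return max;
    }

    public static void main(String[] args) {
        float[] allNegative = {-12.5f, -3.0f, -7.25f}; // made-up readings

        // The tiny positive seed beats every negative reading: prints 1.4E-45
        System.out.println(maxWithSeed(Float.MIN_VALUE, allNegative));
        // The true maximum: prints -3.0
        System.out.println(maxWithSeed(Float.NEGATIVE_INFINITY, allNegative));
    }
}
```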
 

Driver ( Main method) :

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "MaxTemp");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setJarByClass(MaxTemp.class);
        job.waitForCompletion(true);
    }


Execution Steps in Eclipse :


Step 1 :
Open Eclipse : File -> New ->  Java Project
Enter project name : MaxTemp
Press Finish

Step 2 :
In the Package Explorer window,
right-click on MaxTemp : New -> Class
Enter MaxTemp under Name and check "public static void main(String[] args)"
Press Finish

Step 3 :
Paste the program code below into MaxTemp.java

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;


public class MaxTemp
{

    public static class Map extends Mapper<LongWritable, Text, Text, FloatWritable>
    {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
        {
            String line = value.toString();
            // Characters 4-7 hold the four-digit year (the end index is exclusive).
            String year = line.substring(4, 8);
            // Characters 38-42 hold the temperature; parseFloat ignores surrounding spaces.
            float temp = Float.parseFloat(line.substring(38, 43));

            context.write(new Text(year), new FloatWritable(temp));
        }
    }

    public static class Reduce extends Reducer<Text, FloatWritable, Text, FloatWritable>
    {
        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException
        {
            // Start below every possible reading. Float.MIN_VALUE is the smallest
            // positive float, so it would give a wrong result for an all-negative year.
            float maxValue = Float.NEGATIVE_INFINITY;
            for (FloatWritable value : values)
            {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new FloatWritable(maxValue));
        }
    }

   
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "MaxTemp");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setJarByClass(MaxTemp.class);
        job.waitForCompletion(true);
    }
}





Step 4 :
Package Explorer --> MaxTemp-->Build Path --> Configure Build Path --> Java Build Path --> libraries -> Add Library

Add following libraries and press OK
 


Step 5 :
Under Package Explorer
 MaxTemp -> Run As -> Run Configurations

Under the Main tab set
Name : MaxTemp
Project:  MaxTemp
Main Class : MaxTemp

Under the Arguments tab set
Program Arguments :
input output

Apply -> Close


Step 6 :
Create a new folder named input under the MaxTemp directory in your workspace folder ( in my case it is home/workspace/MaxTemp )

Step 7 :
Copy the sample data ( link provided below ) and store it as a text file under the input folder.

Step 8 :
Run the program and check the result under the output folder ( ../MaxTemp/output ).
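
If the program is run a second time, Hadoop's FileOutputFormat refuses to write into an output directory that already exists, so the output folder of the previous run has to be removed first. In standalone mode that folder is an ordinary local directory, and it can be deleted by hand or with a small helper such as the one sketched below. The class name OutputCleaner is hypothetical, and the code uses only java.nio.file, not the Hadoop API.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper: removes a local directory tree, e.g. a stale job output folder.
class OutputCleaner {

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return; // nothing to clean up
        }
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path visited, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(visited); // the directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: OutputCleaner <directory to delete>");
            return;
        }
        deleteRecursively(Paths.get(args[0]));
    }
}
```

Called with the output folder as its argument before Step 8, this leaves the job free to create the folder again.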



Sample weather data can be downloaded from the link below :

https://drive.google.com/file/d/0BwiqVGNpnBVIbDZ6Q1V1RThxYXc/edit?usp=sharing


The project can be obtained from my GitHub directory :
https://github.com/kaustuvkunal/Bigdata/tree/master/MaxTemp


Happy Coding..