BIG DATA for .NET Devs: HDInsight, Writing Hadoop Map Reduce Jobs In C# And Querying Results Back Using LINQ


ANOOP MADHUSUDANAN


Azure HDInsight Service is a 100% Apache Hadoop implementation on top of the Microsoft Azure cloud ecosystem.

In this post, we’ll explore:

  1. HDInsight/Hadoop on Azure in general, and the steps for getting started with it
  2. Writing Map Reduce jobs for Hadoop using C#, and storing the results in HDFS
  3. Transferring the result data from HDFS to Hive
  4. Reading the data back from Hive using C# and LINQ

Preface

If you are new to Hadoop and Big Data concepts, I suggest you quickly check out an introduction to the basics first.

There are a couple of ways you can start with HDInsight.

Step 1: Setting up your instance locally on Windows

For development, I highly recommend installing the HDInsight developer version locally – you can find it right inside the Web Platform Installer.

Once you install HDInsight locally, ensure all the Hadoop services are running.

[Image: Hadoop services running in the local cluster]


Here is the HDInsight dashboard running locally.

[Image: HDInsight dashboard]

And now you are set.

Step 2: Install the Map Reduce package via NuGet

Let us explore how to write a few Map Reduce jobs in C#. We’ll write a quick job to count namespace declarations from C# source files. Earlier, in a couple of posts related to Hadoop on Azure – Analyzing some ‘Big Data’ using C# and Extracting Top 500 MSDN Links from Stack Overflow – I showed how to use C# Map Reduce jobs with Hadoop Streaming to do some meaningful analytics. In this post, we’ll re-write the mapper and reducer leveraging the new .NET SDK, and will apply the same to a few code files (you can apply it to any dataset).

The new .NET SDK for Hadoop makes it very easy to work with Hadoop from .NET – with more types for supporting Map Reduce jobs, for creating LINQ to Hive queries, etc. The SDK also provides an easier way to create and submit your own Map Reduce jobs directly in C#, either to the local developer instance or to an Azure Hadoop cluster.

To start with, create a console project and install the Microsoft.Hadoop.MapReduce package via NuGet.

Install-Package Microsoft.Hadoop.MapReduce

This will add the required dependencies.

Step 3: Writing your Mapper and Reducer

The mapper will read its input from the HDFS file system, and the reducer will emit its output back to HDFS. HDFS is Hadoop’s distributed file system, designed for fault-tolerant storage of large datasets. Check out the Apache HDFS architecture guide for details.

With the Hadoop SDK, you can now inherit your mapper from the MapperBase class and your reducer from the ReducerCombinerBase class. This is equivalent to the independent mapper and reducer exes I demonstrated earlier using Hadoop Streaming – we’ve just got a better way of doing the same. In the Map method, we simply extract the namespace declarations using a regex and emit them (see the Hadoop Streaming details in my previous article).

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using Microsoft.Hadoop.MapReduce;

    //Mapper
    public class NamespaceMapper : MapperBase
    {
        //Override the Map method to emit each namespace declaration with a count of 1
        public override void Map(string inputLine, MapperContext context)
        {
            //Extract the namespace declarations in the C# files
            var reg = new Regex(@"(using)\s[A-Za-z0-9_.]*;");
            var matches = reg.Matches(inputLine);

            foreach (Match match in matches)
            {
                //Just emit the namespaces.
                context.EmitKeyValue(match.Value, "1");
            }
        }
    }

    //Reducer
    public class NamespaceReducer : ReducerCombinerBase
    {
        //Accepts each key and counts its occurrences
        public override void Reduce(string key, IEnumerable<string> values,
            ReducerCombinerContext context)
        {
            //Write back the key and the number of occurrences
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }

Next, let us write a Map Reduce Job and configure the same.

Step 4: Writing your Namespace Counter Job

You can simply specify your Mapper and Reducer types and inherit from HadoopJob to create a job class. Here we go.

    //Our Namespace counter job
    public class NamespaceCounterJob : HadoopJob<NamespaceMapper, NamespaceReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.InputPath = "input/CodeFiles";
            config.OutputFolder = "output/CodeFiles";
            return config;
        }
    }

Note that we are overriding the Configure method to specify the configuration parameters. In this case, we specify the input and output folders for our mapper/reducer – the lines of the files in the input folder will be passed to our mapper instances, and the combined output from the reducer instances will be placed in the output folder.

Step 5: Submitting the job

Finally, we need to connect to the cluster and submit the job, using the ExecuteJob method. Here we go with the main driver.

    class Program
    {
        static void Main(string[] args)
        {
            //Connect to the local cluster and execute our job
            var hadoop = Hadoop.Connect();
            var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();
        }
    }

We are invoking the ExecuteJob method using the NamespaceCounterJob type we just created. In this case, we are submitting the job locally – if you want to submit the job to an Azure HDInsight cluster for an actual execution scenario, you should pass the Azure connection parameters to Hadoop.Connect.
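For instance, here is a rough sketch of what the remote connection looks like – note that the exact Connect overload and parameter order may vary with the SDK version, and every cluster/storage value below is a placeholder for your own:

    //Sketch only - verify the Connect overload against your SDK version.
    //All cluster and storage values below are placeholders.
    var hadoop = Hadoop.Connect(
        new Uri("https://yourcluster.azurehdinsight.net"), //cluster URI
        "clusterUser",                        //cluster login
        "hadoopUser",                         //hadoop user to run the job as
        "password",                           //cluster password
        "yourstorage.blob.core.windows.net",  //Azure storage account
        "yourStorageKey",                     //storage account key
        "yourContainer",                      //default blob container
        false);                               //create container if it doesn't exist
    var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();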

Step 6: Executing the job

Before executing the job, you should prepare your input – in this case, you should copy the source code files to the input folder we provided as part of the configuration while creating our job (see NamespaceCounterJob). To do this, fire up the Hadoop command line console from the desktop. If your cluster is on Azure, you can remote login to the cluster head node by choosing Remote Login from the HDInsight dashboard.

  • Create a folder using the hadoop fs -mkdir input/CodeFiles command
  • Copy a few C# files to your folder using hadoop fs -copyFromLocal your\path\*.cs input/CodeFiles

Here, I’m copying all my CS files under the BasicsRevisited folder to input/CodeFiles.

[Image: copying .cs files to input/CodeFiles via hadoop fs]

Now, build your project in Visual Studio, open the bin folder, and execute your exe file. This will internally kick off MRRunner.exe, and your map reduce job will get executed (the name of my executable is simply MapReduce.exe). You can see that the detected file dependencies are automatically submitted.

[Image: console output from the job submission, showing the detected file dependencies]

Once the Map Reduce job is completed, you’ll find the combined output placed in the output/CodeFiles folder. You can issue the -ls and -cat commands to list the files and view the content of the part-00000 file where the output is placed (yes, a little Linux knowledge will help at times). The part-00000 file contains the combined output of our task – the namespaces along with their counts from the files I submitted.
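For example, with the output folder we configured in NamespaceCounterJob:

hadoop fs -ls output/CodeFiles
hadoop fs -cat output/CodeFiles/part-00000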

[Image: part-00000 output listing namespaces and their counts]

Step 7: Loading data from HDFS to Hive

As a next step, let us load the data from HDFS into Hadoop Hive so that we can query the same. We’ll create a table using the CREATE TABLE Hive syntax, and will load the data into it. You can run the ‘hive’ command from the Hadoop command line to execute the following statements.

CREATE TABLE nstable (
  namespace STRING,
  count INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH 'output/CodeFiles/part-00000' INTO TABLE nstable;

And here is what you might see.

[Image: hive console output from the CREATE TABLE and LOAD DATA statements]

Now, you can read the data back from Hive.
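As a quick sketch, here is how a LINQ to Hive query could look from C#, using the Microsoft.Hadoop.Hive package from the same SDK. Treat the type and member names below (HiveRow, HiveConnection, HiveTable<T>, GetTable) and the endpoint/credentials as illustrative assumptions – they may differ in your SDK version:

    using System;
    using System.Linq;
    using Microsoft.Hadoop.Hive;

    //Sketch only - verify the type names and connection parameters
    //against your version of Microsoft.Hadoop.Hive.

    //Represents a row of our nstable Hive table.
    public class NamespaceCount : HiveRow
    {
        public string Namespace { get; set; }
        public int Count { get; set; }
    }

    //Exposes the Hive table for LINQ to Hive queries.
    public class NamespaceDb : HiveConnection
    {
        public NamespaceDb(Uri webHcatUri, string userName, string password)
            : base(webHcatUri, userName, password) { }

        public HiveTable<NamespaceCount> NsTable
        {
            get { return this.GetTable<NamespaceCount>("nstable"); }
        }
    }

    class HiveReader
    {
        static void Main(string[] args)
        {
            //Placeholder endpoint and credentials - replace with your own.
            var db = new NamespaceDb(new Uri("http://localhost:50111"), "hadoop", "password");

            //Namespaces used more than once, most frequent first.
            var top = db.NsTable
                        .Where(n => n.Count > 1)
                        .OrderByDescending(n => n.Count)
                        .ToList();

            foreach (var n in top)
                Console.WriteLine(n.Namespace + "\t" + n.Count);
        }
    }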

And there you go. Now you know everything about writing your own Hadoop Map Reduce jobs in C#, loading the data into Hive, and querying the same back in C# to visualize your data. Happy Coding.

© 2012. All Rights Reserved. Amazedsaint.com