BIG DATA for .NET Devs: HDInsight, Writing Hadoop Map Reduce Jobs In C# And Querying Results Back Using LINQ

Azure HDInsight is a 100% Apache Hadoop implementation on top of the Microsoft Azure cloud ecosystem.

In this post, we’ll explore:

  1. HDInsight/Hadoop on Azure in general, and the steps for getting started
  2. Writing Map Reduce jobs for Hadoop in C# that store their results in HDFS
  3. Transferring the result data from HDFS to Hive
  4. Reading the data back from Hive using C# and LINQ


If you are new to Hadoop and Big Data concepts, I suggest you quickly check out

There are a couple of ways you can start with HDInsight.

Step 1: Setting up your instance locally on Windows

For development, I highly recommend installing the HDInsight developer version locally. You can find it right inside the Web Platform Installer.

Once HDInsight is installed locally, make sure all the Hadoop services are running.


Also, you may use the following links once your cluster is up and running.

Here is the HDInsight dashboard running locally.


And now you are set.

Step 2: Install the Map Reduce package via NuGet

Let us explore how to write a few Map Reduce jobs in C#. We’ll write a quick job to count namespaces from C# source files. Earlier, in a couple of posts related to Hadoop on Azure (Analyzing some ‘Big Data’ using C# and Extracting Top 500 MSDN Links from Stack Overflow), I showed how to use C# Map Reduce jobs with Hadoop Streaming to do some meaningful analytics. In this post, we’ll rewrite the mapper and reducer using the new .NET SDK, and apply them to a few code files (you can apply the same approach to any dataset).

The new .NET SDK for Hadoop makes it very easy to work with Hadoop from .NET, with more types for supporting Map Reduce jobs, for creating LINQ to Hive queries, and so on. The SDK also provides an easier way to create and submit your own Map Reduce jobs directly in C#, either to the local developer instance or to an Azure Hadoop cluster.

To start with, create a console project and install the Microsoft.Hadoop.Mapreduce package via NuGet.

Install-Package Microsoft.Hadoop.Mapreduce

This will add the required dependencies.

Step 3: Writing your Mapper and Reducer

The mapper will read its input from the HDFS file system, and the reducer will emit its output to HDFS. HDFS is Hadoop’s distributed file system, which replicates data blocks across nodes so that the data stays available when a node fails. Check out the Apache HDFS architecture guide for details.

With the Hadoop SDK, you can now inherit your Mapper from the MapperBase class, and your Reducer from the ReducerCombinerBase class. This is equivalent to the independent Mapper and Reducer exes I demonstrated earlier using Hadoop streaming; we’ve just got a better way of doing the same thing. In the Map method, we extract the namespace declarations using a regex and emit each one (see the Hadoop streaming details in my previous article).

    public class NamespaceMapper : MapperBase
    {
        //Override the map method.
        public override void Map(string inputLine, MapperContext context)
        {
            //Extract the namespace declarations in the Csharp files
            var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*\;");
            var matches = reg.Matches(inputLine);

            foreach (Match match in matches)
            {
                //Just emit the namespaces.
                context.EmitKeyValue(match.Value, "1");
            }
        }
    }

    public class NamespaceReducer : ReducerCombinerBase
    {
        //Accepts each key and counts the occurrences
        public override void Reduce(string key, IEnumerable<string> values,
            ReducerCombinerContext context)
        {
            //Write back
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }
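Before submitting anything to a cluster, it can help to check the map and reduce logic locally. Here is a minimal sketch of that idea, not part of the SDK: it uses only the base class library, the LocalNamespaceCount class and its Map and Reduce methods are hypothetical names I made up for illustration, and LINQ's GroupBy stands in for the grouping Hadoop performs between the map and reduce phases. The pattern mirrors the one in the mapper, written with an A-Za-z character range.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Local dry run of the namespace count, with no Hadoop involved.
// LocalNamespaceCount and its members are hypothetical names for this sketch.
public static class LocalNamespaceCount
{
    // "using", one whitespace character, a dotted name, then a semicolon.
    private static readonly Regex UsingPattern =
        new Regex(@"using\s[A-Za-z0-9_\.]*;");

    // "Map" step: emit one (directive, "1") pair per match in a line.
    public static IEnumerable<KeyValuePair<string, string>> Map(string inputLine)
    {
        foreach (Match match in UsingPattern.Matches(inputLine))
        {
            yield return new KeyValuePair<string, string>(match.Value, "1");
        }
    }

    // "Reduce" step: one output line per key, in key<TAB>count form.
    public static string Reduce(string key, IEnumerable<string> values)
    {
        return key + "\t" + values.Count();
    }

    public static void Main()
    {
        var lines = new[]
        {
            "using System;",
            "using System.Linq;",
            "using System;",
            "namespace Demo { }"
        };

        // GroupBy stands in for Hadoop's shuffle, which groups values by key.
        var output = lines
            .SelectMany(line => Map(line))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .OrderBy(grouped => grouped.Key, StringComparer.Ordinal)
            .Select(grouped => Reduce(grouped.Key, grouped));

        foreach (var line in output)
        {
            Console.WriteLine(line);
        }
    }
}
```

Running this prints one line per distinct using directive. The line for using System; ends with a count of 2 because that directive appears twice in the sample input, and the namespace Demo line produces nothing because it does not match the pattern.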

Next, let us write a Map Reduce Job and configure the same.

Step 4: Writing your Namespace Counter Job

You can simply specify your Mapper and Reducer types and inherit from HadoopJob to create a job class. Here we go.

    //Our Namespace counter job
    public class NamespaceCounterJob : HadoopJob<NamespaceMapper, NamespaceReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.InputPath = "input/CodeFiles";
            config.OutputFolder = "output/CodeFiles";
            return config;
        }
    }

Note that we are overriding the Configure method to specify the configuration parameters. In this case, we are specifying the input and output folders for our mapper/reducer. The lines of the files in the input folder will be passed to our mapper instances, and the combined output from the reducer instances will be placed in the output folder.

Step 5: Submitting the job

Finally, we need to connect to the cluster and submit the job, using the ExecuteJob method. Here we go with the main driver.

    class Program
    {
        static void Main(string[] args)
        {
            var hadoop = Hadoop.Connect();
            var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();
        }
    }

We are invoking the ExecuteJob method using the NamespaceCounterJob type we just created. In this case, we are submitting the job locally. If you want to submit the job to an Azure HDInsight cluster for the actual execution scenario, you should pass the Azure connection parameters. Details here

Step 6: Executing the job

Before executing the job, you should prepare your input. In this case, copy the source code files into the input folder we specified in the configuration when creating our job (see NamespaceCounterJob). To do this, fire up the Hadoop command line console from the desktop. If your cluster is on Azure, you can log in remotely to the cluster head node by choosing Remote Login from the HDInsight Dashboard.

  • Create a folder using the hadoop fs -mkdir input/CodeFiles command
  • Copy a few C# files to your folder using hadoop fs -copyFromLocal your\path\*.cs input/CodeFiles

Here, I’m copying all my C# files under the BasicsRevisited folder to input/CodeFiles.


Now, build your project in Visual Studio, open the bin folder and execute your exe file (the name of my executable is simply MapReduce.exe). This will internally kick-start MRRunner.exe, and your Map Reduce job will get executed. You can see that the detected file dependencies are submitted automatically.


Once the Map Reduce job is completed, you’ll find the combined output in the output/CodeFiles folder. You can issue the hadoop fs -ls and -cat commands to list the files and view the content of the part-00000 file where the output is placed (yes, a little Linux knowledge will help at times). The part-00000 file contains the combined output of our task: the namespaces along with their counts from the files I submitted.
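If you capture that -cat output as text, a few lines of C# are enough to turn it into typed objects. This is a rough sketch under a couple of assumptions: each line is a key and a count separated by a tab (the default layout for Hadoop streaming output), and NamespaceCount and PartFileReader are hypothetical names I made up for illustration. The sample lines are stand-ins, not real job output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One parsed line of reducer output. Hypothetical type for this sketch.
public sealed class NamespaceCount
{
    public string Namespace { get; set; }
    public int Count { get; set; }
}

// Parses lines of the form key<TAB>count. Hypothetical helper for this sketch.
public static class PartFileReader
{
    public static IEnumerable<NamespaceCount> Parse(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            // Split on the last tab, so a key that contains a tab still parses.
            var tab = line.LastIndexOf('\t');
            if (tab < 0)
            {
                continue; // Skip blank or malformed lines.
            }

            int count;
            if (!int.TryParse(line.Substring(tab + 1).Trim(), out count))
            {
                continue; // Skip lines whose value is not a number.
            }

            yield return new NamespaceCount
            {
                Namespace = line.Substring(0, tab),
                Count = count
            };
        }
    }

    public static void Main()
    {
        // Sample lines standing in for the text printed by "hadoop fs -cat".
        var sample = new[] { "using System.Linq;\t1", "using System;\t2" };

        foreach (var item in Parse(sample).OrderByDescending(n => n.Count))
        {
            Console.WriteLine("{0,5}  {1}", item.Count, item.Namespace);
        }
    }
}
```

The parser skips anything it cannot read instead of throwing, which keeps a stray blank line from failing the whole run. The results are printed sorted by count, highest first.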


Step 7: Loading data from HDFS to Hive

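As a next step, let us load the data from HDFS into Hive so that we can query it. We'll create a table using Hive's CREATE TABLE syntax and then load the data. Run the ‘hive’ command from the Hadoop command line to execute the following statements. Because the job output separates each key and count with a tab, the table is declared with a tab field delimiter (Hive's default field delimiter is Ctrl-A).

CREATE TABLE nstable (
  namespace STRING,
  count INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH 'output/CodeFiles/part-00000' INTO TABLE nstable;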

And here is what you might see.


Now, you can read the data from Hive.

And there you go. Now you know how to write your own Hadoop Map Reduce jobs in C#, load the data into Hive, and query it back in C# to visualize your data. Happy coding.
