
BIG DATA for .NET Devs: HDInsight, Writing Hadoop Map Reduce Jobs In C# And Querying Results Back Using LINQ

Azure HDInsight Service is a 100% Apache Hadoop implementation on top of the Microsoft Azure cloud ecosystem.

In this post, we’ll explore

  1. HDInsight/Hadoop on Azure in general, and the steps to get started with it
  2. Writing Map Reduce jobs for Hadoop using C#, storing the results in HDFS
  3. Transferring the result data from HDFS to Hive
  4. Reading the data back from Hive using C# and LINQ


If you are new to Hadoop and Big Data concepts, I suggest you quickly brush up on the basics first.

There are a couple of ways you can start with HDInsight.

Step 1: Setting up your instance locally on Windows

For development, I highly recommend installing the HDInsight developer version locally – you can find it right inside the Web Platform Installer.

Once you install HDInsight locally, ensure all the Hadoop services are running.


Also, you may use the following links once your cluster is up and running.

Here is the HDInsight dashboard running locally.


And now you are set.

Step 2: Install the Map Reduce package via NuGet

Let us explore how to write a few Map Reduce jobs in C#. We’ll write a quick job to count namespaces from C# source files. Earlier, in a couple of posts related to Hadoop on Azure – Analyzing some ‘Big Data’ using C# and Extracting Top 500 MSDN Links from Stack Overflow – I showed how to use C# Map Reduce jobs with Hadoop Streaming to do some meaningful analytics. In this post, we’ll re-write the mapper and reducer leveraging the new .NET SDK available, and apply them to a few code files (you can apply the same to any dataset).

The new .NET SDK for Hadoop makes it very easy to work with Hadoop from .NET – with more types for supporting Map Reduce jobs, for creating LINQ to Hive queries, etc. The SDK also provides an easier way to create and submit your own Map Reduce jobs directly in C#, either to the local developer instance or to an Azure Hadoop cluster.

To start with, create a console project and install the Microsoft.Hadoop.Mapreduce package via NuGet.

Install-Package Microsoft.Hadoop.Mapreduce

This will add the required dependencies.

Step 3: Writing your Mapper and Reducer

The mapper will read its input from the HDFS file system, and the reducer will emit its output back to HDFS. HDFS is Hadoop’s distributed file system, which guarantees high availability. Check out the Apache HDFS architecture guide for details.

With the Hadoop SDK, you can now inherit your mapper from the MapperBase class, and your reducer from the ReducerCombinerBase class. This is equivalent to the independent mapper and reducer exes I demonstrated earlier using Hadoop Streaming, just that we’ve got a better way of doing the same. In the Map method, we simply extract the namespace declarations using a regex and emit them (see the Hadoop Streaming details in my previous article).

    public class NamespaceMapper : MapperBase
    {
        //Override the map method.
        public override void Map(string inputLine, MapperContext context)
        {
            //Extract the namespace declarations in the C# files
            var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*\;");
            var matches = reg.Matches(inputLine);

            foreach (Match match in matches)
            {
                //Just emit the namespaces.
                context.EmitKeyValue(match.Value, "1");
            }
        }
    }

    public class NamespaceReducer : ReducerCombinerBase
    {
        //Accepts each key and counts the occurrences
        public override void Reduce(string key, IEnumerable<string> values,
            ReducerCombinerContext context)
        {
            //Write back the key and the number of occurrences
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }
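To sanity-check the mapper’s regex outside Hadoop, you can run the same pattern over a sample line in a plain console app. The `RegexCheck` helper below is just for illustration, not part of the SDK:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class RegexCheck
{
    // The same pattern the mapper uses to pick up namespace imports.
    public static string[] ExtractNamespaces(string line) =>
        new Regex(@"(using)\s[A-Za-z0-9_\.]*\;")
            .Matches(line).Cast<Match>()
            .Select(m => m.Value)
            .ToArray();

    static void Main()
    {
        var found = ExtractNamespaces("using System.Text; using System.Linq;");
        foreach (var ns in found)
            Console.WriteLine(ns);   // each of these is emitted as a key by the mapper
    }
}
```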

Next, let us write a Map Reduce Job and configure the same.

Step 4: Writing your Namespace Counter Job

You can simply specify your Mapper and Reducer types and inherit from HadoopJob to create a job class. Here we go.

    //Our Namespace counter job
    public class NamespaceCounterJob : HadoopJob<NamespaceMapper, NamespaceReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.InputPath = "input/CodeFiles";
            config.OutputFolder = "output/CodeFiles";
            return config;
        }
    }

Note that we are overriding the Configure method to specify the configuration parameters. In this case, we are specifying the input and output folders for our mapper/reducer – the lines of the files in the input folder will be passed to our mapper instances, and the combined output from the reducer instances will be placed in the output folder.

Step 5: Submitting the job

Finally, we need to connect to the cluster and submit the job, using the ExecuteJob method. Here we go with the main driver.

    class Program
    {
        static void Main(string[] args)
        {
            var hadoop = Hadoop.Connect();
            var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();
        }
    }

We are invoking the ExecuteJob method using the NamespaceCounterJob type we just created. In this case, we are submitting the job locally – if you want to submit the job to an Azure HDInsight cluster for an actual execution scenario, you should pass the Azure connection parameters. Details here
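For reference, submitting to an Azure cluster goes through an overload of Hadoop.Connect that takes the cluster and storage details. The parameter order below is a sketch from the preview SDK and may differ between versions, and all the values are placeholders:

```csharp
using System;
using Microsoft.Hadoop.MapReduce;

class AzureSubmit
{
    static void Main()
    {
        // Placeholder values – substitute your own cluster and storage details.
        // Parameter order is from the preview SDK and may vary by version.
        var hadoop = Hadoop.Connect(
            new Uri("https://yourcluster.azurehdinsight.net"), // cluster URI
            "clusterUser",                                     // cluster user name
            "hadoopUser",                                      // hadoop user
            "password",                                        // cluster password
            "yourstorage.blob.core.windows.net",               // storage account
            "storageAccountKey",                               // storage key
            "containerName",                                   // blob container
            true);                                             // create container if missing

        hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();
    }
}
```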

Step 6: Executing the job

Before executing the job, you should prepare your input – in this case, copy the source code files into the input folder we provided as part of the configuration while creating our job (see NamespaceCounterJob). To do this, fire up the Hadoop command line console from the desktop. If your cluster is on Azure, you can remote login to the cluster head node by choosing Remote Login from the HDInsight Dashboard.

  • Create a folder using the hadoop fs -mkdir input/CodeFiles command
  • Copy a few C# files to your folder using hadoop fs -copyFromLocal your\path\*.cs input/CodeFiles

See, I’m copying all my CS files under the BasicsRevisited folder to input/CodeFiles.


Now, build your project in Visual Studio, open the bin folder, and execute your exe file. This will internally kick-start MRRunner.exe and your Map Reduce job will get executed (the name of my executable is simply MapReduce.exe). You can see that the detected file dependencies are automatically submitted.


Once the Map Reduce job is completed, you’ll find the combined output placed in the output/CodeFiles folder. You can issue the -ls and -cat commands to list the files and view the contents of the part-00000 file, where the output is placed (yes, a little Linux knowledge helps at times). The part-00000 file contains the combined output of our task – the namespaces along with their counts from the files I submitted.
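Each line of part-00000 is a tab-separated namespace/count pair. Once you’ve copied it locally (hadoop fs -copyToLocal output/CodeFiles/part-00000 results.txt), a few lines of LINQ can pull out the top entries. This little reader is my own sketch, not part of the SDK; the file name results.txt is just a placeholder:

```csharp
using System;
using System.IO;
using System.Linq;

class TopNamespaces
{
    // Each line of part-00000 looks like "using System;<TAB>12".
    public static (string Ns, int Count) ParseLine(string line)
    {
        var parts = line.Split('\t');
        return (parts[0], int.Parse(parts[1]));
    }

    static void Main()
    {
        if (!File.Exists("results.txt")) return; // local copy of part-00000

        var top = File.ReadLines("results.txt")
                      .Select(ParseLine)
                      .OrderByDescending(p => p.Count)
                      .Take(10);

        foreach (var (ns, count) in top)
            Console.WriteLine("{0}\t{1}", ns, count);
    }
}
```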


Step 7: Loading data from HDFS to Hive

As a next step, let us load the data from HDFS into Hadoop Hive so that we can query it. We’ll create a table using the CREATE TABLE Hive syntax, and will load the data. You can run the ‘hive’ command from the Hadoop command line to execute the following statements. Note the ROW FORMAT clause – our Map Reduce output is tab-delimited, so we tell Hive to split columns on tabs.

CREATE TABLE nstable (
  namespace STRING,
  count INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA INPATH 'output/CodeFiles/part-00000' INTO TABLE nstable;
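To verify the load, you can run a quick query from the same hive prompt (table and column names as created above; `count` is backquoted since it collides with the built-in aggregate):

```sql
SELECT namespace, `count`
FROM nstable
ORDER BY `count` DESC
LIMIT 10;
```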

And here is what you might see.


Now, you can read the data back from Hive.
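The Hadoop .NET SDK also ships a LINQ to Hive package (Microsoft.Hadoop.Hive on NuGet). The sketch below shows the general shape of a query against our nstable; the package was in preview, so treat the type names, the Table<T> call, and the WebHCat port as assumptions that may differ in your SDK version:

```csharp
using System;
using System.Linq;
using Microsoft.Hadoop.Hive; // preview package – API surface may differ by version

// Hypothetical row type matching the nstable schema we created above.
public class NamespaceCount
{
    public string Namespace { get; set; }
    public int Count { get; set; }
}

class HiveReader
{
    static void Main()
    {
        // WebHCat/Templeton endpoint of the local one-box install (assumed port).
        var db = new HiveConnection(
            new Uri("http://localhost:50111"), "hadoop", null);

        // LINQ over the Hive table; the provider translates this to HiveQL.
        var top = db.Table<NamespaceCount>("nstable")
                    .OrderByDescending(n => n.Count)
                    .Take(10);

        foreach (var item in top)
            Console.WriteLine("{0}\t{1}", item.Namespace, item.Count);
    }
}
```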

And there you go. Now you know how to write your own Hadoop Map Reduce jobs in C#, load the data into Hive, and query it back in C# to visualize your data. Happy coding.
