
BIG DATA for .NET Devs: HDInsight, Writing Hadoop Map Reduce Jobs In C# And Querying Results Back Using LINQ

Azure HDInsight Service is a 100% Apache Hadoop implementation on top of the Microsoft Azure cloud ecosystem.

In this post, we’ll explore

  1. HDInsight/Hadoop on Azure in general, and the steps to get started with it
  2. Writing Map Reduce jobs for Hadoop using C#, storing the results in HDFS
  3. Transferring the result data from HDFS to Hive
  4. Reading the data back from Hive using C# and LINQ


If you are new to Hadoop and Big Data concepts, I suggest quickly brushing up on the basics first.

There are a couple of ways you can start with HDInsight.

Step 1: Setting up your instance locally on Windows

For development, I highly recommend installing the HDInsight developer version locally – you can find it right inside the Web Platform Installer.

Once you install HDInsight locally, ensure that all the Hadoop services are running.


Also, you may use the following links once your cluster is up and running.

Here is the HDInsight dashboard running locally.


And now you are set.

Step 2: Install the Map Reduce package via NuGet

Let us explore how to write a few Map Reduce jobs in C#. We’ll write a quick job to count namespace declarations in C# source files. Earlier, in a couple of posts related to Hadoop on Azure – Analyzing some ‘Big Data’ using C# and Extracting Top 500 MSDN Links from Stack Overflow – I showed how to use C# Map Reduce jobs with Hadoop Streaming to do some meaningful analytics. In this post, we’ll rewrite the mapper and reducer leveraging the new .NET SDK, and apply them to a few code files (you can apply the same to any dataset).

The new .NET SDK for Hadoop makes it very easy to work with Hadoop from .NET – with types for supporting Map Reduce jobs, creating LINQ to Hive queries, and so on. The SDK also provides an easier way to create and submit your own Map Reduce jobs directly in C#, either to the local developer instance or to an Azure Hadoop cluster.

To start with, create a console project and install the Microsoft.Hadoop.MapReduce package via NuGet.

Install-Package Microsoft.Hadoop.Mapreduce

This will add the required dependencies.

Step 3: Writing your Mapper and Reducer

The mapper will read its input from the HDFS file system, and the reducer will emit its output back to HDFS. HDFS is Hadoop’s distributed file system, which guarantees high availability. Check out the Apache HDFS architecture guide for details.

With the Hadoop SDK, you can now inherit your mapper from the MapperBase class, and your reducer from the ReducerCombinerBase class. This is equivalent to the independent mapper and reducer executables I demonstrated earlier using Hadoop Streaming – we’ve just got a better way of doing the same. In the Map method, we extract the namespace declarations using a regular expression and emit them (see the Hadoop Streaming details in my previous article).

    public class NamespaceMapper : MapperBase
    {
        //Override the map method.
        public override void Map(string inputLine, MapperContext context)
        {
            //Extract the namespace declarations in the C# files
            var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*\;");
            var matches = reg.Matches(inputLine);

            foreach (Match match in matches)
            {
                //Just emit each namespace with a count of 1.
                context.EmitKeyValue(match.Value, "1");
            }
        }
    }

    public class NamespaceReducer : ReducerCombinerBase
    {
        //Accepts each key and counts the occurrences
        public override void Reduce(string key, IEnumerable<string> values,
            ReducerCombinerContext context)
        {
            //Write back the key and the number of occurrences
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }

Next, let us write a Map Reduce Job and configure the same.

Step 4: Writing your Namespace Counter Job

You can simply specify your Mapper and Reducer types and inherit from HadoopJob to create a job class. Here we go.

    //Our Namespace counter job
    public class NamespaceCounterJob : HadoopJob<NamespaceMapper, NamespaceReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.InputPath = "input/CodeFiles";
            config.OutputFolder = "output/CodeFiles";
            return config;
        }
    }

Note that we are overriding the Configure method to specify the configuration parameters. In this case, we are specifying the input and output folders for our mapper/reducer – The lines in the files in input folder will be passed to our mapper instances, and the combined output from the reducer instances will be placed in the output folder.

Step 5: Submitting the job

Finally, we need to connect to the cluster and submit the job, using the ExecuteJob method. Here we go with the main driver.

    class Program
    {
        static void Main(string[] args)
        {
            var hadoop = Hadoop.Connect();
            var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();
        }
    }

We are invoking the ExecuteJob method using the NamespaceCounterJob type we just created. In this case, we are submitting the job locally – if you want to submit the job to an Azure HDInsight cluster for the actual execution scenario, you should pass the Azure connection parameters. Details here
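For the Azure scenario, the Connect method takes the cluster and storage details instead of being parameterless. Here is a rough sketch – every connection value below is a placeholder, and the exact parameter order of this overload may differ between SDK versions, so verify it against the package you install:

```csharp
// Sketch only: submitting the same job to an Azure HDInsight cluster.
// All values below are placeholders - substitute your own cluster,
// credentials and storage account details.
var hadoop = Hadoop.Connect(
    new Uri("https://yourcluster.azurehdinsight.net"), // cluster endpoint
    "clusterUser",                                     // cluster user name
    "hadoopUser",                                      // hadoop user
    "password",                                        // cluster password
    "yourstorageaccount.blob.core.windows.net",        // Azure storage account
    "storageAccountKey",                               // storage key
    "container",                                       // default container
    true);                                             // create container if missing

var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();
```

The rest of the code stays the same – only the connection changes between the local and the Azure execution scenarios.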

Step 6: Executing the job

Before executing the job, you should prepare your input – in this case, you should copy the source code files to the input folder we provided as part of the configuration while creating our job (see NamespaceCounterJob). To do this, fire up the Hadoop command line console from the desktop. If your cluster is on Azure, you can remote login to the cluster head node by choosing Remote Login from the HDInsight Dashboard.

  • Create a folder using the hadoop fs -mkdir input/CodeFiles command
  • Copy a few C# files to your folder using hadoop fs -copyFromLocal your\path\*.cs input/CodeFiles

Here, I’m copying all my .cs files under the BasicsRevisited folder to input/CodeFiles.


Now, build your project in Visual Studio, open the bin folder, and execute your exe file (the name of my executable is simply MapReduce.exe). This will internally kick-start MRRunner.exe, and your Map Reduce job will get executed. You can see that the detected file dependencies are automatically submitted.


Once the Map Reduce job is completed, you’ll find the combined output placed in the output/CodeFiles folder. You can issue the -ls and -cat commands to list the files and view the content of the part-00000 file where the output is placed (yes, a little Linux knowledge helps at times). The part-00000 file contains the combined output of our task – the namespaces along with their counts from the files I submitted.
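For instance, from the Hadoop command line (the counts shown are illustrative – your output will vary with the files you submitted):

```
hadoop fs -ls output/CodeFiles
hadoop fs -cat output/CodeFiles/part-00000

using System;               15
using System.Linq;          9
using System.Text;          7
```

Note that the output is tab-separated: the key (the namespace declaration) and the value (the count) emitted by the reducer.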


Step 7: Loading data from HDFS to Hive

As a next step, let us load the data from HDFS into Hadoop Hive so that we can query it. We’ll create a table using the CREATE TABLE Hive syntax and load the data into it. You can run the ‘hive’ command from the Hadoop command line to execute the following statements.

CREATE TABLE nstable (
  namespace STRING,
  count INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA INPATH 'output/CodeFiles/part-00000' INTO TABLE nstable;
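Once loaded, you can run a quick sanity check from the same Hive prompt. A small sketch (the backticks around count guard against it being treated as the aggregate function; output depends on your input files):

```sql
-- Show the most frequently used namespaces first
SELECT namespace, `count`
FROM nstable
ORDER BY `count` DESC
LIMIT 10;
```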

And here is what you might see.


Now, you can read the data back from Hive.
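The SDK’s LINQ to Hive support lets you map a Hive table to a C# type and query it from your application. Below is a rough sketch based on the Microsoft.Hadoop.Hive package – the type names, the HiveConnection constructor arguments, and the WebHCat endpoint are approximations from the SDK version current at the time of writing, so check the package you install for the exact API shape:

```csharp
// Sketch only - verify type names and constructor parameters against
// the Microsoft.Hadoop.Hive package you have installed.
public class NamespaceCount
{
    public string Namespace { get; set; }
    public int Count { get; set; }
}

public class MyHiveDatabase : HiveConnection
{
    public MyHiveDatabase(Uri webHCatUri, string userName, string password)
        : base(webHCatUri, userName, password) { }

    // Maps to the nstable we created above
    public HiveTable<NamespaceCount> NamespaceCounts { get; set; }
}

// The LINQ query is translated to HiveQL and executed on the cluster.
var db = new MyHiveDatabase(new Uri("http://localhost:50111"), "hadoop", null);
var topNamespaces = db.NamespaceCounts
                      .OrderByDescending(n => n.Count)
                      .Take(10)
                      .ToList();
```

From here, you can bind the results to any .NET charting or UI component to visualize the data.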

And there you go. Now you know how to write your own Hadoop Map Reduce jobs in C#, load the data into Hive, and query it back in C# to visualize your data. Happy coding.
