Essential Terminologies to start a conversation about the Container Ecosystem

You heard the news - embracing the world of microservices and containers is essential to making your organisation's IT operations more agile, providing immediate operational benefits. Let us have a look at a few essential concepts to help you start a conversation with your own internal IT, developers and vendors.

Microservices

Well, as Martin Fowler put it, "Microservices - yet another new term on the crowded streets of software architecture" - is what everybody is talking about.
In simple terms, microservice architecture enables packaging (mentally and physically) each unit of functionality into a service, and lets you distribute and scale these services independently. In a traditional monolithic web or enterprise application, if you need to change a simple piece of functionality, you have to rebuild and redeploy the whole application. In a microservice architecture, you can deploy and scale services individually.
Now, this has multiple advantages. You can scale only the services you need to distribute the load effectively - for example, if you see that your customers are using your Order service more than others, you can scale up only the Order service instances from 10 to 20. Though this is nothing new, the evolution of container technologies accelerated microservice based systems, and enabled organizations to adopt a very agile, continuous delivery based workflow to build and deploy applications faster.
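As a concrete sketch of that Order service example - assuming a Kubernetes cluster with a deployment named order-service (both names are illustrative, not from any specific project) - scaling just one service is a one-liner:

```shell
# Scale only the Order service from 10 to 20 instances,
# leaving every other service untouched
kubectl scale deployment order-service --replicas=20

# Verify that the new instances came up
kubectl get deployment order-service
```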


Containers

‘Container’ is probably the most abused term this year, after the term ‘locker room talk’. In its original sense as used today in DevOps, the term emerged from LXC (Linux Containers). LXC is an OS level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.
What is the difference between LXC and virtual machines? As you are aware, standard virtualization systems (like KVM, VirtualBox etc.) let you boot full operating systems of different kinds, even non-Linux systems. The main difference between this and Linux Containers is that a virtual machine requires a separate kernel instance to run on - i.e., almost a full standalone OS.
However, multiple LXCs can be deployed on top of the same kernel (Ah, Microsoft didn’t see that coming, oops) - so LXCs are much cheaper to create and destroy (from a memory and processor footprint perspective) compared to virtual machines. One Linux Container (LXC) can run a single process, and as long as you don’t give root permissions to the process you run, you can impart some level of security to the process your container is running. To be fancy - you can group containers into Pods, and run them in Nodes. (Side note - Microsoft recently announced Windows Containers - have a look)
Platforms like Docker provide an easy workflow for developers to package their application into a container ‘image’, so they can spin up instances of this container later in a very easy way.
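For illustration, a minimal Docker workflow could look like the sketch below; the Node.js base image and file names are assumptions, not from any specific project:

```dockerfile
# Dockerfile - package a small service into an image
FROM node:alpine
WORKDIR /app
COPY . .
RUN npm install
CMD ["node", "server.js"]
```

You would then build the image once with `docker build -t my-service .` and spin up any number of instances with `docker run -d my-service`.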


Pods

In simple terms - containers that need to co-exist in the same kernel/virtual machine/node, along with related run time information, are grouped together as a Pod. So, a Pod is essentially a group of containers that should co-exist. Typically, an application running in one container can access another container via ‘localhost’ as long as both containers are in the same pod. Containers within the same pod will also mostly share the same storage context - much like two applications running in a virtual machine.
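A hypothetical Kubernetes pod manifest with two co-existing containers might look like this (the image names are placeholders); the web container can reach the cache at localhost:6379:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-cache
spec:
  containers:
    - name: web
      image: my-web-app:1.0     # talks to the cache over localhost
    - name: cache
      image: redis:alpine
```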


Volumes

A Volume is an abstraction you can use for storage, and can be used by containers to read/write data. So, containers of the same Pod can share a ‘Volume’. From a Kubernetes perspective, Volumes are attached to Pods - so even if a container crashes, the files needed to restart the container can be kept in the Volume. But when you remove/delete a Pod, normally you throw away the volume related to the Pod as well (in simple scenarios).
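As a sketch, here is how two containers in the same pod could share an emptyDir volume (all names are illustrative); an emptyDir lives and dies with its pod, matching the simple scenario described above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-storage-demo
spec:
  volumes:
    - name: scratch
      emptyDir: {}              # deleted when the pod is removed
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo hello > /data/msg && sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /data
    - name: reader
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /data
```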


Nodes

You can consider a Node as a worker machine (either a virtual machine or a bare metal physical machine). Nodes can run Pods, and multiple nodes are managed by one or more master nodes to form a cluster.


Clusters

A cluster is a large group of containers, some of them grouped into pods and some of them not. A cluster normally has one or more master nodes that manage the pods/containers deployed in the worker nodes - the master is responsible for ensuring the requested number of container instances is up and running all the time, and also for providing API access to the containers in the cluster.

Cluster Federations

Typically, a cluster runs in a single on-premises data center, or in a single availability zone in the case of cloud providers - now, what if these clusters could be tied to each other and federated? This would enable interesting use cases, like the ability to overflow your workloads from one cluster to another. For example, an application can run in a private/on-premises cloud and burst into a public cloud when the demand for compute overflows a specific limit (typically called cloud bursting). The easiest way is to start with Kubernetes Cluster Federation.

Why Does the Microservices Pattern Love Containers?

As containers are easy to spin up and down, they became the favorite model for packaging and shipping your microservices. You can create a service, package it into a container - and deploy these containers independently. Docker became so popular because of its ability to build, package and deploy applications/services using a lightweight container. You can use a Docker image to spin up multiple container instances.
Kubernetes, Docker Swarm etc went one step further, allowing you to define and deploy containers at scale to form a whole cluster of pods with containers. For example, container orchestration engines like Kubernetes will let you specify the whole cluster configuration - including how many containers you need per service/application and how exactly they should talk to each other.
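For example, a Kubernetes Deployment manifest declares the desired instance count right in the cluster configuration, and the orchestrator then keeps that many containers running. The service name and image below are assumptions for illustration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 10                  # Kubernetes keeps 10 instances alive
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: my-registry/order-service:1.0
          ports:
            - containerPort: 8080
```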
So, start from here and think about how to be more agile - and re-architect your own enterprise to build and deliver business benefits faster, in an agile way, embracing containerization.
PS - This is a fast evolving space, and there are a lot of players and platforms in the market. Most of the time an apples-to-apples comparison is not possible between the tools and platforms. But if you are looking to go one step further, have a look at container platforms like OpenShift, Cloud Foundry etc. Macro level platforms/orchestration tools like Fabric8 are also becoming mature - allowing you to spin up your entire DevOps pipeline as a platform - and optimize and manage everything using a unified user experience.

The Point Of Technological Singularity !!

How many of the decisions you made today were influenced by some kind of algorithm?


When you asked Google Maps to show you the shortest driving route? Or when you asked Siri to show you a hotel for your breakfast? Or when you checked Flipboard or Twitter to find recent stories to satisfy your intellectual appetite? Or when you found your date based on those recommendations and profile matching?
Wait - how exactly did you find this article, and why are you reading it now?
Today, most of our decisions are made with the help of apps. In other words, these apps and the related algorithms - maybe somewhere in the cloud, wired to them - influence almost all our decisions, modifying or seriously impacting our behaviour. They influence our thoughts and decisions by making suggestions, deciding what information we see or don't see - even deciding whom we should follow or date. And every day, a lot of people including me spend most of their time enriching these algorithms, and building new apps on top of them, to help (influence) all of us, based on social data, past behaviour and what not.
All hail Google Now, Siri, and Cortana. And those recommendations and ads springing up from everywhere, persuading you. And the algorithms behind all of them.
So, you could theoretically argue that we are at a point in history where a connected human being's nature is seriously influenced and/or modified by a complex nexus of apps and algorithms.
Now, from Wikipedia: Technological Singularity
"The technological singularity, or simply the singularity, is a hypothetical moment in time when artificial intelligence will have progressed to the point of a greater-than-human intelligence, radically changing civilisation, and perhaps human nature."
This post is not to alert you that SkyNet will take over tomorrow. Also, the intention is not to state that this is either 'good' or 'bad' - two mediocre, relative terms.
You still have the choice to switch off, but the persuasion to make things easy by delegating them to an app wins almost all the time. And I think that is fine - as long as we carefully exercise our free will to make the final decision.
Or have we already started trusting apps more than our own intelligence?
What do you think?

Exploring Some Of The Probable C# 6.0 Features

In my last post, we explored how to create a tiny Roslyn app to compile C# code, to test out some of the new C# 6.0 features.

Read it here

Now go ahead and play with some of the preview features. Try things out. Create a file with some C# 6.0 sugar and compile it with our above app. You can explore the language features that are completed in this list, from the Roslyn documentation in CodePlex

Here is some quick code that demos some of the features

1- Support for Primary Constructors and Auto Property Assignments

Primary constructors allow you to specify arguments as part of your class declaration itself. Also, C# now supports assigning to auto properties. Together, you may use them to initialise classes, as shown below.

//C# 6.0 may support Primary Constructors and Assignment To Auto Properties

using System;

namespace CSharp6Test
{
    //Feature: Primary Constructors or Class with arguments ;)
    class Point(int x, int y)
    {
        //Feature: Look ma, you can assign to auto properties now
        public int X { get; set; } = x;
        public int Y { get; set; } = y;
    }

    class MainClass
    {
        public static void Main()
        {
            //Using Primary Constructor
            var p = new Point(1, 3);

            //Reading the values back from the properties
            Console.WriteLine("{0}, {1}", p.X, p.Y);
        }
    }
}

2 – Invocation of static methods directly

Another completed feature seems to be support for invoking static methods directly, without the full namespace.

//Just import the namespace
using System.Console;

namespace CSharp6Test
{
    class Class1
    {
        public static void Main()
        {
            //And use the static methods now directly
            WriteLine("Look ma, now you can use static methods like this...");
        }
    }
}

And that works too.

3 – Dictionary initializers and indexed member access

Say goodbye to the dirty strings when working with dictionary objects and collections. Let us see if the new $ indexed member syntax is going to work. See it here.

using System;
using System.Collections.Generic;
using System.Console;

namespace CSharp6Test
{
    class Class1
    {
        public static void Main()
        {
            //See the new dictionary initializer syntax
            var d = new Dictionary<string, int> { ["item1"] = 1, ["item2"] = 2 };
            //This is even better
            //var d = new Dictionary<string,int>{$item1=1,$item2=2};
            //And now you can access indexed members using $variablename syntax
            //No more dirty strings
            WriteLine(d["item1"] + d["item2"]);
        }
    }
}


Try compiling the above programs with our Roslyn test app (get it here). I'm leaving all the other features up to you to try out. Happy Coding!!

Creating A Tiny Roslyn App To Explore The Features Of C# 6.0

Some of the C# 6.0 Features are exciting. And you can try them out now as the new Roslyn preview is out. You can explore the language features that are completed in this list, from the Roslyn documentation in CodePlex. Some of the ‘Done’ features for C#, based on the documentation there include

  • Primary constructors   -   class Point(int x, int y) { … }  
  • Auto-property initializers -     public int X { get; set; } = x; 
  • Getter-only auto-properties  -   public int Y { get; } = y;  
  • Using static members  -   using System.Console; … Write(4);
  • Dictionary initializer  -  new JObject { ["x"] = 3, ["y"] = 7 }    
  • Indexed member initializer   -  new JObject { $x = 3, $y = 7 }    
  • Indexed member access -   c.$name = c.$first + " " + c.$last;  
  • Declaration expressions -   int.TryParse(s, out var x);    
  • Await in catch/finally  -  try … catch { await … } finally { await … }   
  • Exception filters -    catch(E e) if (e.Count > 5) { … }   

In this post, we’ll explore:

  • How to parse and walk the Roslyn syntax tree and dump it
  • How to write a small ‘compiler’ using Roslyn’s CSharpCompilation
  • Use the same to explore if those features are implemented.

We could’ve used the REPL/Scripting APIs, but sadly they are not available in the new pre-release – so let us write a simple app using the Roslyn APIs to test the new C# features.

[Ouch, lazy to write the code? Fork it from Github - ]

Our CSharp6Test App

Create a new C# project in Visual Studio 2012/2013, fire up the Nuget console, and install the Roslyn pre-release bits. I’m doing this in VS 2012.

  Install-Package Microsoft.CodeAnalysis -Pre

Now, let us write some code to parse the syntax tree and do the compilation.


using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using System;
using System.IO;
using System.Linq;

namespace CSharp6Test
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                Console.WriteLine("C#> Roslyn Code Verifier.");
                if (args.Count() < 1)
                {
                    Console.WriteLine("C#> Usage: CSharp6Test <file> [output]");
                    return;
                }

                //User provided the input filename
                string file = args[0];

                //Optional output filename, defaulting to <file>.exe
                string output = (args.Count() > 1) ? args[1] : file + ".exe";

                //Create a syntax tree from the code
                SyntaxTree tree = CSharpSyntaxTree.ParseText(File.ReadAllText(file));

                Console.WriteLine("C#> Dumping Syntax Tree");

                //Dumping it using our extension method
                tree.Dump();

                Console.WriteLine("C#> Trying to compile Syntax Tree");
                tree.Compile(output);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Oops: {0}", ex.Message);
                Console.WriteLine("Sorry Skywalker, that was an exception - Love, Yoda");
            }
        }
    }
}


Well, as is evident, we are just parsing the syntax tree from the file, dumping it, and then compiling it. The Dump and Compile extension methods are below, at your service.


using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using System;
using System.Collections.Generic;
using System.Linq;

namespace CSharp6Test
{
    public static class SyntaxTreeExtensions
    {
        public static void Dump(this SyntaxTree tree)
        {
            var writer = new ConsoleDumpWalker();
            writer.Visit(tree.GetRoot());
        }

        class ConsoleDumpWalker : SyntaxWalker
        {
            public override void Visit(SyntaxNode node)
            {
                int padding = node.Ancestors().Count();
                //To identify leaf nodes vs nodes with children
                string prepend = node.ChildNodes().Count() > 0 ? "[-]" : "[.]";
                //Get the type of the node
                string line = new String(' ', padding) + prepend +
                                        " " + node.GetType().ToString();
                //Write the line
                Console.WriteLine(line);
                base.Visit(node);
            }
        }

        public static void Compile(this SyntaxTree tree, string output)
        {
            //Creating a compilation - add MetadataFileReferences for the
            //assemblies your code needs (mscorlib and System.Core shown here)
            var compilation = CSharpCompilation.Create
                ("CSharp6Test", new List<SyntaxTree> { tree },
                 new List<MetadataReference>
                 {
                     new MetadataFileReference(typeof(object).Assembly.Location),
                     new MetadataFileReference(typeof(Enumerable).Assembly.Location)
                 },
                 new CSharpCompilationOptions(OutputKind.ConsoleApplication));

            //Lets do the diagnostics
            var check = compilation.GetDiagnostics();

            //Report issues if any
            if (check.Count() > 0)
            {
                bool hasError = false;
                Console.WriteLine("C#> Few Issues Found");
                foreach (var c in check)
                {
                    Console.WriteLine("{0} : {1} in {2}", c.Severity, c.GetMessage(), c.Location);
                    if (c.Severity == DiagnosticSeverity.Error) hasError = true;
                }
                if (hasError)
                {
                    Console.WriteLine("C#> Errors found. Aborting");
                    return;
                }
            }

            //Emit the binary
            var emit = compilation.Emit(output);

            if (emit.Success)
                Console.WriteLine("C#> No Errors Found. Created {0}", output);
            else
                Console.WriteLine("C#> Oops, can't create {0}", output);
        }
    }
}


And that is it. Go ahead and compile as you please - you've got a tiny tool to test the new C# features. Check out my next post about experimenting with some of the C# 6.0 features.

Hadoop On Azure / HDInsight– Quick Intro Video On Writing Map Reduce Jobs In C#

Here is a quick intro screencast on Big Data and creating MapReduce jobs in C# to distribute the processing of large volumes of data, leveraging Microsoft Azure HDInsight / Hadoop on Azure, based on my Virtual TechDays presentation.


ScriptCs Templating Support


If you haven’t yet heard about ScriptCs, it is not too late. Go here 

I just checked in a ScriptCs templating module to integrate Razor and StringTemplate transformations into the ScriptCs workflow. The ScriptCs templating module can apply a Razor or StringTemplate (ST4) template on top of one or more model files (normally an XML or JSON file), for scenarios like code generation or templating. Example below.

The first bits are

Installing template module

  • Install scriptcs. You need to install the nightly build using Chocolatey - cinst scriptcs -pre -source (Refer if you don't understand this)
  • Clone the repo at or download the source code from there.
  • Open the solution and build it in Visual Studio.
  • From the command line create a package: nuget pack -version 0.1.0-alpha
  • In VS, edit your Nuget package sources (Tools->Library Package Manager->Settings) and add the folder where the package lives locally.
  • Install the template package globally: scriptcs -install ScriptCs.Engine.Templating -g -pre

Rendering a template

  • Note that whatever you pass after -- will go to the module as its arguments.
  • Run scriptcs with your template file, specifying our template module using the -modules switch: scriptcs mytemplate.cst -loglevel debug -modules template - The -modules template argument lets scriptcs load our template module.
  • Check the log output for details.
  • You can specify the output file using the -out switch: scriptcs mytemplate.cst -modules template -- -out result.txt (The parameters after -- are the template module parameters, as per the ScriptCs convention)

Rendering a template Using Models

The template module automagically converts XML files/URLs and JSON files/URLs into dynamic models that can be used from your template. Technically, it creates a C# fluent dynamic object that wraps the XML/JSON.

Quick example: Create a new folder, and create a model.xml file inside that.

<class name="MyClass1">
  <property name="MyProperty1" type="string"/>
  <property name="MyProperty2" type="string"/>
  <class name="MyClass2">
    <property name="MyProperty1" type="string"/>
    <property name="MyProperty2" type="string"/>
  </class>
</class>

Now, create your template, and save it as template.cst - Let us use razor syntax.

@model dynamic

@foreach(var item in Model["class"])
{
   <p>@item.name is a class name</p>
}

Now, you can run the transformation by specifying the model file, like this

scriptcs template.cst -modules template -- -xml model.xml -out result.txt

Let us rewrite the template to generate the class and properties from our XML model file

@model dynamic

@foreach(var c in Model["class"])
{
   @:class @c.name {

   foreach(var p in c["property"])
   {
       @:public @p.type @p.name {get;set;}
   }

   @:}
}

Regenerate the result file and see.

Transforming multiple model files

You may specify multiple model files

scriptcs template.cst -modules template -- -xml model1.xml -out result1.txt -xml model2.xml -out result2.txt

You may also use Json files as the model

scriptcs template.cst -modules template -- -json model.json -out result.cs

Accessing models from templates

For converting XML files/data to a dynamic model object that'll be accessed from templates, ElasticObject is used. Refer

For converting Json files/data to a dynamic model object that'll be accessed from templates, DynamicJsonConverter is used. Refer

Template module command line

Example usages:

  • Render using template mytemplate.cst and ElasticObject dynamic model from model1.xml using Razor (Razor is default)
    •  scriptcs mytemplate.cst -modules template -- -xml model1.xml -out result1.txt
  • Do the same as above, but using vb as the template language
    • scriptcs mytemplate.vbt -modules template -- -vb -xml model1.xml -out result1.txt
  • Render using template mytemplate.st4 - Using StringTemplate instead of Razor
    • scriptcs mytemplate.st4 -modules template -- -st4 -xml model1.xml -out result1.txt

Parameter meaning:

Additional References:

More tests need to be added, and ST4 support is a bit untested. Happy Coding.

CakeRobot - A Gesture Driven Robot That Follows Your Hand Movements Using Arduino, C# and Kinect

Over the last few weekends I’ve spend some time building a simple robot that can be controlled using Kinect. You can see it in action below.
Ever since I read this Cisco paper that mentions the Internet of Things will create a whopping $14.4 trillion at stake, I revamped my interest in hobby electronics and started hacking with DIY boards like Arduino and Raspberry Pi. That turned out to be fun, and I ended up with the robot. This post provides the general steps, and the GitHub code may help you build your own.
Even if you don’t have a Kinect for the controller, you can easily put together a controller using your phone (Windows Phone/Android/iOS), as we are using Bluetooth to communicate between the controller and the robot.

Now, here is a quick start guide to build your own. In this case, we’ve an app running on the laptop that communicates with the robot via Bluetooth, pumping commands based on input from Kinect - you could easily build a phone UI as well.
And if you already got the idea, here is the code – You may read further to build the hardware part.

1 – Build familiarity

You need to build some familiarity with Arduino and/or Netduino – In this example I’ll be using Arduino.


Explore Arduino. Mainly you need to understand the pins on the Arduino board. You can write simple programs with the Arduino IDE (try the Blink sample to blink an LED, under IDE File->Samples). Below is the pins description, from the SparkFun website.
    • GND (3): Short for ‘Ground’. There are several GND pins on the Arduino, any of which can be used to ground your circuit.
    • 5V (4) & 3.3V (5): The 5V pin supplies 5 volts of power, and the 3.3V pin supplies 3.3 volts of power. Most of the simple components used with the Arduino run happily off of 5 or 3.3 volts. If you’re not sure, take a look at Spark Fun’s datasheet tutorial then look up the datasheet for your part.
    • Analog (6): The area of pins under the ‘Analog In’ label (A0 through A5 on the UNO) are Analog In pins. These pins can read the signal from an analog sensor (like a temperature sensor) and convert it into a digital value that we can read.
    • Digital (7): Across from the analog pins are the digital pins (0 through 13 on the UNO). These pins can be used for both digital input (like telling if a button is pushed) and digital output (like powering an LED).
    • PWM (8): You may have noticed the tilde (~) next to some of the digital pins (3, 5, 6, 9, 10, and 11 on the UNO). These pins act as normal digital pins, but can also be used for something called Pulse-Width Modulation (PWM). We have a tutorial on PWM, but for now, think of these pins as being able to simulate analog output (like fading an LED in and out).
    • AREF (9): Stands for Analog Reference. Most of the time you can leave this pin alone. It is sometimes used to set an external reference voltage (between 0 and 5 Volts) as the upper limit for the analog input pins.



2 – Get The Components

Again, you could find an online store to buy these components. Also, you could try the nearby local electronics store and buy some breadboards, jumper wires (get some male-to-male & female-to-female wires) etc. as well. Here is the list of components you need to build CakeRobot.
  • A Chassis – I used Dagu Magician Chassis  – From Spark Fun, Rhydolabz – it comes with two stepper motors that can be controlled by our driver board.
  • An Arduino board with Motor Driver – I used the Dagu Mini Motor Driver – bought from Rhydolabz in India. For other countries you need to search and find one. Some description about the board can be found here – it also has a special slot to plug in the Dagu Bluetooth shield. You could also use the Micro Magician
    • You also need a Micro USB cable to connect your PC to the motor driver to upload code.
  • A Bluetooth Shield – Get the Dagu Bluetooth module if you can find it. I’ve purchased a Class 2 RN-42 Bluetooth shield from Rhydolabz
  • Few mini modular breadboards
  • Jumper wires – Get a mixed pack with M/M, F/F, M/F – like this one
  • A tiny Bluetooth dongle for your PC/Laptop, like this one, to communicate with the Bluetooth shield in the robot (if you don’t have built in Bluetooth)
  • Few sensors if you want to have more fun – I had an ultrasonic distance sensor to avoid collisions in my final version. A better alternative is the Ping sensor.
  • A battery pack and battery holder that should supply around 6V to the Mini Motor Driver
  • Other components for your later creativity/exploration
    • LEDs
    • Resistors
    • More Sensors
  • Tools
    • Few star screw drivers
    • Duct tapes/rubber bands (yea, we are prototyping so no soldering as of now)
And I found these reads pretty useful

3 – Programming the components

You need to spend some hours figuring out how to program each of the components.
  • To start with, play with Arduino a bit, connecting LEDs, switches etc. Then, understand a bit about programming the Digital and Analog pins. Play with the examples
  • Try programming the ultrasonic sensor if you’ve one using your Arduino, using serial sockets. If you are using Ping sensor, check out this
  • Try programming the blue tooth module (Code I used for the distance sensor and blue tooth module are in my examples below, but it’ll be cool if you can figure things out yourself).

4 – Put the components together

Assemble the Dagu Magician Chassis, and place/screw/mount the mini motor driver and Bluetooth module on top of the same. Connect the components using jumper wires/plugin as required. A high level schematic below.

Here is a low resolution snap of mine, from top.

5 – Coding the Arduino Mini Driver

You can explore the full code in the GitHub repo - however, here are a few pointers. According to the Dagu Arduino Mini Driver spec, the following digital pins can be used to control the motors
  • D9 is left motor speed
  • D7 is left motor direction
  • D10 is right motor speed
  • D8 is right motor direction

To make a motor move, first we need to set the direction by doing a digitalWrite of HIGH or LOW (for Forward/Reverse) to the direction pin. Next set the motor speed by doing an analogWrite of 0~255 to the speed pin. 0 is stopped and 255 is full throttle.
In the Arduino code, we are initiating communication via blue tooth, to accept commands as strings. For example, speedl 100 will set the left motor speed to 100, and speedr 100 will set the right motor speed to 100. Relevant code below.
        //Setting up the communication with Bluetooth shield over serial 

        Serial.begin(115200);  // rn42 bt


      //Read the input In getSerialLine (shortened for brevity)

      while (serialIn != '\n') {
          if (!(Serial.available() > 0))
              continue;

          serialIn = Serial.read();
          if (serialIn != '\n') {
              char a = char(serialIn);
              strReceived += a;
          }
      }


        //Process the command (shortened for brevity)
        else if (command == "speedl") {
            val = getValue(input, ' ', 1).toInt();
            analogWrite(9, val);   // D9 - left motor speed
        }
        else if (command == "speedr") {
            val = getValue(input, ' ', 1).toInt();
            analogWrite(10, val);  // D10 - right motor speed
        }


Have a look at the full code of the quick Arduino client here. Then, compile and upload the code to your mini driver board.

6 – Coding the Controller & Kinect

Essentially, what we are doing is just tracking the skeletal frame, and calculating the distance of your hand from your hip to provide the direction and speed for the motors. Skeletal tracking details here
We are leveraging the 32Feet library for identifying the Bluetooth shield to send the commands. Please ensure your Bluetooth shield is paired with your PC/Laptop/Phone – you can normally do that by clicking the Bluetooth icon in the system tray in Windows, and clicking Add Device.
       //For each 600 ms, send a new command
       //_btCon is our instance variable for a blue tooth connection, built over the cool 32Feet library
       internal void ProcessCommand(Skeleton skeleton)
       {
            var now = DateTime.Now;
            if (now.Subtract(_prevTime).TotalMilliseconds < 600)
                return;

            _prevTime = DateTime.Now;

            Joint handRight = skeleton.Joints[JointType.HandRight];
            Joint handLeft = skeleton.Joints[JointType.HandLeft];
            Joint hipLeft = skeleton.Joints[JointType.HipLeft];
            Joint hipRight = skeleton.Joints[JointType.HipRight];

            //Hand below the hip - stop that side's motor
            if (handRight.Position.Y < hipRight.Position.Y)
                _btCon.SetSpeed(Motor.Left, 0);

            if (handLeft.Position.Y < hipLeft.Position.Y)
                _btCon.SetSpeed(Motor.Right, 0);

            //Hand above the hip - speed proportional to the distance, capped at 230
            if (handRight.Position.Y > hipRight.Position.Y)
            {
                var speed = (handRight.Position.Y - hipRight.Position.Y) * 200;
                if (speed > 230) speed = 230;
                _btCon.SetSpeed(Motor.Left, (int)speed);
            }

            if (handLeft.Position.Y > hipLeft.Position.Y)
            {
                var speed = (handLeft.Position.Y - hipLeft.Position.Y) * 200;
                if (speed > 230) speed = 230;
                _btCon.SetSpeed(Motor.Right, (int)speed);
            }
       }

And so, it sets the speed based on your hand movements. Explore the ConnectionHelper and BluetoothConnector classes I wrote.


The code is here in Github. Fork it and play with it, and expand it.

Reactive Extensions Or Rx (More On IEnumerable, IQueryable, IObservable and IQbservable) - Awesome Libraries For C# Developers #2

In my last post – we had a look at Interactive Extensions. In this post, we’ll do a recap of Reactive Extensions and LINQ to Event streams.

Reactive Extensions have been out there in the wild for some time now, and I had a series about Reactive Extensions a few years back. However, after my last post on Interactive Extensions, I thought we should discuss Reactive Extensions in a bit more detail. Also, in this post we’ll touch on IQbservables – the most mysteriously named thing/interface in the world, maybe after the Higgs Boson. Push and pull sequences are everywhere – and now, with devices on one end and the cloud at the other end, most data transactions happen via push/pull sequences. Hence, it is essential to grab the basic concepts regarding the programming models around them.

First Things First

Let us take a step back and discuss IEnumerable and IQueryable first, before discussing further about Reactive IObservable and IQbservable (Qbservables = Queryable Observables – Oh yea, funny name).


As you may be aware, the IEnumerable model can be viewed as a pull operation. You get an enumerator, and then you iterate the collection by calling MoveNext until you reach the final item. Pull models are useful when the environment is requesting data from an external source. To cover some basics – IEnumerable has a GetEnumerator method which returns an enumerator with a MoveNext() method and a Current property. Offline tip – a C# foreach statement can iterate on any dumb thing that can return a GetEnumerator. Anyway, here is what the non generic version of IEnumerable looks like.

public interface IEnumerable
{
    IEnumerator GetEnumerator();
}

public interface IEnumerator
{
    Object Current { get; }
    bool MoveNext();
    void Reset();
}
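To see the pull model in action, here is a minimal sketch (my own illustration, not from the Rx sources) that enumerates a plain array by hand – which is essentially what a foreach loop expands to:

```csharp
using System;
using System.Collections;

public class Program
{
    public static void Main()
    {
        IEnumerable numbers = new[] { 1, 2, 3 };

        //Pull items one by one - this is what foreach does under the hood
        IEnumerator enumerator = numbers.GetEnumerator();
        while (enumerator.MoveNext())
            Console.WriteLine(enumerator.Current);
    }
}
```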

Now, LINQ defines a set of operators as extension methods on top of the generic version of IEnumerable – i.e., IEnumerable<T>. By leveraging the type inference support for generic methods, you can invoke these methods on any IEnumerable without specifying the type – i.e., you can say someStringArray.Count() instead of someStringArray.Count<String>(). You can explore the Enumerable class to find these static extensions.
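To make that concrete, here is a tiny hand-rolled sketch of such an extension method (MyCount is a made-up name, just to mirror what Enumerable.Count does):

```csharp
using System;
using System.Collections.Generic;

public static class MyEnumerableExtensions
{
    //A hand-rolled Count - generic type inference figures out T for us
    public static int MyCount<T>(this IEnumerable<T> source)
    {
        var count = 0;
        foreach (var item in source)
            count++;
        return count;
    }
}

public class Program
{
    public static void Main()
    {
        var words = new[] { "apple", "orange", "mango" };

        //No need to write words.MyCount<string>() - T is inferred as string
        Console.WriteLine(words.MyCount());
    }
}
```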

The actual query operators in this case (like Where, Count etc.) with related expressions are compiled to IL, and they operate in process, much like any IL code executed by the CLR. From an implementation point of view, the parameter of a LINQ clause like Where is a lambda expression (as you may already know, the from..select syntax is just sugar that gets expanded to extension methods of IEnumerable<T>), and in most cases a delegate like Func<T,..> can represent an expression from an in-memory perspective. But what if you want to apply query operators on items sitting somewhere else? For example, how do you apply LINQ operators on top of a set of data rows stored in a table in a database that may be in the cloud, instead of an in-memory collection that is an IEnumerable<T>? That is exactly what IQueryable<T> is for.


IQueryable<T> is an IEnumerable<T> (it inherits from IEnumerable<T>), and it points to a query expression that can be executed in a remote world. The LINQ operators for querying objects of type IQueryable<T> are defined in the Queryable class, and take Expression<Func<T,..>> parameters when you apply them on an IQueryable<T> – that is, a System.Linq.Expressions.Expression (you can read about expression trees here). This will be translated to the remote world (say a SQL system) via a query provider. So, essentially, concrete implementations of IQueryable point to a query expression and a query provider – it is the job of the query provider to translate the query expression to the query language of the remote world where it gets executed. From an implementation point of view, the parameters you pass to LINQ operators applied on an IQueryable are assigned to an Expression<Func<T,..>> instead. Expression trees in .NET provide a way to represent code as data – a kind of abstract syntax tree. Later, the query provider will walk through this to construct an equivalent query in the remote world.

    public interface IQueryable : IEnumerable
    {
        Type ElementType { get; }
        Expression Expression { get; }
        IQueryProvider Provider { get; }
    }

    public interface IQueryable<T> : IEnumerable<T>, IQueryable, IEnumerable
    {
    }

For example, in LINQ to Entity Framework or LINQ to SQL, the query provider will convert the expressions to SQL and hand it over to the database server. In short, the LINQ query operators you apply on an IQueryable will be used to build an expression tree, and this will be translated by the query provider to build and execute a query in a remote world. Read this article if you are not clear about how expression trees are built using Expression<T> from lambdas.
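As a quick sketch of this (using AsQueryable, so the in-memory LINQ to Objects provider stands in for a real remote provider), you can see the captured expression tree by printing the Expression property:

```csharp
using System;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var numbers = new[] { 1, 2, 3, 4, 5 };

        //Where is captured as an expression tree here, not compiled to a delegate
        var query = numbers.AsQueryable().Where(n => n > 2);

        //Prints the expression tree that represents the Where call
        Console.WriteLine(query.Expression);

        //Only enumeration makes the provider compile and run the expression
        foreach (var n in query)
            Console.WriteLine(n);
    }
}
```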

Reactive Extensions

So, now let us get into the anatomy and philosophy of observables.

IObservable<T>

As we discussed, objects of type IEnumerable<T> are pull sequences. But then, in the real world, at times we push things as well – not just pull. (Health alert – when you do both together, make sure you do it safe.) In a lot of scenarios, the push pattern makes a lot of sense – for example, instead of you waiting in a queue infinitely, day and night, with your neighbors in front of the local post office to collect snail mail, the post office agent will just push the mail to your home when it arrives.

Now, one of the cool things about push and pull sequences is that they are duals. This also means IObservable<T> is a dual of IEnumerable<T> – see the code below. So, to keep the story short, the dual interface of IEnumerable, derived using categorical duality, is IObservable. The story goes that some members of Erik’s team (he was with Microsoft then) had a well deserved temporal megalomaniac hyperactive spike when they discovered this duality. Here is a beautiful paper from Erik on that if you are interested – a brief summary of Erik’s paper is below.

//Generic version of IEnumerable, ignoring the non generic IEnumerable base

interface IEnumerable<out T>
{
    IEnumerator<T> GetEnumerator();
}

interface IEnumerator<out T> : IDisposable
{
    bool MoveNext(); // throws Exception
    T Current { get; }
}

//Its dual IObservable

interface IObservable<out T>
{
    IDisposable Subscribe(IObserver<T> observer);
}

interface IObserver<in T>
{
    void OnCompleted(bool done);
    void OnError(Exception exception);
    T OnNext { set; }
}

Surprisingly, the IObservable implementation looks like the Observer pattern.

Now, LINQ operators are cool. They are very expressive, and provide an abstraction to query things. So the crazy guys in the Reactive Team thought they should take LINQ to work against event streams. Event streams are in fact push sequences, instead of pull sequences. So, they built IObservable. The IObservable fabric lets you write LINQ operators on top of push sequences like event streams, much the same way you query an IEnumerable<T>. The LINQ operators for an object of type IObservable<T> are defined in the Observable class. So, how will you implement a LINQ operator, like Where, on an observable to do some filtering? Here is a simple example of the filter operator Where for an IEnumerable and an IObservable (simplified for comparison). In the case of IEnumerable, you dispose the enumerator when you are done traversing.

//Where for IEnumerable

static IEnumerable<T> Where<T>(this IEnumerable<T> source, Func<T, bool> predicate)
{
    // Conceptually:
    //   foreach (var element in source)
    //       if (predicate(element))
    //           yield return element;
    using (var enumerator = source.GetEnumerator())
    {
        while (enumerator.MoveNext())
        {
            var value = enumerator.Current;
            if (predicate(value))
                yield return value;
        }
    }
}

//Where for IObservable

static IObservable<T> Where<T>(this IObservable<T> source, Func<T, bool> predicate)
{
    return Observable.Create<T>(observer =>
    {
        return source.Subscribe(Observer.Create<T>(value =>
        {
            try
            {
                if (predicate(value))
                    observer.OnNext(value);
            }
            catch (Exception e)
            {
                observer.OnError(e);
            }
        },
        observer.OnError,
        observer.OnCompleted));
    });
}

Now, look at the IObservable’s Where implementation. In this case, we return the IDisposable handle of the subscription so that we can dispose it to stop the subscription. For filtering, we are simply creating an inner observer that we subscribe to the source to apply our filtering logic – and then creating another top level observable that wraps this subscription. Now, you can have any concrete implementation of IObservable<T> that wraps an event source, and then you can query it using Where!! Cool. The Observable class in Reactive Extensions has a few helper methods to create observables from events, like FromEvent. Let us create an observable, and query the events now. Fortunately, the Rx team already has the entire implementation of observables and related query operators, so we don’t end up writing custom query operators like this.

You can run install-package Rx-Main from the NuGet Package Manager Console to install Rx, and try out this example that shows event filtering.

            //Let us print all ticks between 5 seconds and 20 seconds
            //Interval in milliseconds
            var timer = new Timer() { Interval = 1000 };

            //Create our event stream which is an Observable
            var eventStream = Observable.FromEventPattern<ElapsedEventArgs>(timer, "Elapsed");
            var nowTime = DateTime.Now;

            //Same as eventStream.Where(item => ...);
            var filteredEvents = from e in eventStream
                                 let time = e.EventArgs.SignalTime
                                 where time > nowTime.AddSeconds(5) &&
                                       time < nowTime.AddSeconds(20)
                                 select e;

            //Subscribe to our observable
            filteredEvents.Subscribe(t => Console.WriteLine(DateTime.Now));

            timer.Start();
            Console.WriteLine("Let us wait..");
            Console.ReadLine();
            //Dispose the subscription explicitly if you want

Obviously, in the above example, we could’ve used Observable.Timer – but I just wanted to show how to wrap an external event source with observables. Similarly, you can wrap your Mouse Events or WPF events.  You can explore more about Rx and observables, and few applications here. Let us move on now to IQbservables.


Now, let us focus on IQbservable<T>. IQbservable<T> is the counterpart of IObservable<T> that represents a query on push sequences/event sources as an expression, much like IQueryable<T> is the counterpart of IEnumerable<T>. So, what exactly does this mean? If you inspect IQbservable, you can see that

public interface IQbservable<out T> : IQbservable, IObservable<T>
{
}

public interface IQbservable
{
    Type ElementType { get; }
    Expression Expression { get; }
    IQbservableProvider Provider { get; }
}

You can see that it has an Expression property to represent the LINQ to Observable query much like how IQueryable had an Expression to represent the AST of a LINQ query. The IQbservableProvider is responsible for translating the expression to the language of a remote event source (may be a stream server in the cloud).


This post is a very high level summary of Rx Extensions, and here is an awesome talk from Bart De Smet that you cannot miss.

And let me take the liberty of embedding the drawing created by Charles, which is a concrete representation of the abstract drawing Bart did on the whiteboard. This is the summary of this post.

representation of the three dimensional graph of Rx's computational fabric

We’ll discuss more practical scenarios where Rx and Ix come in so handy in the future – mainly device to cloud interaction scenarios, complex event processing, task distribution using IScheduler etc. – along with some brilliant add-on libraries others are creating on top of Rx. But this one was for a quick introduction. Happy Coding!!

Interactive Extensions - Awesome Libraries For C# Developers #1

Recently, while I was giving a C# talk, I realized that a lot of developers are still not familiar with the advantages of some of the evolving, but very useful, .NET libraries. Hence, I thought about writing a high level post introducing some of them as part of my Back To Basics series, generally around .NET and JavaScript. In this post we’ll explore Interactive Extensions, a set of extensions initially developed for Reactive Extensions by the Microsoft Rx team.


Interactive Extensions, at its core, adds a number of new extension methods to IEnumerable<T> – i.e., it adds a number of utility LINQ to Objects query operators. You may have hand coded some of these utility extension methods somewhere in your helpers or utility classes, but now a lot of them are aggregated together by the Rx team. Also, this post assumes you are familiar with the cold IEnumerable model and iterators in C#. Basically, what the C# compiler does is take a yield return statement and generate a class out of that for each iterator. So, in one way, each C# iterator internally holds a state machine. You can examine this using Reflector or something similar, on a method yield returning an IEnumerator<T>. Or better, there is a cool post from my friend Abhishek Sur here, or this post about the implementation of iterators in C#.

More About Interactive Extensions

Fire up a C# console application, and install the Interactive Extensions package using install-package Ix-Main. You can explore the EnumerableEx class in the System.Linq namespace in System.Interactive.dll. Now, let us explore some useful extension methods that got added to IEnumerable.


Examining Few Utility Methods In Interactive Extensions

Let us quickly examine few useful Utility methods.


What the simplest version of 'Do' does is pretty interesting. It'll lazily invoke an action on each element in the sequence, when we actually enumerate it via the iterator.

 //Let us create a set of numbers
 var numbers = new int[] { 30, 40, 20, 40 };
 var result = numbers.Do(n => Console.WriteLine(n));

 Console.WriteLine("Before Enumeration");

 //The action will be invoked when we actually enumerate
 foreach (var item in result) { }

 Console.WriteLine("After Enumeration");


And the result is below. Note that the action (in this case, our Console.WriteLine to print the values) is applied only when we enumerate.


Now, the implementation of the simplest version of Do is something like this. If you have a quick peek at the Interactive Extensions source code in CodePlex, you can see how the Do method is actually implemented. Here is a shortened version.

public static class StolenLinqExtensions
{
    public static IEnumerable<TSource> StolenDo<TSource>(this IEnumerable<TSource> source, Action<TSource> onNext)
    {
        //Get the enumerator
        using (var e = source.GetEnumerator())
        {
            while (true)
            {
                //Move next, bail out when we are done
                if (!e.MoveNext())
                    break;

                var current = e.Current;

                //Call our action on top of the current item
                onNext(current);

                //Yield return
                yield return current;
            }
        }
    }
}


Cool, huh.


DoWhile in Ix is pretty interesting. It generates an enumerable sequence by repeating the source sequence as long as the given condition is true.

IEnumerable<TResult> DoWhile<TResult>(IEnumerable<TResult> source, Func<bool> condition)

Consider the following code.

  var numbers = new int[] { 30, 40, 20, 40 };

  var then = DateTime.Now.Add(new TimeSpan(0, 0, 10));
  var results = numbers.DoWhile(() => DateTime.Now < then);

  foreach (var r in results)
      Console.WriteLine(r);
As expected, you’ll see the foreach loop enumerating the results repeatedly, till the DateTime.Now < then condition turns false – i.e., for about 10 seconds.


Scan takes a sequence and applies an accumulator function to generate a sequence of accumulated values. As an example, let us create a simple sum accumulator that takes a set of numbers and accumulates the sum of each number with the previous result.

 var numbers = new int[] { 10, 20, 30, 40 };
 //0 is just the starting seed value
 var results = numbers.Scan(0, (sum, num) => sum + num);

 //Print results. Results will contain 10, 30, 60, 100
 foreach (var r in results)
     Console.WriteLine(r);

 //0 + 10 = 10
 //10 + 20 = 30
 //30 + 30 = 60
 //60 + 40 = 100
And you may have a look at the actual Scan implementation in the Rx repository in CodePlex. Here is an abbreviated version.

IEnumerable<TAccumulate> StolenScan<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> accumulator)
{
    var acc = seed;

    foreach (var item in source)
    {
        acc = accumulator(acc, item);
        yield return acc;
    }
}


We just touched the tip of the iceberg, as the objective of this post was to introduce you to Ix. We may discuss this in a bit more depth after covering a few other libraries, including Rx. There is a pretty exciting talk from Bart De Smet here that you should not miss. Ix is specifically very interesting because of its functional roots. Have a look at the Reactive Extensions repository in CodePlex for more inspiration – that should give you a lot more ideas about a few functional patterns. You may also play with the Ix Providers and Ix Async packages.

As usual, happy coding!!

Building A Recommendation Engine - Machine Learning Using Windows Azure HDInsight, Hadoop And Mahout

Feel like helping some one today?

Let us help the Stack Exchange guys suggest questions a user can answer, based on his answering history, much like the way Amazon suggests you products based on your previous purchase history. If you don’t know what Stack Exchange does – they run a number of Q&A sites, including the massively popular Stack Overflow.

Our objective here is to see how we can analyze the past answers of a user to predict questions that he may answer in the future. Stack Exchange’s current recommendation logic may work better than ours, but that won’t prevent us from helping them for our own learning purposes.

We’ll be doing the following tasks.

  • Extracting the required information from Stack Exchange data set
  • Using the required information to build a Recommender

But let us start with the basics. If you are totally new to Apache Hadoop and Hadoop on Azure, I recommend reading these introductory articles before you begin, where I explain HDInsight and the Map Reduce model a bit in detail.

Behind the Scenes

Here we go, let us get into some “data science” voodoo first. Cool!! Distributed machine learning is mainly used for

  • Recommendations – remember the Amazon recommendations? Normally used to predict preferences based on history.
  • Clustering – for tasks like grouping related documents from a set of documents, or finding like minded people in a community.
  • Classification – for identifying which category a new item belongs to. This normally includes training the system first, and then asking the system to classify an item.

“Big Data” jargon is often used when you need to perform operations on a very large data set. In this article, we’ll be dealing with extracting some data from a large data set, and building a Recommender using our extracted data.

What is a Recommender?

Broadly speaking, we can build a recommender either by

  • Finding questions that a user may be interested in answering, based on the questions answered by other users like him
  • Finding other questions that are similar to the questions he answered already.

The first technique is known as user based recommendation, and the second is known as item based recommendation.

In the first case, taste can be determined by how many questions you answered in common with that user (the questions both of you answered). For example, think about User1, User2, User3 and User4 answering a few questions Q1, Q2, Q3 and Q4. This diagram shows the questions answered by the users.

Based on the above diagram, User1 and User2 answered Q1, Q2 and Q3. Now, User3 answered Q3 and Q2, but not Q1. To some extent, we can safely assume that User3 will be interested in answering Q1 – because two users who answered Q2 and Q3 with him already answered Q1. There is some taste matching here, isn’t there? So, if you have an array of {UserId, QuestionId}, it seems that data is enough for us to build a recommender.

The Logic Side

Now, how exactly are we going to build a question recommender? In fact, it is quite simple.

First, we need to find the number of times a pair of questions co-occur across the available users. Note that this matrix has no relation to any particular user. For example, if Q1 and Q2 appear together 2 times (as in the above diagram), the co-occurrence value at {Q1,Q2} will be 2. Here is the co-occurrence matrix (hope I got this right).

  • Q1 and Q2 co-occurs 2 times (User1 and User2 answered Q1 ,Q2)
  • Q1 and Q3 co-occurs 2 times (User1 and User2 answered both Q1, Q3)
  • Q2 and Q3 co-occurs 3 times (User1, User2 and User3 answered Q2, Q3)
  • Likewise…


The above matrix just captures how many times a pair of questions co-occurred (were answered together), as discussed above. There is no mapping with users yet. Now, how will we relate this to find a user’s preference? To find out how closely a question ‘matches’ a user, we just need to

  • Find out how often that question co-occurs with other questions answered by that user
  • Eliminate questions already answered by the user.

For the first step, we need to multiply the above matrix with the user’s preference matrix.

For example, let us take User3. For User3, the preference mapping with questions [Q1,Q2,Q3,Q4] is [0,1,1,0], because he already answered Q2 and Q3, but not Q1 and Q4. So, let us multiply this with the above co-occurrence matrix. Remember that this is a matrix multiplication/dot product. The result indicates how often a question co-occurs with the other questions answered by the user (a weightage).


We can omit Q2 and Q3 from the results, as we know User3 already answered them. Now, from the remaining Q1 and Q4, Q1 has the higher value (4) and hence the higher taste match with User3. Intuitively, this indicates that Q1 co-occurred with the questions already answered by User3 (Q2 and Q3) more than Q4 did – so User3 will be interested in answering Q1 more than Q4. In an actual implementation, note that the user’s taste matrix will be a sparse matrix (mostly zeros), as the user will have answered only a very limited subset of questions in the past. The advantage of the above logic is that we can use a distributed map reduce model for the computation, with multiple map-reduce tasks – constructing the co-occurrence matrix, finding the dot product for each user, etc.
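Here is a small sketch of that multiplication in plain C#. The Q1–Q3 co-occurrence values come from the bullets above; the Q4 row is made up for illustration, since the diagram isn’t reproduced here:

```csharp
using System;

public class Program
{
    public static void Main()
    {
        //Co-occurrence matrix for [Q1,Q2,Q3,Q4]. The Q1..Q3 values come from
        //the bullets above; the Q4 row/column is assumed for illustration.
        var cooccurrence = new int[4, 4]
        {
            { 2, 2, 2, 0 },   // Q1
            { 2, 3, 3, 1 },   // Q2
            { 2, 3, 3, 1 },   // Q3
            { 0, 1, 1, 1 }    // Q4
        };

        //User3 answered Q2 and Q3 - his preference vector
        var user3Preferences = new int[] { 0, 1, 1, 0 };

        //Multiply: each question's score is the dot product of its
        //co-occurrence row with the user's preference vector
        var scores = new int[4];
        for (var q = 0; q < 4; q++)
            for (var other = 0; other < 4; other++)
                scores[q] += cooccurrence[q, other] * user3Preferences[other];

        //Q1 scores 4, matching the weightage discussed above
        for (var q = 0; q < 4; q++)
            Console.WriteLine("Q{0} score: {1}", q + 1, scores[q]);
    }
}
```

After dropping the already-answered Q2 and Q3, Q1 (score 4) beats Q4, which is exactly the recommendation order described above.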

Now, let us start thinking about the implementation.


From the implementation point of view,

  1. We need to provision a Hadoop Cluster
  2. We need to download and extract the data to analyze (Stack Overflow data)
  3. Job 1 – Extract the Data - From each line, extract {UserId, QuestionId} for all questions answered by the user.
  4. Job 2 – Build the Recommender - Use the output from above Map Reduce to build the recommendation model where possible items are listed against each user.

Let us roll!!

Step 1 - Provisioning Your Cluster

Now remember, the Stack Exchange data is huge. So, we need to have a distributed environment to process the same. Let us head over to Windows Azure. If you don’t have an account, sign up for the free trial. Now, head over to the preview page, and request the HDInsight (Hadoop on Azure) preview.

Once you have the HD Insight available, you can create a Hadoop cluster easily. I’m creating a cluster named stackanalyzer.



Once you have the cluster ready, you’ll see the Connect and Manage buttons in your dashboard (Not shown here). Connect to the head node of your cluster by clicking the ‘Connect’ button, which should open a Remote Desktop Connection to the head node. You may also click the ‘Manage’ button to open your web based management dashboard. (If you want, you can read more about HD Insight here)

Step 2 - Getting Your Data To Analyze

Once you have connected to your cluster’s head node using RDP, you may download the Stack Exchange data. You can download the Stack Exchange sites data from Clear Bits, under Creative Commons. I installed the µTorrent client in the head node, and then downloaded and extracted the data – the extracted files look like this: a bunch of XML files.


What we are interested in is the Posts XML file. Each line represents either a question or an answer. If it is a question, PostTypeId=1, and if it is an answer, PostTypeId=2. The ParentId represents the question’s Id for an answer, and OwnerUserId represents the guy who wrote the answer to this question.

<row Id="16" PostTypeId="2" ParentId="2" CreationDate="2010-07-09T19:13:37.540" Score="3"
     Body="&lt;p&gt;...shortenedforbrevity...  &lt;/p&gt;&#xA;"
     OwnerUserId="34" LastActivityDate="2010-07-09T19:13:37.540" />

So, we need to extract the {OwnerUserId, ParentId} pair for all posts where PostTypeId=2 (answers), which is a representation of {User, Question}. The Mahout recommender job we’ll be using later will take this data, and will build a recommendation result.

Now, extracting this data is itself a huge task when you consider that the Posts file is huge. For the Cooking site it is not so big, but if you are analyzing the entire Stack Overflow, the Posts file may run into GBs. So, for the extraction of this data itself, let us leverage Hadoop and write a custom Map Reduce job.

Step 3 - Extracting The Data We Need From the Dump (User, Question)

To extract the data, we’ll leverage Hadoop to distribute the work. Let us write a simple mapper. As mentioned earlier, we need to figure out {OwnerUserId, ParentId} for all posts with PostTypeId=2, because the input for the recommender job we’ll run later is {user, item}. For this, first load Posts.xml to HDFS. You may use the hadoop fs command to copy the local file to the specified input path.


Now, time to write a custom mapper to extract the data for us. We’ll be using the Hadoop On Azure .NET SDK to write our Map Reduce job. Note that we are specifying the input folder and output folder in the configuration section. Fire up Visual Studio, and create a C# console application. If you remember from my previous articles, hadoop fs <yourcommand> is used to access the HDFS file system, and it’ll help if you know some basic *nix commands like ls, cat etc.

Note: See my earlier posts regarding the first bits of HDInsight to understand more about Map Reduce Model and Hadoop on Azure

You need to install the Hadoop Map Reduce package from Hadoop SDK for .NET via Nuget package manager.

install-package Microsoft.Hadoop.MapReduce 

Now, here is some code where we

  • Create A Mapper
  • Create a Job
  • Submit the Job to the cluster

Here we go.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using Microsoft.Hadoop.MapReduce;

namespace StackExtractor
{
    //Our Mapper that takes a line of XML input and spits out {OwnerUserId,ParentId}
    //i.e, {User,Question}
    public class UserQuestionsMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            try
            {
                var obj = XElement.Parse(inputLine);
                var postType = obj.Attribute("PostTypeId");
                if (postType != null && postType.Value == "2")
                {
                    var owner = obj.Attribute("OwnerUserId");
                    var parent = obj.Attribute("ParentId");

                    // Write output data. Ignore records with null values, if any
                    if (owner != null && parent != null)
                        context.EmitLine(string.Format("{0},{1}", owner.Value, parent.Value));
                }
            }
            catch
            {
                //Ignore this line if we can't parse it
            }
        }
    }

    //Our Extraction Job using our Mapper
    public class UserQuestionsExtractionJob : HadoopJob<UserQuestionsMapper>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.DeleteOutputFolder = true;
            config.InputPath = "/input/Cooking";
            config.OutputFolder = "/output/Cooking";
            return config;
        }
    }

    //Driver that submits this to the cluster in the cloud
    //And will wait for the result. This will push your executables to the Azure storage
    //and will execute the command line in the head node (HDFS for Hadoop on Azure uses Azure storage)
    public class Driver
    {
        public static void Main()
        {
            try
            {
                var azureCluster = new Uri("https://{yoururl}");
                const string clusterUserName = "admin";
                const string clusterPassword = "{yourpassword}";

                // This is the name of the account under which Hadoop will execute jobs.
                // Normally this is just "Hadoop".
                const string hadoopUserName = "Hadoop";

                // Azure Storage Information.
                const string azureStorageAccount = "{yourstorage}";
                const string azureStorageKey = "{yourstoragekey}";
                const string azureStorageContainer = "{yourcontainer}";
                const bool createContinerIfNotExist = true;

                Console.WriteLine("Connecting : {0} ", DateTime.Now);

                var hadoop = Hadoop.Connect(azureCluster, clusterUserName, hadoopUserName,
                                            clusterPassword, azureStorageAccount, azureStorageKey,
                                            azureStorageContainer, createContinerIfNotExist);

                Console.WriteLine("Starting: {0} ", DateTime.Now);
                var result = hadoop.MapReduceJob.ExecuteJob<UserQuestionsExtractionJob>();
                var info = result.Info;

                Console.WriteLine("Done: {0} ", DateTime.Now);
                Console.WriteLine("\nInfo From Server\n----------------------");
                Console.WriteLine("StandardError: " + info.StandardError);
                Console.WriteLine("StandardOut: " + info.StandardOut);
                Console.WriteLine("ExitCode: " + info.ExitCode);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: {0} ", ex.StackTrace.ToString(CultureInfo.InvariantCulture));
            }
            Console.WriteLine("Press Any Key To Exit..");
            Console.ReadLine();
        }
    }
}

Now, compile and run the above program. ExecuteJob will upload the required binaries to your cluster, and will initiate a Hadoop Streaming job that’ll run our mappers on the cluster, with input from the Posts file we stored earlier in the input folder. Our console application will submit the job to the cloud, and will wait for the result. The Hadoop SDK will upload the map reduce binaries to the blob, and will build the required command line to execute the job (see my previous posts to understand how to do this manually). You can inspect the job by clicking the Hadoop Map Reduce status tracker from the desktop shortcut in the head node.

If everything goes well, you’ll see the results like this.


As you see above, you can find the output in /output/Cooking folder. If you RDP to your cluster’s head node, and check the output folder now, you should see the files created by our Map Reduce Job.


And as expected, the files contain the extracted data, which represents the {UserId, QuestionId} pairs for all questions answered by a user. If you want, you can load the data from HDFS to Hive, and then view the same with Microsoft Excel using the ODBC driver for Hive. See my previous articles.

Step 4 – Build the recommender And generate recommendations

As a next step, we need to build the co-occurrence matrix and run a recommender job, to convert our {UserId, QuestionId} data to recommendations. Fortunately, we don’t need to write a Map Reduce job for this. We can leverage the Mahout library along with Hadoop. Read about Mahout here.

RDP to the head node of our cluster, as we need to install Mahout. Download the latest version of Mahout (0.7 as of this writing), and copy the same to the c:\app\dist folder in the head node of your cluster.


Mahout’s recommender job has support for multiple algorithms to build recommendations – in this case, we’ll be using SIMILARITY_COOCCURRENCE. The Algorithms page of the Mahout website has a lot more information about recommendation, clustering and classification algorithms. We’ll be using the files we have in the /output/Cooking folder to build our recommendation.

Time to run the recommender job. Create a users.txt file, place the IDs of the users for whom you need recommendations in that file, and copy the same to HDFS.


Now, the following command should start the Recommendation job. Remember, we’ll use the output files from the Map Reduce job above as input to the Recommender. The job will generate output in the /recommend/ folder for all users specified in the users.txt file. You can use the --numRecommendations switch to specify the number of recommendations you need for each user. If there is a preference relation between a user and an item (such as the number of times a user played a song), you could shape the recommender’s input dataset as {user,item,preferencevalue} – in this case, we are omitting the preference value.
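To make the two input shapes concrete, here is a small sketch of a parser that accepts both boolean {user,item} lines and weighted {user,item,preference} lines (the default weight of 1.0 for boolean data is an assumption for illustration):

```python
def parse_prefs(lines):
    """Parse recommender input lines of the form 'user,item' or
    'user,item,preference'. A missing preference defaults to 1.0
    (boolean, 'the user interacted with this item' data)."""
    prefs = []
    for line in lines:
        parts = line.strip().split(",")
        user, item = int(parts[0]), int(parts[1])
        weight = float(parts[2]) if len(parts) > 2 else 1.0
        prefs.append((user, item, weight))
    return prefs

# Boolean data (our case: a user answered a question)
print(parse_prefs(["1393,6419", "1393,16897"]))  # → [(1393, 6419, 1.0), (1393, 16897, 1.0)]
# Weighted data (e.g. number of times a user played a song)
print(parse_prefs(["42,7,3"]))                   # → [(42, 7, 3.0)]
```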

Note: If the command below fails on a re-run complaining that the output directory already exists, try removing the temp folder and the output folder using hadoop fs -rmr temp and hadoop fs -rmr /recommend/

hadoop jar c:\Apps\dist\mahout-0.7\mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input=/output/Cooking --output=/recommend/ --usersFile=users.txt

After the job finishes, examine the /recommend/ folder and try printing the contents of the generated file. You should see the top recommendations against the user IDs you had in users.txt.
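Mahout's recommender writes text lines that typically look like `userId<TAB>[itemId:score,itemId:score,...]`. Assuming that format, a small parser (a sketch, useful if you want to consume the output from an application) could look like this:

```python
def parse_recommendations(line):
    """Parse a Mahout-style recommender output line such as
    '1393\t[6419:1.0,16897:1.0]' into (userId, [(itemId, score), ...])."""
    user_part, items_part = line.strip().split("\t")
    items = []
    for pair in items_part.strip("[]").split(","):
        item, score = pair.split(":")
        items.append((int(item), float(score)))
    return int(user_part), items

user, recs = parse_recommendations("1393\t[6419:1.0,16897:1.0]")
print(user, recs)  # → 1393 [(6419, 1.0), (16897, 1.0)]
```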


So, the recommendation engine thinks that user 1393 may answer questions 6419, 16897 etc. if we suggest them to him. You could experiment with other similarity classes like SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION etc. to find the best results. Iterate and optimize till you are happy.

For a thought experiment, here is another exercise: examine the Stack Exchange data set and find out how you might build a recommender to show a ‘You may also like’ list of questions, based on the questions a user has favorited.


In this example, we did a lot of manual work to upload the required input files to HDFS and trigger the Recommender job by hand. In fact, you could automate this entire workflow leveraging the Hadoop for Azure SDK, but that is for another post – stay tuned. Real-life analysis involves much more, including writing mappers/reducers for extracting and dumping data to HDFS, automating the creation of Hive tables, performing operations using HiveQL or Pig, etc. However, we just examined the steps involved in doing something meaningful with Azure, Hadoop and Mahout.

You may also access this data in your mobile app or ASP.NET web application, either by using Sqoop to export it to SQL Server, or by loading it into a Hive table as I explained earlier. Happy coding and machine learning!! Also, if you are interested in scenarios where you could tie your existing applications to HDInsight to build end-to-end workflows, get in touch with me.

I suggest you read further.

© 2012. All Rights Reserved.