Showing posts from June, 2012

Top 500 MSDN Links from Stack Overflow Posts

In my previous post, I explained how to run C# Map/Reduce jobs in Hadoop on Azure to find the top namespaces in Stack Overflow posts. After that, I ran another Map/Reduce job on the Stack Overflow data dump, and here is the list of the top 500 MSDN URLs we all referred to in our Stack Overflow posts. This was run on only part of the post data from the Stack Overflow data dump. I thought it was worth sharing because the results looked very interesting.

I used the following Mapper to parse this, almost the same as in the previous example, with a regex to parse the URLs.

    using System.IO;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    namespace StackOverflowAnalyzer.Mapper
    {
        class Program
        {
            static void Main(string[] args)
            {
                string line;
                Regex reg = new Regex(@"http(s)?://msdn([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)\.as…
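
Since the excerpt above is cut off, here is a minimal, self-contained sketch of what a streaming-style mapper of this kind could look like. It is not the author's code: the namespace `MsdnLinkMapperSketch`, the `Map` helper, and the simplified URL pattern are all assumptions made for illustration, and the pattern is deliberately looser than the truncated regex in the post. It reads lines from standard input and emits one tab-separated `url<TAB>1` record per MSDN link it finds.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace MsdnLinkMapperSketch
{
    public static class Mapper
    {
        // Hypothetical, simplified pattern (NOT the regex from the post):
        // match an MSDN URL up to the first whitespace, quote, angle bracket,
        // or '&' (the dump HTML-encodes quotes as &quot;).
        private static readonly Regex MsdnUrl = new Regex(
            @"https?://msdn\.microsoft\.com[^\s""<>&]*",
            RegexOptions.IgnoreCase);

        // Map: one input line -> zero or more "key<TAB>1" records.
        public static IEnumerable<string> Map(string line)
        {
            foreach (Match m in MsdnUrl.Matches(line))
            {
                yield return m.Value + "\t1";
            }
        }

        public static void Main()
        {
            // Read input line by line from stdin and write records to stdout.
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                foreach (string record in Map(line))
                {
                    Console.WriteLine(record);
                }
            }
        }
    }
}
```

A reducer would then only need to sum the `1` values for each distinct URL to produce the counts behind a "top 500" list.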

Analyzing some ‘Big’ Data Using C#, Azure And Apache Hadoop – A Stack Overflow .NET Namespace Popularity Finder

Time to do something meaningful with C#, Azure and Apache Hadoop. In this post, we’ll explore how to create a Mapper and Reducer in C# to analyze the popularity of namespaces in Stack Overflow posts. Before we begin, let us briefly explore Hadoop and Map/Reduce concepts.

A Quick Introduction To Map Reduce

Map/Reduce is a programming model for processing very large data sets, initially implemented by Google. The Map and Reduce functions are pretty simple to understand.

Map(list) -> List of Key, Value
The Map function processes a data set and splits it into multiple key/value pairs.

Aggregate, Group
The Map/Reduce framework may perform operations like group, sort, etc. on the output of the Map function. Grouping is done based on the keys, and the values for a given key are passed to the Reduce method.

Reduce(Key, List of Values for the key) -> Another List of Key, Value
The Reduce method normally performs an aggregate function (sum, average or even other com…
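
The Map, Group, Reduce flow described above can be sketched in memory with LINQ. This is only an illustration of the model, not Hadoop code and not the post's actual Mapper or Reducer: the `MapReduceSketch` namespace, the `Map`, `Reduce`, and `Run` helpers, and the rule that any whitespace-separated token starting with `System.` counts as a namespace are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace MapReduceSketch
{
    public static class Demo
    {
        // Map: one input record -> zero or more (key, value) pairs.
        // Assumption: a "namespace" is any token that starts with "System.".
        public static IEnumerable<KeyValuePair<string, int>> Map(string line)
        {
            string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string token in tokens)
            {
                if (token.StartsWith("System.", StringComparison.Ordinal))
                {
                    yield return new KeyValuePair<string, int>(token, 1);
                }
            }
        }

        // Reduce: one key plus all of its values -> one aggregated pair (here, a sum).
        public static KeyValuePair<string, int> Reduce(string key, IEnumerable<int> values)
        {
            return new KeyValuePair<string, int>(key, values.Sum());
        }

        // Run: map every line, group the pairs by key, then reduce each group.
        public static Dictionary<string, int> Run(IEnumerable<string> lines)
        {
            return lines
                .SelectMany(line => Map(line))
                .GroupBy(pair => pair.Key, pair => pair.Value)
                .Select(group => Reduce(group.Key, group))
                .ToDictionary(pair => pair.Key, pair => pair.Value);
        }

        public static void Main()
        {
            var lines = new[]
            {
                "using System.Linq and System.IO here",
                "System.Linq is handy"
            };

            // Print namespaces ordered by how often they were seen.
            foreach (var pair in Run(lines).OrderByDescending(p => p.Value))
            {
                Console.WriteLine(pair.Key + "\t" + pair.Value);
            }
        }
    }
}
```

In a real Hadoop job the grouping step is done by the framework between the Map and Reduce phases; the `GroupBy` call only stands in for it here.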