A Quick Introduction to Hadoop Hive on Azure and Querying Hive Using LINQ in C#

Earlier, in a couple of posts related to Hadoop on Azure (Analyzing some ‘Big Data’ using C# and Extracting Top 500 MSDN Links from Stack Overflow), I showed how to use C# Map Reduce jobs with Hadoop Streaming to do some meaningful analytics.

Now, a preview version of the .NET SDK for Hadoop is available, making it easier to work with Hadoop from .NET, with more types for supporting Map Reduce jobs, for creating LINQ to Hive queries, etc. You can experiment with Hadoop and C# either by creating a cluster at http://hadooponazure.com or by installing Microsoft HDInsight on your own machine using WebPI.

In case you are new to Hadoop on Azure, I suggest you read the introductory concepts here before you start. This post is just a quick example that shows how to use LINQ to Hive.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.

Installing the libraries

To start with, fire up Visual Studio, create a console project, and install the Microsoft.Hadoop.Hive libraries via NuGet.

    Install-Package Microsoft.Hadoop.Hive -Pre

Also, head over to http://hadooponazure.com and create a new cluster. You are now all set.

Creating the typed wrappers

To access Hive, you need to create a strongly typed wrapper. As of now, you need to roll your own, as there is no automated generation support. When you provision a Hadoop cluster, Hive comes pre-populated with a sample table (hivesampletable), and I’m using it in the example below for brevity. You can also connect to Hive via ODBC and view the Hive tables in Excel.

So, let us go ahead and create a Hive connection (much like an EF data context) and a typed representation of a row in the table. The HiveConnection and HiveTable types are in the Microsoft.Hadoop.Hive namespace.

    
    using Microsoft.Hadoop.Hive;

    // Our concrete Hive connection
    public class SampleHiveConnection : HiveConnection
    {
        public SampleHiveConnection(string hostName, int port) 
            : base(hostName, port, null, null) { }

        public SampleHiveConnection(string hostName, int port, 
                            string username, string password) 
            : base(hostName, port, username, password) { }

        public HiveTable<DeviceInfo> DeviceInfoTable
        {
            get
            {
                return this.GetTable<DeviceInfo>("hivesampletable");
            }
        }
    }

    // A typed row. Property names are based on the field names in hivesampletable.
    public class DeviceInfo : HiveRow
    {
        public string DevicePlatform { get; set; }
        public string DeviceMake { get; set; }
        public int ClientId { get; set; }
    }

Querying the Hive using LINQ

Now you can perform LINQ queries against your Hive context, thanks to the Hadoop SDK we installed via NuGet. Just make sure to substitute the host name, username and password with your own.

    using System;
    using System.Linq;

    class Program
    {
        static void Main(string[] args)
        {
            // Create a Hive connection.
            // My cluster is hosted at https://www.hadooponazure.com
            var hive = new SampleHiveConnection(
                    "saintcluster.cloudapp.net", // your cluster host name
                    10000,                       // port
                    "user",                      // your username
                    "yourpass");                 // your password

            // Define the query.
            // Make sure you go to the dashboard and turn on the ODBC port.
            var res = from d in hive.DeviceInfoTable
                      where d.ClientId < 100
                      select d;

            // Run the query and pull the results into a list
            var list = res.ToList();

            // Dump it to the console if you like
            foreach (var d in list)
            {
                Console.WriteLine("{0}\t{1}\t{2}", d.ClientId, d.DevicePlatform, d.DeviceMake);
            }
        }
    }

That is cool. Your LINQ query is submitted to the Azure cluster via the ODBC driver, where it is compiled and executed in Hive.
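If you want to try the shape of the query before your cluster is ready, here is a minimal sketch that runs the same filter against an in-memory list instead of Hive. Note the assumptions: DeviceInfoRow is a hypothetical stand-in that does not derive from HiveRow, QueryShapeDemo and ClientsBelow are names I made up for illustration, and the sample rows are invented. It uses plain LINQ to Objects, so nothing here touches the Hive SDK or the ODBC driver.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the DeviceInfo row. It does not derive from
// HiveRow, so this sketch runs without a cluster or the Hive SDK.
public class DeviceInfoRow
{
    public string DevicePlatform { get; set; }
    public string DeviceMake { get; set; }
    public int ClientId { get; set; }
}

public static class QueryShapeDemo
{
    // Applies the same filter as the Hive query (ClientId below a limit),
    // but over any in-memory sequence of rows.
    public static List<DeviceInfoRow> ClientsBelow(
        IEnumerable<DeviceInfoRow> rows, int limit)
    {
        var res = from d in rows
                  where d.ClientId < limit
                  select d;

        // ToList() enumerates the query and materializes the matches.
        return res.ToList();
    }

    public static void Main()
    {
        // Invented sample rows, for illustration only.
        var rows = new List<DeviceInfoRow>
        {
            new DeviceInfoRow { ClientId = 8,   DevicePlatform = "Android", DeviceMake = "Samsung" },
            new DeviceInfoRow { ClientId = 42,  DevicePlatform = "iOS",     DeviceMake = "Apple" },
            new DeviceInfoRow { ClientId = 250, DevicePlatform = "Android", DeviceMake = "HTC" }
        };

        foreach (var d in ClientsBelow(rows, 100))
        {
            Console.WriteLine("{0}\t{1}\t{2}", d.ClientId, d.DevicePlatform, d.DeviceMake);
        }
    }
}
```

The query expression is written the same way as the one against hive.DeviceInfoTable. The difference is where it runs: here the filter is evaluated in your own process, while with the Hive connection it is handed over to the cluster.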
