
A Lap Around HDInsight


(cross-posted from The Blog @ Graemesplace)

I’m currently working with Microsoft’s Patterns and Practices team, researching and documenting best practices guidance for big data analysis with HDInsight. For those of you who may not know, HDInsight is Microsoft’s distribution of Hadoop – an open source platform for analysis of huge volumes of complex data; and as a database guy with a background in BI, I’m obviously interested in figuring out the best ways to use HDInsight, and how it integrates with the existing SQL Server technologies that I’m used to working with. The results of our work will be published soon, but in the meantime, I thought it might be useful to share some basic steps to help you get started with HDInsight.

Before we start however, it might be useful to explain a bit about HDInsight, Hadoop, and Big Data. If you’re involved in database development or BI, you can hardly have missed the growing buzz about Big Data. There seem to be a lot of definitions being bandied around (and products being advertised), and it can all be very confusing – so, here’s my take on it.

  • “Big Data” is a catch-all term that’s popularly used to describe data that you want to analyze, but is difficult to process with traditional database technologies. The difficulty is usually caused by one or more of the so-called three V’s:
    • Volume: The sheer amount of data makes analysis with a traditional relational database impractical.
    • Variety: The data is in difficult-to-process formats, and may be completely unstructured.
    • Velocity: The data is generated at a rapid pace and must be processed as close to real-time as possible.
  • Hadoop is an Apache open source project for a big data analysis solution that is based on distributing the data processing across multiple server nodes in a cluster. The distributed processing uses a technique pioneered by those bright folks at Google, called Map/Reduce. The data in a Hadoop cluster is stored on a distributed file system called HDFS, which enables each node to work on a subset of the data in parallel (the Map phase), and then the results from each node are consolidated into a single result (the Reduce phase). You can write the Map/Reduce code to process your data in Java (which is the native language for Hadoop), or in other languages supported by the Hadoop distribution you’re using (such as JavaScript, Python, or C#). In addition to the core Hadoop project, there are a number of other projects that add value to a Hadoop cluster. In particular, a project named Pig provides an abstraction layer over Map/Reduce that enables you to write and execute data processing workflow steps in a language named Pig Latin, which are then translated to Map/Reduce jobs by the Pig interpreter. Another significant Hadoop related project is Hive, which provides an abstraction over Map/Reduce in which data in HDFS files can be represented as tables and queried with a SQL-like language named HiveQL.
  • HDInsight is the brand-name of the Microsoft distribution of Hadoop. It’s an open source distribution, based on the same Hadoop core as other distributions, and available as an on-premise application for Windows Server, or as a cloud service hosted in Windows Azure.

Provisioning a Cluster

The first thing you need to do is to provision an HDInsight cluster, and this is your first major decision point. You can opt to install a local, on-premise HDInsight cluster on Windows Server via the Microsoft Web Platform Installer, or you can create a cloud-based cluster on Windows Azure. At the time of writing, HDInsight is available as a preview at http://www.microsoft.com/bigdata/, and you can either download HDInsight for Windows Server or sign up for HDInsight on Windows Azure from there. For the purposes of this article, I’ll use HDInsight on Windows Azure, but after you’ve completed the initial sign-up process or installation, the user experience is much the same.

HDInsight on Windows Azure is currently a by-invitation preview, but anyone can be invited by simply signing up. It usually takes a few days for your invitation to arrive by email, and then you can provision a 3-node HDInsight cluster for five days. In this preview program, after the five days, the cluster is automatically released; but you can simply create another one. To provision a cluster, you just need to specify a unique DNS name and a username and password to be used as a login, as shown here.

Picture1

The DNS name you assign will determine the address for the cluster in the cloudapp.net domain. I’ve specified “graemesplace”, so the address of my cluster is http://graemesplace.cloudapp.net.

The provisioning process takes a few minutes, and when your cluster is ready you can access the dashboard page in a browser as shown here:

Picture2

Obtaining and Loading Data for Analysis

So now I have an HDInsight cluster, I’m ready to analyze some data. Being British, I am of course obsessed with the weather, so I plan to do some analysis of historic weather in my home town of Dunfermline, Scotland. To obtain this data, I’m going to use the UK Met Office Weather dataset that is available in the Windows Azure Datamarket, as shown here. If you haven’t explored the Azure Datamarket, it’s a one-stop shop for data from government organizations and private commercial data publishers, and includes a pretty eclectic mix of datasets that you can purchase and download. Of course, like any true Scotsman, I’ve chosen the Met office weather data because it’s free!

Picture3

The Met office data includes weather data from all over the UK, but since I’m only interested in weather statistics for Dunfermline, I’ll filter my search by the site code field (which uniquely identifies each weather station). I can look up the site code for Dunfermline in the Site dataset as shown here.

Picture4

Now I can filter the data from the DailyForecast dataset based on the ID for the Dunfermline site as shown here.

Picture5

Note the URL for the current query. This is a REST-based URI that can be used to download the data from a client application. Note also that you can click Show to display a secure primary account key, which is used to authenticate requests from client applications. One such client application is Excel, in which you can download data from the Windows Azure Datamarket as shown here.

Picture6

To download the data, I just specify the URL and account key from the query page in the Azure Datamarket site, and (after a while) the data appears in the worksheet. I can then remove any columns I don’t need as shown here.

Picture7

Note that the data includes multiple measurements for each day (taken at different times). To analyze the data properly, I plan to use HDInsight to calculate an average figure for each day. I’m happy that I have the columns I need, so I just remove the headings and save the data as a comma-delimited text file to be uploaded to my HDInsight cluster.

You can upload data to the default HDFS storage in the cluster or, if you’re using HDInsight on Windows Azure, you can associate an Azure Blob Store account with the cluster and use Azure storage volume (ASV) containers for your data. Using ASV gives you the advantage of being able to store as much data as you need, beyond the limits of the 1.5 TB of HDFS storage provided with the cluster, and also ensures that your data is not lost when the cluster is released after the 5-day preview period.

To keep things simple, I’m going to use the default HDFS storage in the cluster. You can access this from the interactive console in the HDInsight dashboard and use the fs.put() JavaScript command to display a dialog box that enables you to upload the data file as shown here:

Picture8

You can also use the interactive console to navigate the HDFS file system using UNIX-style commands prefixed with a “#” character. For example, to display a directory listing you can type #ls. HDFS has a root folder (indicated by “/”), and each user has a home folder under the /user folder, so my home folder on the cluster is /user/graeme. By default, your home folder is the current folder when you connect; and as a shortcut, you can use a “.” character to indicate the current folder, so I can view the contents of a subfolder named weather in my home folder by using either of the following commands:

#ls /user/graeme/weather

#ls ./weather

So now I’ve uploaded the comma-delimited text file containing the weather data to ./weather/data.txt, and I’m ready to process it.

Understanding Map/Reduce

Fundamentally, all processing on HDInsight takes the form of a Map/Reduce job. Each node performs a Map job on a subset of the data, and then a Reduce job is performed by a designated node to consolidate the results of the Map jobs and produce a single output file. Natively, Map/Reduce jobs are implemented in Java and compiled to .jar executables, which are uploaded and executed as jobs in the cluster. However, you can also use other languages to implement Map/Reduce code. I’m not a Java programmer (indeed, some would say I’m not a programmer!), so I’ve limited myself to creating a simple example Map/Reduce program in JavaScript as shown here:

Picture9

The code consists of a map function and a reduce function. The map function is run on all cluster nodes, with each node processing its subset of the source data, passed one line at a time in the value parameter. The map code splits the line of data into an array, using a comma as a delimiter, and writes a key/value pair consisting of the date and temperature fields from the source data to the context object.

The reduce function operates on each distinct key generated from the map operations, and iterates through the values associated with that key. The key value is the date, and the values are the temperature readings for that date. The code adds all of the temperature values, and then divides them by the number of readings for that date to calculate an average value. The consolidated results for each key are then written as a key/value pair, so we should end up with a single date and average temperature for each date value in the source data.
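The code itself appears only as a screenshot, so here’s a sketch of what a JavaScript Map/Reduce program along these lines looks like. This is my reconstruction based on the description above, not the original file, and it assumes the conventions of the HDInsight preview’s JavaScript Map/Reduce support (the framework calls map() once per input line and reduce() once per distinct key, and results are emitted through the context object’s write() method):

```javascript
// Reconstruction of the Map/Reduce code described above. The field
// positions assume the day,date,temp,wind column order used in the
// weather data file.

var map = function (key, value, context) {
    // value is one comma-delimited line of the weather data.
    var fields = value.split(",");
    // Emit the date (field 1) as the key and the temperature (field 2) as the value.
    context.write(fields[1], fields[2]);
};

var reduce = function (key, values, context) {
    // key is a date; values iterates over the temperature readings for that date.
    var sum = 0;
    var count = 0;
    while (values.hasNext()) {
        sum += parseFloat(values.next());
        count++;
    }
    // Emit the date and its average temperature, rounded to a whole number
    // to match the avgtemp:long schema declared when the job is run.
    context.write(key, Math.round(sum / count));
};
```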

The HDInsight dashboard provides a handy Create Job tile that you can use to run a Map/Reduce job that has been implemented as a Java .jar executable. However, since I’ve chosen to implement my code in JavaScript, I’ll need another way to execute it. The easiest way to run the job is to upload the JavaScript file (in this case to ./weather/getdailytemp.js) and  use Pig to call it, which I can do by entering the following command in the interactive JavaScript console:

pig.from("./weather/data.txt").mapReduce("./weather/getdailytemp.js", "date, avgtemp:long").to("./weather/avgdailytemp");

This command tells Pig to run a Map/Reduce job on the weather data by using the JavaScript file I’ve uploaded and produce results with two columns: a date column (which in the absence of an explicit data type will be a chararray) and an avgtemp column (which is declared as a long integer). The results are then stored as a file in the ./weather/avgdailytemp folder.

When the job has finished (which can take some time), viewing the contents of the output directory reveals an output file (in this case named part-m-00000), a file named _SUCCESS (which is simply a flag to indicate the outcome of the job), and a _logs folder in which details of the Map/Reduce job execution have been logged.

Picture10

You can use the #cat command to view the contents of the results file as shown here:

Picture11

Using Pig

While I could write custom Map/Reduce code for all of the data processing my analysis needs, it would require considerably more programming skills than I currently have, and would be extremely time-consuming. Fortunately, I can use Pig to do more than just run my own Map/Reduce code; I can also use it to create data processing workflows in a high-level language called Pig Latin, and have Pig translate those workflows to Map/Reduce code for me.

You can execute Pig Latin statements interactively in a command line shell called (I kid you not) Grunt, or you can create a text file that includes a sequence of Pig Latin commands to be executed as a script. The easiest way to get to the Grunt shell is to use the Remote Desktop shortcut in the HDInsight dashboard to open a remote desktop session with the cluster (logging in with the credentials you specified when provisioning the cluster), and then using the Hadoop Command Line shortcut on the desktop to run pig from the c:\apps\dist\pig-0.9.3-SNAPSHOT\bin folder as shown here.

Picture12

When using Pig to process data, typically you execute a sequence of Pig Latin commands, each of which defines a relation. You can think of a relation as being a result set, often tabular in structure. The first command often uses the Pig Latin LOAD statement to populate a relation from a data file, and subsequent commands create new relations by performing operations on the relations created by previous commands. For example, I can use the following sequence of Pig Latin commands to process the weather data in ./weather/data.txt:

Forecasts = LOAD './weather/data.txt' USING PigStorage(',') AS (day, date, temp:long, wind:long);
GroupedForecasts = GROUP Forecasts BY date;
GroupedAvgs = FOREACH GroupedForecasts GENERATE group, AVG(Forecasts.temp) AS avgtemp, AVG(Forecasts.wind) AS avgwind;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp, avgwind;

The first command loads the comma-delimited weather data to create a relation named Forecasts with four columns (day, date, temp, and wind).

The second command creates a new relation named GroupedForecasts that contains the Forecasts relation grouped by date.

The third command creates a relation named GroupedAvgs that calculates the average temp and wind values for each group in GroupedForecasts.

The fourth command creates a relation named AvgWeather, which flattens the date group in the GroupedAvgs relation to create a row for each date with date, avgtemp and avgwind columns.

Pig doesn’t actually generate any Map/Reduce jobs until you call either a DUMP statement (to display the contents of a relation in the console) or a STORE statement (to save a relation as a file). For example, executing the command DUMP AvgWeather after the four commands above would create the necessary Map/Reduce jobs to process the workflow, and display the results in the console window. The command STORE AvgWeather INTO './weather/dailyaverages' would store the results in the ./weather/dailyaverages folder.

To run the commands as a script, I simply save them in a file named ProcessWeather.pig and run them from the command line using the pig program, as shown here:

Picture13
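For reference, the complete ProcessWeather.pig script is just the four statements described above, followed by a STORE command (the output folder name matches the one used in the rest of this article):

```pig
-- ProcessWeather.pig: process the weather data and store the daily averages.
Forecasts = LOAD './weather/data.txt' USING PigStorage(',') AS (day, date, temp:long, wind:long);
GroupedForecasts = GROUP Forecasts BY date;
GroupedAvgs = FOREACH GroupedForecasts GENERATE group, AVG(Forecasts.temp) AS avgtemp, AVG(Forecasts.wind) AS avgwind;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp, avgwind;
STORE AvgWeather INTO './weather/dailyaverages';
```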

The output from the Pig script is similar to that of the Map/Reduce job I ran earlier. When the script has completed, the ./weather/dailyaverages folder contains a file named part-r-00000, a _SUCCESS file, and a _logs folder. Viewing the part-r-00000 file shows the results of the processing:

Picture14

Using Hive

Until now, all of the data processing we’ve done with Map/Reduce code and Pig Latin has generated results in text files. You can view the text files in the interactive console or download them to a client application such as Excel, but it would be good to be able just to consume results from HDInsight the same was as you do from a relational database – by executing queries against tables. That’s exactly what Hive is designed to let you do.

With Hive, you can create tables that get their data from files in HDFS folders. The important thing to understand here is that Hive tables don’t contain data; but rather they provide a metadata layer that projects a schema onto data in underlying files. What does that mean? Well, Hive (in common with Pig and all Map/Reduce code) uses an approach called schema on read, in which a table is associated with a folder in the HDFS folder structure (actually, a table could be partitioned across multiple folders, but let’s not get ahead of ourselves!). When you query a table, the data in any files found in the related folder is read, and only then are the columns and data types in the table definition used to apply structure to the data. You can create a Hive table based on an empty folder, and then copy files to that folder (either source files or the results of Map/Reduce or Pig processing). When you query the table, a Map/Reduce job is generated to apply the table schema to the data in the files you have placed in the folder.

For example, I could create a table for average weather readings by using the following HiveQL statement:

CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)

If you’re used to working with SQL Server, or any other SQL-based database, this should look pretty familiar. However, because the data will actually be retrieved at query-time from underlying files, there are a few additional options that you typically don’t find in relational databases. For example, you can specify the delimiter you want Hive to use when mapping data in the files to columns in the table by including a ROW FORMAT DELIMITED clause as shown in the following screenshot of the Hive interactive console. In this case, the fields are tab-delimited (which matches the output generated by the Pig script I ran earlier):

Picture15
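In text form, the statement shown in the screenshot is presumably along these lines – the same table definition as before, with the delimiter clause added:

```sql
CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
```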

The folder that this table maps to is automatically created in the /hive/warehouse folder, and is initially empty as shown here (note that a folder already exists for the hivesampletable table, which is provided with every HDInsight cluster):

Picture16

If you prefer not to use the default location, you can specify an alternative location as shown in the following code:

CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/graeme/weather/avgweather'

The ability to use a non-default location is useful if you want processes outside of Hive to be able to access the data files, or if the files already exist and you just want to create a table over them. However, you should be aware that by default, Hive treats all tables as INTERNAL tables, which means that it tightly binds the lifetime of the folder to that of the table. Put more simply, if you delete a table by using the DROP TABLE command, the folder associated with the table (along with all the files it contains) is deleted. If you want the folder to exist independently of the table, you must create it as an EXTERNAL table as shown here:

CREATE EXTERNAL TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/graeme/weather/avgweather'

After you have created a table, you can populate it by simply adding files to the folder with which the table is associated. You can do this by using the #cp (copy), #mv (move), #copyFromLocal, or #moveFromLocal HDFS commands in the interactive JavaScript console, or by using the LOAD HiveQL command in the interactive Hive window as shown here:

Picture17
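The LOAD statement in the screenshot is presumably something like the following – the path is my reconstruction from the folder and file names used earlier, so treat it as illustrative:

```sql
LOAD DATA INPATH '/user/graeme/weather/dailyaverages/part-r-00000' INTO TABLE avgweather
```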

This code loads data from the part-r-00000 file generated by the Pig script earlier. After executing this command, the contents of the dailyaverages folder created by the Pig script and the avgweather folder mapped to the table look like this:

Picture18

Note that the LOAD command has moved (not copied!) the part-r-00000 file from the dailyaverages folder to the avgweather folder.

Now that the folder associated with the table contains some data, you can query the table to project its schema onto the data, as shown here:

Picture19
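Any HiveQL query will do the trick; for example, a simple SELECT like this one projects the schema onto the files in the folder and returns the averaged readings:

```sql
SELECT weatherdate, avgtemp, avgwind
FROM avgweather
ORDER BY weatherdate
```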

Using Hive from Excel

One of the main benefits of Hive is that HDInsight includes an ODBC driver for Hive that enables client applications to connect to HDInsight and execute HiveQL queries, just like they can with SQL Server (or any other ODBC-compliant database server). By default, the port used by ODBC (10000) is closed, but you can open it by simply clicking the Open Ports tile on the HDInsight dashboard and toggling the switch as shown here:

Picture20

To install the ODBC driver, click the Downloads tile on the HDInsight dashboard and download the appropriate version for your system (32-bit or 64-bit). Note that the driver download also installs an add-in for Excel, so be sure to choose the CPU architecture that matches your Office installation (and remember that you can install 32-bit Office on 64-bit Windows – I’m just saying!):

Picture21

After you’ve installed the driver, you can use it from Excel (and other ODBC-aware tools like SQL Server Integration Services, SQL Server Analysis Services (in tabular mode), and SQL Server Reporting Services). Installing the driver also installs a Hive Pane add-in for Excel, which makes it easy for users to connect to HDInsight and create HiveQL queries. However, I experienced a few issues when using this with Excel 2013 (and it seems I’m not alone), so rather than try to create connections explicitly in Excel, I recommend you create a system data source name (DSN) and use that – for some reason it seems a little more stable in this pre-release build!

To create a DSN, go to Administrative Tools and open Data Sources (ODBC). Then create a DSN that uses the Hive ODBC driver and references your HDInsight cluster, as shown here:

Picture22

Now you’re ready to connect to Hive on your HDInsight cluster from Excel. To do that, on the Data tab of the ribbon, click Hive Pane, and then select the DSN you created in the Hive Query pane and enter your credentials for the HDInsight cluster. You can then select a table and specify columns and filtering criteria to generate a HiveQL query. Executing the query retrieves the results into the worksheet:

Picture23

By default, the Hive Pane limits the results to 200 rows by adding a LIMIT clause to the query, but at the bottom of the Hive Query pane you can edit the query that has been generated and enter any HiveQL you like. After you have retrieved the data into Excel, you can sort and filter it, use it as the source for a chart, or pretty much do whatever analysis you need to:

Picture25

I’ve only really scratched the surface of HDInsight in this article, so be sure to keep an eye on the Patterns and Practices site for more detailed guidance. In the meantime, if I’ve inspired you to experiment with HDInsight for yourself, the best place to start is http://www.microsoft.com/bigdata/.

The post A Lap Around HDInsight appeared first on Content Master.

