Step by Step guide to run Apache Spark on HDInsight cluster


Recently I have been experimenting a bit with cloud technologies including Amazon Web Services (AWS) and also Microsoft Azure. I have MSDN subscription which entitles me to free credits to utilize Azure resources. I was trying to run some big data workloads on Microsoft Azure’s offering in the Big Data space called HDInsight. HDInsight offers multiple types of clusters including Hadoop, HBase, Storm, Spark, R Server, Kafka and Interactive Hive. As of this writing Kafka & Hive clusters are in preview state. I decided to try the Spark cluster as I am currently exploring different features of Spark. This post is about different steps required to run Spark jobs using HDInsight cluster. I will start with provisioning the HDInsight cluster and in the following posts extend it to show executing Spark jobs from within IntelliJ IDE and also from Azure command line in upcoming posts.

Pre-requisites for running the code examples

If you wish to have a look at the source code, clone or download the repo learning spark from my GitHub repository.
As mentioned in the file of the repo, following dependencies must be satisfied in order to build the project

  • Java 8 is installed on your laptop / PC
  • JAVA_HOME environment variable is set correctly
  • Apache Spark 1.6.3 with Hadoop 2.6 is installed
  • SPARK_HOME environment variable is set correctly
  • Maven is installed
  • It is also good to have an IDE. I prefer to use the IntelliJ IDEA community edition. You can use other IDE like Eclipse or Visual Studio Code. As we will see later in the post, IntelliJ or Eclipse will give us the benefit or running code on Azure cloud directly from the IDE using a plugin. If you prefer to use VS Code, I will also demonstrate how to run Spark using Azure Command Line Interface (CLI).

Additional dependencies

MovieLens dataset : MovieLens dataset is publicly available sample dataset for performing analytics queries and for big data processing.
Azure Subscription : Apart from building the project from source code, you also need to have Microsoft Azure subscription in order to submit the jobs to remote cluster. I assume all the pre-requisites mentioned here are fulfilled.

Build the project

The codebase contains some basic Spark programs like WordCount and examples of PairRDD, MapToDouble, caching example etc. There are programs using the MovieLens dataset which are more for running in the cluster scenario. Open the project in your favorite IDE and build the project. I prefer to do it by using the terminal using mvn clean package command in the root of the source code directory. It might take some time to download all the dependencies from various repositories if this is the first item you are running maven. If all the environment variables were set correctly, you should have a successful build and the package should be created under the target folder.

Run Spark locally

maven build output

We can test the Spark program by running it locally using the command

spark-submit --class com.nileshgule.MapToDoubleExample \
--master local \
--deploy-mode client \
--executor-memory 2g \
--name MapToDouble \
--conf "" \

Note that we are specifying the master as local in the above execution. We are also setting the limit for executor memory to 2 GB. Once again if everything was setup correctly, then you should see the output similar to below screenshot

spark output local

Provision HDInsight cluster

There are multiple steps required inorder to run the Spark on HDInsight cluster. Each of these step can take some time. First of all we need to provision the cluster resources. This can be done in two ways. Easiest is to login to the Azure web portal. Alternate option is to use the Azure CLI. As of this writing Azure CLI 2.0 does not support provisioning HDInsight cluster. You can use the older version. I had 2.0 version installed so I was forced to use the web portal method. Refer to the Azure documentation on details related to provisioning different types of HDInsight clusters for more details. I will be using the ssh based approach to connect to the head node in the cluster. Before we provision the cluster, I need to generate the RSA public key. On my Mac I can generate the key by executing the command
ssh-keygen -t rsa

Provide the file location (default will be prompted, you can keep the default as is) and the passphrase. Remember the passphrase as it will be required later. With this prerequisite done, login to the Azure portal with your credentials. From the dashboards navigate to Add new resource screen and click on Data + Analytics section. HDInsight is the first option on the right. For quick access you can also search for HDInsights directly in the search bar.

search HD Insight

1 - Basic settings

basic config spark version

First thing we need to ensure is that the cluster name is unique. In my case I am using the MSDN subscription, if you have multiple subscriptions you will need to choose the one for billing purposes. Next we need to select the type of cluster. We select cluster type as Spark. The version is dependent on the type of cluster. In case of Spark cluster we need to select the appropriate Spark version. I chose Spark 1.6.3 as that is the one I am currently experimenting with. You can chose the other available versions if you wish to.

Next we need to provide credentials for logging into the cluster. We need an admin account and also the sshuser account which can enable us to submit the Spark jobs remotely. Provide the admin user password. Make sure to uncheck the use same password as cluster login checkbox. Instead we will use the public key we generated as the key by either selecting the file or pasting the contents of the file. When I created the cluster using the same password as admin user for the sshuser, I was unable to login to the head node.

In the SSH authentication type, select the PUBLIC KEY option.You can select the file or paste the contents of the .pub file from the location where you saved it when the RSA key was generated in the earlier step. In my case the file is stored in my home directory at ~/.ssh/

basic configuration

Penultimate step is to specify the resource group. I chose to reuse an existing one. You can chose one of the existing resource group or create a new one. Final step is to chose the location or region where the cluster will be created. In my case it was prefilled with East Asia.

2 - Storage

In this step I need to provide the storage details including Storage Account Settings and Meta Store Settings. The metastore settings are optional. I selected an existing storage account named ngstorageaccount and a default container as ng-spark-2017-08-18t14-24-10-259z. Ideally this should be having a meaningful name. I ended up reusing the default container name created for me by Azure since the first time. I do not need the additional storage accounts and Data Lake store access so I leave them blank. For the moment I do not wish to persist the metadata outside of the cluster. In a production scenario, it might be a good idea to store metadata outside of the cluster.

storage options

3 - Summary

Summary blade provides the summary of our choices made so far. It gives us last opportunity to customize the settings before Azure takes over and provisions the cluster resources for us.


I will make a slight change to the cluster nodes configuration by editing the cluster size. The default number of worker nodes is 4. I don’t intend to run heavy workloads at the moment. So I reduced it down to 2 worker nodes. Also the hardware configuration of each worker node is D4 V2 type. I changed it to a scaled down version with D12 V2 type.

resize head node

I did not modify any of the advanced setting and click the Create button after final review of all the settings. It will take 15 to 20 minutes to provision all the resources of the cluster. With this setup we are ready to run some Spark jobs on the cluster. If everything goes fine you should see a HDInsight cluster created as shown below

available cluster

In the part 2 of this post, I will demonstrate how to use IntelliJ IDE to submit a Spark job to Azure HDInsight cluster. The part 3 will focus on running the Spark jobs using Azure CLI. Till then happy programming.


No comments:

Post a Comment