Submit Apache Spark job from Command Line to HDInsight cluster

Background

This is the third part of the step-by-step guide to running Apache Spark on an HDInsight cluster. The first part was about provisioning the cluster, and the second part was about submitting jobs using the IntelliJ IDE. In this post we will use the Azure CLI to interact with the HDInsight cluster.

Pre-requisites

Azure CLI

This prerequisite was mentioned in the first part of this series.

Movielens dataset

We will use this public dataset for our analysis. You can download the dataset to your computer.
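If you prefer to do this from the terminal as well, the full dataset can typically be fetched and unpacked with a couple of commands. This is just a sketch; the download URL below is GroupLens's public location for the ml-latest archive at the time of writing, so adjust it if the dataset has moved.

wget http://files.grouplens.org/datasets/movielens/ml-latest.zip
unzip ml-latest.zip

This leaves the CSV files in a folder named ml-latest, which is the folder we will upload to blob storage later in this post.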

Note that some resource names differ slightly between the earlier posts and this one. Due to some changes to my subscription, I had to recreate a few resources while creating the HDInsight cluster. The default container is now named ng-spark-2017 instead of ng-spark-2017-08-18t14-24-10-259z, and the storage account name has changed from ngstorageaccount to ngsparkstorageaccount.

Use CLI to interact with Azure resources

Assuming the Azure CLI is installed successfully, we can log in to the Azure subscription by running the following command from our favorite terminal.

az login

You will be presented with the URL https://aka.ms/devicelogin and a code. Open the URL in a browser, enter the code, and you should be logged in to the Azure subscription.
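To double-check that the login worked, you can print the subscription the CLI is currently using. The az account show command is part of the standard Azure CLI; the table output format below is just a readability preference.

az account show --output table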

We need to upload the application jar as well as the files from the Movielens dataset to the blob storage account. In order to securely transfer files over the network, Azure provides two access keys per storage account. We can query these keys using the list keys command.

az storage account keys list \
--resource-group ngresourcegroup \
--account-name ngsparkstorageaccount

list keys

The command returns the keys as shown above. We can now use one of the keys to query the available containers.

az storage container list \
--account-name ngsparkstorageaccount \
--account-key <<your key here>>

list containers

I have tried using both keys and the result is the same. As seen in the above screenshot, we have ng-spark-2017 as the container. This is the default container specified at the time of creating the cluster. Now we upload our application jar to this blob storage container.

az storage blob upload \
--account-name ngsparkstorageaccount \
--account-key <<your key here>> \
--file target/learning-spark-1.0.jar \
--name learning-spark-1.0.jar \
--container-name ng-spark-2017

upload jar

In case you get an error, check the path of the jar file. With this step we have successfully uploaded the jar to the storage account named ngsparkstorageaccount, into the container named ng-spark-2017, with the filename learning-spark-1.0.jar. You will find these items highlighted in the above screenshot. The next step is to upload the files from the Movielens dataset to blob storage.

az storage blob upload-batch \
--destination ng-spark-2017/movielense_dataset \
--source ml-latest \
--account-name ngsparkstorageaccount \
--account-key <<your key here>> 

The command is similar to the previous one, with the difference that we are using batch mode (upload-batch) to transfer the files. Also note that instead of the file and name parameters we are using the source and destination parameters. This command will take some time to complete depending on your internet speed. If you are a bit tired of typing lengthy commands, step back and take a coffee break.
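Once the upload completes, you can optionally verify that the files have landed in the container. The sketch below uses the standard az storage blob list command with the same account and key; the movielense_dataset prefix matches the destination folder we used above.

az storage blob list \
--container-name ng-spark-2017 \
--prefix movielense_dataset \
--account-name ngsparkstorageaccount \
--account-key <<your key here>> \
--output table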

Login to Head node

Once the artefacts are transferred to blob storage, we can issue the spark-submit command. We start by logging into the head node of the cluster. Once again, the exact command for connecting to the head node is available via the Azure portal. Navigate to the cluster dashboard and you will find an option to connect using secure shell (SSH). Execute the command shown on the portal in the terminal.

ssh <<your username>>@ng-spark-ssh.azurehdinsight.net

You will need the passphrase that was used to generate the RSA key in the first part of the series.

ssh prompt

Once the passphrase is validated, you will be connected to the head node of the cluster. The head node is added to the known hosts list, which allows future connections using the same credentials.

ssh success

Pro-tip:

At times when you try to connect to the head node using ssh, you might get a host key verification error. Run the following command

ssh-keygen -R ng-spark-ssh.azurehdinsight.net

Rerun the ssh command shown above and you should be able to log in.

Submit Spark jobs

Let's start by issuing a command that executes a simple Spark program.

spark-submit \
--class com.nileshgule.PairRDDExample \
--master yarn \
--deploy-mode cluster \
--executor-memory 2g \
--name PairRDDExample \
--conf "spark.app.id=PairRDDExample" \
wasb://ng-spark-2017@ngsparkstorageaccount.blob.core.windows.net/learning-spark-1.0.jar

Note the syntax used for accessing the jar file here. We are using the Windows Azure Storage Blob (WASB) scheme to access files stored in blob storage. You can refer to the post on understanding WASB and Hadoop storage in Azure for more details.
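For reference, a WASB URI names the container, the storage account and the path of the file within the container. The general shape is shown below; using wasbs instead of wasb routes the traffic over SSL.

wasb[s]://<container-name>@<storage-account-name>.blob.core.windows.net/<path/to/file>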

spark submit

The program runs successfully. It uses an in-memory collection and does not really interact with Hadoop resources. Let's trigger another Spark job that actually reads the files from the Movielens dataset.

spark-submit \
--packages com.databricks:spark-csv_2.10:1.5.0 \
--class com.nileshgule.movielens.UserAnalysis \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--executor-memory 2G \
--executor-cores 6 \
--name UserAnalysis \
--conf "spark.app.id=UserAnalysis" \
wasb://ng-spark-2017@ngsparkstorageaccount.blob.core.windows.net/learning-spark-1.0.jar \
wasb://ng-spark-2017@ngsparkstorageaccount.blob.core.windows.net/movielense_dataset/ratings.csv \
wasb://ng-spark-2017@ngsparkstorageaccount.blob.core.windows.net/movielense_dataset/movies.csv

user analysis results

The program takes some time to finish. I am passing ratings.csv and movies.csv as the input files. In this example we can see that files stored in Azure blob storage can be accessed like normal files using WASB.

Verify job execution

There are multiple ways to verify the output using the options available from the HDInsight cluster dashboard. We will use one of them this time around. The cluster dashboard link takes us to a blade with 5 options, as shown below.

portal cluster dashboard

The Spark history server and YARN options are the easiest. We will look at the Spark history server link.

spark history server

This shows the status of recent jobs. You can drill down into the details of each job, which takes us to the Spark history UI. The Spark history UI provides rich information about different aspects of Spark internals. Those details are well documented and are outside the scope of this blog post.
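If you prefer to stay in the terminal, similar information is available from the head node through the YARN CLI. The commands below are standard YARN client commands; the application id is a placeholder, so substitute the id reported for your job.

yarn application -list -appStates FINISHED
yarn logs -applicationId <<application id>>

The first command lists the completed applications along with their ids and final status, and the second prints the aggregated container logs for a specific application.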

Conclusion

As we saw during the course of this post, it is quite easy to use command line tools to connect to an HDInsight cluster and submit Spark jobs from the head node. We also had the chance to scratch the surface of the Azure CLI when uploading files from the local machine to blob storage. I have tried to run Spark workloads on other cloud platforms, and mind you, it is not so easy. Microsoft being Microsoft makes it very easy for developers to use their products. I hope the three posts in this series have been helpful. Until next time, Happy Programming.

