Submit Apache Spark job from IntelliJ to HDInsight cluster

Background

This is the second part of the Step by Step guide to run Apache Spark on HDInsight cluster. In the first part we saw how to provision an HDInsight Spark cluster with Spark 1.6.3 on Azure. In this post we will see how to use the IntelliJ IDEA IDE to submit a Spark job to the cluster.

Pre-requisite

IntelliJ IDEA installed on your machine. I am using the Community Edition, but the same works with the Ultimate Edition as well.

Azure Toolkit for IntelliJ, an IntelliJ plugin. This plugin provides functionality to create, test and deploy Azure applications from within IntelliJ. If you are an Eclipse user, there is a similar plugin named Azure Toolkit for Eclipse. I have not used it personally. I prefer to work with IntelliJ, as it makes me feel more at home, since I come from a C# background, having used Visual Studio for more than a decade and a half.

Steps to submit Spark job using IntelliJ

1. Log in to the Azure account

If the plugin is installed correctly, you should see an option to sign in to Azure as shown below

Azure toolkit

Select the Azure Sign In option to log in to the Azure account. This will bring up the Azure Sign In dialog.

2. Sign in to Azure in Interactive mode

Interactive login

Use the default option of Interactive and click on Sign in. I have not yet tested the automated authentication method.

Azure login dialog

Provide the Windows Live account details which have access to the Azure subscription. If the login is successful, your subscription details will be pulled into the IDE as shown below

Select Subscription

As I have only one subscription, I can click the Select button to access the different services associated with it.

3. Explore Azure resources (Optional)

We can verify that the different resources available under this subscription can be accessed using the Azure Explorer. Navigate to Azure Explorer from the Tools menu by selecting the Tool Windows option.

Azure explorer menu

Selecting this option brings up the Azure Explorer sidebar. We can access resources like Container Registries, Docker hosts, HDInsight clusters, Redis Caches, Storage Accounts, Virtual Machines and Web Apps.

Azure explorer sidebar

In the above screenshot we can see the storage account named ngstorageaccount associated with my Azure subscription.

4. Submit Spark job

Let's move on to the most important part, which is submitting the Spark job to the cluster. From the sidebar, navigate to the Project pane. Select the file containing the main method. In fact, any file would do, as we specify the details in the dialog box that appears. Right-click and select Submit Spark Application to HDInsight from the context menu. The option is available right at the bottom of the context menu.

right click submit

This brings up a dialog box where we can specify the details related to the job. Select the appropriate options from the drop-down lists. I have selected the Spark cluster, the location of the jar file and the name of the main class, com.nileshgule.MapToDoubleExample. Note that the fully qualified class name needs to be supplied here.

submit spark job

We can provide runtime parameters like driverMemory, driverCores, etc. I chose to go with the default values. The widget also provides options to pass command-line arguments, referenced jars and referenced files if required.
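
For reference, the fields in this dialog correspond roughly to the flags of a plain spark-submit run against the cluster. A minimal sketch, assuming the same jar and main class as above; the plugin handles the cluster connection itself, so this only illustrates how driverMemory and driverCores map to command-line options:

# illustrative values only; I left driver memory/cores at their defaults in the dialog
spark-submit --class com.nileshgule.MapToDoubleExample \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--driver-cores 2 \
target/learning-spark-1.0.jar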

Once we submit the job, the Spark Submission pane loads at the bottom of the screen and reports the progress of the job execution. Assuming everything goes fine, you should see output saying that the Spark application completed successfully. The output also contains a link to the YARN UI and the detailed job log copied from the cluster to a local directory.

Spark Submission pane

5. Verify job execution

In order to verify that the job was successfully executed on the cluster, we can click the link to the YARN UI. This will bring up a login prompt as shown below.

admin password

Note that this uses SSL port 443. Provide the credentials specified for the admin user when the cluster was provisioned. Hopefully you remember the password that was supplied at the time of creating the cluster. Once the admin user is authenticated, you will be presented with the YARN application container details.
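
If the link from the Spark Submission pane is not at hand, the YARN UI of an HDInsight cluster is typically reachable over HTTPS at a URL of the following form; the exact path can vary with the cluster version, so treat this as an assumption:

https://<your-cluster-name>.azurehdinsight.net/yarnui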

Yarn application container

This page gives various details like the status reported by the application master, a link to the Spark history server via the tracking URL, a link to the logs and much more. You can get more details about the job by navigating to the different links provided on this page.

Conclusion

We saw that it is quite easy to trigger a Spark job from the IntelliJ IDE directly on the HDInsight cluster. I have demonstrated only a few capabilities of the Azure plugin in this post. Feel free to explore the other options. In the next part we will see how to submit Spark jobs from the head node of the cluster using the command-line interface. Hope you found this information useful. Until next time, happy programming.


Step by Step guide to run Apache Spark on HDInsight cluster

Background

Recently I have been experimenting a bit with cloud technologies, including Amazon Web Services (AWS) and Microsoft Azure. I have an MSDN subscription which entitles me to free credits for Azure resources. I was trying to run some big data workloads on Microsoft Azure's offering in the Big Data space, called HDInsight. HDInsight offers multiple types of clusters, including Hadoop, HBase, Storm, Spark, R Server, Kafka and Interactive Hive. As of this writing, the Kafka and Interactive Hive clusters are in preview. I decided to try the Spark cluster, as I am currently exploring different features of Spark. This post covers the different steps required to run Spark jobs on an HDInsight cluster. I will start with provisioning the HDInsight cluster and, in the following posts, show how to execute Spark jobs from within the IntelliJ IDE and from the Azure command line.

Pre-requisites for running the code examples

If you wish to have a look at the source code, clone or download the learning-spark repo from my GitHub repository.
As mentioned in the README.md file of the repo, the following dependencies must be satisfied in order to build the project:

  • Java 8 is installed on your laptop / PC
  • JAVA_HOME environment variable is set correctly
  • Apache Spark 1.6.3 with Hadoop 2.6 is installed
  • SPARK_HOME environment variable is set correctly
  • Maven is installed
  • It is also good to have an IDE. I prefer to use the IntelliJ IDEA Community Edition. You can use other IDEs like Eclipse or Visual Studio Code. As we will see later in the post, IntelliJ or Eclipse gives us the benefit of running code on the Azure cloud directly from the IDE using a plugin. If you prefer to use VS Code, I will also demonstrate how to run Spark using the Azure Command Line Interface (CLI). A quick sanity check of the environment is shown right after this list.
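
To confirm that the environment variables and tools are picked up correctly, a few quick checks from a terminal help before building anything. A minimal sketch, assuming a Unix-like shell; the exact version output depends on what you installed:

echo $JAVA_HOME                            # should point to a Java 8 JDK
echo $SPARK_HOME                           # should point to the Spark 1.6.3 (Hadoop 2.6) installation
mvn -version                               # confirms Maven is on the PATH
$SPARK_HOME/bin/spark-submit --version     # prints the Spark version banner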

Additional dependencies

MovieLens dataset: The MovieLens dataset is a publicly available sample dataset for analytics queries and big data processing.
Azure subscription: Apart from building the project from source code, you also need a Microsoft Azure subscription in order to submit jobs to the remote cluster. I assume all the pre-requisites mentioned here are fulfilled.
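
If you want to fetch the dataset locally, it can be downloaded from the GroupLens site. A minimal sketch, assuming the small ml-latest variant is sufficient; check the repo's README for the exact files the examples expect:

# download and extract the sample MovieLens dataset (target folder name is an assumption)
wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
unzip ml-latest-small.zip -d data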

Build the project

The codebase contains some basic Spark programs like WordCount, along with examples of PairRDD, MapToDouble, caching, etc. There are also programs using the MovieLens dataset which are more suited to running on the cluster. Open the project in your favorite IDE and build it. I prefer to do this from the terminal by running the mvn clean package command in the root of the source code directory. It might take some time to download all the dependencies from various repositories if this is the first time you are running Maven. If all the environment variables are set correctly, you should have a successful build and the package should be created under the target folder.
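
For reference, the build boils down to a single Maven command executed from the root of the repository. A minimal sketch; the repo folder name is an assumption, while the jar name matches the one used in the spark-submit examples below:

cd learning-spark                    # root of the cloned repo (name assumed)
mvn clean package
ls target/learning-spark-1.0.jar     # the packaged jar should appear here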

Run Spark locally

maven build output

We can test the Spark program by running it locally using the command

spark-submit --class com.nileshgule.MapToDoubleExample \
--master local \
--deploy-mode client \
--executor-memory 2g \
--name MapToDouble \
--conf "spark.app.id=MapToDoubleExample" \
target/learning-spark-1.0.jar

Note that we are specifying the master as local in the above execution. We are also limiting the executor memory to 2 GB. Once again, if everything is set up correctly, you should see output similar to the screenshot below

spark output local

Provision HDInsight cluster

There are multiple steps required in order to run Spark on an HDInsight cluster, and each of these steps can take some time. First of all we need to provision the cluster resources. This can be done in two ways: the easiest is to log in to the Azure web portal; the alternative is to use the Azure CLI. As of this writing, Azure CLI 2.0 does not support provisioning an HDInsight cluster, so you would have to use the older version. I had version 2.0 installed, so I was forced to use the web portal method. Refer to the Azure documentation for details on provisioning the different types of HDInsight clusters. I will be using the SSH-based approach to connect to the head node of the cluster. Before we provision the cluster, I need to generate an RSA public key. On my Mac I can generate the key by executing the command
ssh-keygen -t rsa

Provide the file location (a default will be suggested, which you can keep as is) and a passphrase. Remember the passphrase, as it will be required later. With this prerequisite done, log in to the Azure portal with your credentials. From the dashboard navigate to the Add new resource screen and click the Data + Analytics section. HDInsight is the first option on the right. For quick access you can also search for HDInsight directly in the search bar.

search HD Insight

1 - Basic settings

basic config spark version

The first thing we need to ensure is that the cluster name is unique. In my case I am using the MSDN subscription; if you have multiple subscriptions you will need to choose the one to use for billing purposes. Next we need to select the type of cluster, in this case Spark. The version is dependent on the type of cluster; for a Spark cluster we need to select the appropriate Spark version. I chose Spark 1.6.3, as that is the one I am currently experimenting with. You can choose any of the other available versions if you wish.

Next we need to provide credentials for logging into the cluster. We need an admin account and also an sshuser account, which enables us to submit Spark jobs remotely. Provide the admin user password. Make sure to uncheck the Use same password as cluster login checkbox. Instead we will use the public key we generated earlier, either by selecting the file or by pasting its contents. When I created the cluster using the same password as the admin user for the sshuser, I was unable to log in to the head node.

For the SSH authentication type, select the PUBLIC KEY option. You can select the file or paste the contents of the .pub file from the location where you saved it when the RSA key was generated in the earlier step. In my case the file is stored in my home directory at ~/.ssh/id_rsa.pub.
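
To grab the key for pasting into the portal, simply print the contents of the .pub file. A small sketch, assuming the default key location from the ssh-keygen step:

cat ~/.ssh/id_rsa.pub
# on macOS the key can also be copied straight to the clipboard
pbcopy < ~/.ssh/id_rsa.pub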

basic configuration

The penultimate step is to specify the resource group. I chose to reuse an existing one; you can choose an existing resource group or create a new one. The final step is to choose the location or region where the cluster will be created. In my case it was prefilled with East Asia.

2 - Storage

In this step I need to provide the storage details, including the storage account settings and the metastore settings. The metastore settings are optional. I selected an existing storage account named ngstorageaccount and the default container ng-spark-2017-08-18t14-24-10-259z. Ideally this should have a meaningful name; I ended up reusing the default container name Azure created for me the first time. I do not need additional storage accounts or Data Lake Store access, so I leave them blank. For the moment I do not wish to persist the metadata outside of the cluster, although in a production scenario it might be a good idea to do so.

storage options

3 - Summary

The Summary blade provides a summary of the choices made so far. It gives us a last opportunity to customize the settings before Azure takes over and provisions the cluster resources for us.

summary

I will make a slight change to the cluster node configuration by editing the cluster size. The default number of worker nodes is 4; I don't intend to run heavy workloads at the moment, so I reduced it to 2 worker nodes. The default hardware configuration of each worker node is the D4 v2 type; I changed it to a scaled-down D12 v2 type.

resize head node

I did not modify any of the advanced settings, and clicked the Create button after a final review of all the settings. It takes 15 to 20 minutes to provision all the resources of the cluster. With this setup we are ready to run some Spark jobs on the cluster. If everything goes fine you should see an HDInsight cluster created, as shown below

available cluster
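
Once the cluster is up, a quick way to confirm that the sshuser credentials work is to SSH to the head node. A sketch, assuming the standard HDInsight SSH endpoint naming; replace the cluster name with your own and supply the passphrase of the RSA key when prompted:

ssh sshuser@<your-cluster-name>-ssh.azurehdinsight.net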

In part 2 of this series, I will demonstrate how to use the IntelliJ IDE to submit a Spark job to the Azure HDInsight cluster. Part 3 will focus on running Spark jobs using the Azure CLI. Till then, happy programming.
