Configure Standalone Spark on Windows 10

Background

It's been almost two years since I wrote a blog post; hopefully the next ones will be much more frequent. This post is about my experience of setting up Spark as a standalone instance on a Windows 10 64-bit machine. I got back to a bit of programming after a long gap, and it was quite evident that I struggled in configuring the system. Someone else coming from a .NET background and new to the Java way of working might face difficulties similar to the ones that took me a day to work through before getting Spark up and running.

 

What is Spark

Spark is an execution engine which is gaining popularity due to its ability to perform in-memory parallel processing. It claims to be up to 100 times faster than Hadoop MapReduce processing. It also fits well into the distributed-computing paradigm of the big-data world. One of the positives of Spark is that it can be run in standalone mode without having to set up nodes in a cluster, which also means that we do not need a Hadoop cluster to get started with Spark. Spark is written in Scala and supports the Scala, Java, Python and R languages as of writing this post in January 2016. It is currently one of the most popular projects among the tools that make up the Hadoop ecosystem.

 

What is the problem with installing Spark in standalone mode on a Windows machine?

I started by downloading a copy of the Spark 1.5.2 distribution (released November 9, 2015) from the Apache website. I chose the version pre-built for Hadoop 2.6 and later; if you prefer, you can also download the source code and build the whole package yourself. After extracting the contents of the downloaded file, I tried running the spark-shell command from the command prompt. If everything is installed correctly, we should get a Scala shell in which to execute our commands. Unfortunately, on a Windows 10 64-bit machine Spark does not start cleanly. This seems to be a known issue, as there are multiple resources on the internet which talk about it.
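For reference, launching the shell from the extracted distribution looks roughly like this (the folder name below is only an example; use the name of the package you actually downloaded):

    cd C:\spark-1.5.2-bin-hadoop2.6
    bin\spark-shell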

When the spark-shell command is executed, multiple errors are reported on the console. The error I received showed problems with the creation of the SqlContext, accompanied by a long stack trace which was difficult to understand.

Personally, this is one thing I do not like about Java. In my past experience I have always found it very difficult to debug issues, as the reported error is often not the real source of the problem. I wish Java-based tools and applications will become easier to deploy in the future. In one sense it is good that this makes us aware of many of the internals, but on the other hand sometimes you just want to install the stuff and get started with it without spending days configuring it.

I was referring to the Pluralsight course related to Apache Spark fundamentals. The getting-started and installation modules of the course were helpful as a first step in resolving the issue. As suggested in the course, I changed the verbosity of Spark's output from INFO to ERROR, and the amount of information on the console reduced a lot. With this change, I was immediately able to see the error about the missing winutils utility, which is required specifically on Windows systems. This is reported as issue SPARK-2356 in the Spark issue tracker.
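For reference, the verbosity change is made through the log4j configuration in Spark's conf folder; roughly, you copy the template once (from the Spark home directory) and then lower the root logger level. The file names here are from the 1.5.x distribution:

    copy conf\log4j.properties.template conf\log4j.properties

    rem then edit conf\log4j.properties and change
    rem   log4j.rootCategory=INFO, console
    rem to
    rem   log4j.rootCategory=ERROR, console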

After copying the winutils.exe file from the Pluralsight course into the Spark installation's bin folder, I started getting a permissions error for the tmp/hive folder. As recommended in different online posts, I tried changing the permissions using chmod and setting them to 777. This did not fix the issue. I tried running the command with administrative privileges. Still no luck. I updated the PATH environment variable to point to the Spark\bin directory. As suggested, I added the SPARK_HOME and HADOOP_HOME environment variables. Initially I had put the winutils.exe file in the Spark\bin folder; I then moved it out to a dedicated directory named Winutils and updated the HADOOP_HOME environment variable to point to that directory. Still no luck.
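For anyone retracing these steps, the commands involved look roughly like the following. The paths are examples from my setup, so adjust them to your own locations; by convention winutils.exe is expected under a bin sub-folder of whatever HADOOP_HOME points to, and variables set with setx only take effect in a newly opened prompt:

    rem environment variables (example paths)
    setx SPARK_HOME "C:\spark-1.5.2-bin-hadoop2.6"
    setx HADOOP_HOME "C:\Winutils"
    rem append the Spark bin folder to the user PATH
    setx PATH "%PATH%;C:\spark-1.5.2-bin-hadoop2.6\bin"

    rem grant full permissions on the tmp\hive folder (from an administrative prompt)
    C:\Winutils\bin\winutils.exe chmod 777 \tmp\hive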

As many people had experienced the same problem with the latest version, Spark 1.5.2, I thought of trying an older version. Even with 1.5.1 I had the same issue. I went back to the 1.4.2 version released in November 2014, and that seemed to create the SqlContext correctly. But that version is more than a year old, so there was no point sticking to an outdated release.

At this stage I was contemplating the option of getting the source code and building it from scratch. Having read in multiple posts about setting the JAVA_HOME environment variable, I thought of trying that approach first. I downloaded the Java 7 SDK and created the environment variable pointing to the location where the JDK was installed. Even this did not solve the problem.
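In case it helps, JAVA_HOME is just another environment variable pointing at the JDK installation folder; the path below is only an example, so use whatever location your installer chose:

    setx JAVA_HOME "C:\Program Files\Java\jdk1.7.0_79"

As with the other variables, open a new command prompt afterwards so that the change is picked up.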

 

Use the right version of Winutils

As a last option, I decided to download winutils.exe from a different source. The downloaded contents included winutils along with some other DLLs, such as hadoop.dll, as shown in the figure below.

Figure: winutils.exe along with the Hadoop DLLs

After putting these contents into the Winutils directory and running the spark-shell command, everything was in place and the SqlContext was created successfully.
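For clarity, the layout that ended up working for me looked roughly like the sketch below; the directory name is my own choice, and the usual convention is that winutils.exe and the DLLs live in a bin sub-folder of the directory that HADOOP_HOME points to:

    C:\Winutils           <- HADOOP_HOME
        bin\
            winutils.exe
            hadoop.dll
            (the remaining DLLs from the same download)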

I am not really sure which step fixed the issue. Was it installing the JDK and setting the JAVA_HOME environment variable, or was it the updated winutils.exe along with the other DLLs? All this setup was quite time consuming. I hope this is helpful for people trying to set up a standalone instance of Spark on Windows 10 machines.

While I was trying to get Spark up and running, I found the following links, which might be helpful in case you face similar issues:

The last one was really helpful; from it I took the idea of moving winutils.exe into a separate folder, as well as installing the JDK and Scala. Setting the Scala environment variables was not required, though, as I was able to get the Scala prompt without a Scala installation.

Conclusion

Following are the steps I followed for installing a standalone instance of Spark on a Windows 10 64-bit machine:

  • Install the JDK (version 6 or higher)
  • Download the Spark distribution
  • Download the correct version of winutils.exe along with its accompanying DLLs
  • Set environment variables for JAVA_HOME, SPARK_HOME & HADOOP_HOME

Note: When running the chmod command to set 777 permissions on the tmp/hive directory, as shown below, make sure to run the command prompt with administrative privileges.
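A quick way to verify the setup, assuming the environment variables above are in place and a new administrative command prompt has been opened, is:

    %HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
    spark-shell

If everything is configured correctly, the shell should start and report that the SqlContext is available.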


17 comments:

  1. Hi, many thanks for this post. I am still fighting with it. I have Windows 10, but 32-bit, and I downloaded Spark 1.6 built for Hadoop 2.6.
    I guess my problem is that I am not able to find the right winutils, and I am also starting to think about building my own Spark version. If I make some progress I will share it with you.

    Replies
    1. You need to ensure that the chmod command is executed using an administrative command prompt. One of my colleagues also found a "hard" dependency on the location of winutils, or rather tmp/hive: it needs to be on C:\. I do not know if this can be bypassed in some way using configuration settings.

  2. Hi Nilesh,

    Thanks for such a detailed post.

    I've tried these instructions but I'm still struggling to get "spark-shell" to run cleanly. Could you please help me here?


    Here are the details of the folders and env variables created/modified in the process:
    1. Downloaded Spark's prebuilt image (spark-1.6.0-bin-hadoop2.4) and placed it in "C:\", so the structure looks like "C:\spark\bin".
    Created an env variable SPARK_HOME with "C:\spark\" as the value.

    2. Downloaded Winutils from the link mentioned in your post and placed (and extracted) it in "C:\"

    Now, the env variable HADOOP_HOME was pointing to "C:\Winutils\".

    3. Installed Scala and created SCALA_HOME, making it point to Scala's directory.

    4. I already had a working version of Java 8 installed with the required env variables in place.

    I can't seem to figure out where exactly I am going wrong. I'll try going through the Pluralsight course that you've mentioned, but I would appreciate it if you could have a look at this and see if anything seems familiar.

    Thanks in advance.
    Lalit

    Replies
    1. Hi Lalit,
      Apologies for the late reply. Did you try changing the verbosity of the log to ERROR? That was helpful for me in pinpointing the issue with access rights on the tmp/hive folder.
      The other thing which could be helpful is to ensure you are running the chmod commands using an administrative command prompt. Hope this helps.

    2. It is not working for me. I did everything you mentioned:

      JDK -> jdk1.8.0_91
      Downloaded Spark distribution -> spark-1.6.1 with Hadoop 2.6
      Downloaded the correct version of Winutils.exe
      Set environment variables for JAVA_HOME, SPARK_HOME & HADOOP_HOME
      I am using Windows 10, 64-bit

    3. Hi Aamir,
      Did you try running the chmod command in administrative mode:
      winutils.exe chmod 777 \tmp\hive
      Please also try changing the verbosity of the logger, which will help to get more relevant information.
      If these two options don't help to fix the issue, please share the stack trace of the error you are getting after making the two changes.

  3. I don't know how I should thank you! I am totally stunned by your article. You saved me a lot of time. Thanks a million for sharing this article.

  4. I was really missing your posts. This was a great article for getting Spark running.

  5. Hi,
    The alternative link you provided for downloading winutils works for me (Windows 10 64-bit). From what I can see, the winutils on the official links is 32-bit, as I got an error when I tried running the command
    winutils.exe chmod 777 \tmp\hive

    After downloading from your link, that command worked.
    HTH, Garry

  6. Hi, thank you very much for your post. I keep getting a 'The specified path can not be found' error when I try to launch spark-shell. That would suggest that my env. variables are wrongly set, but they seem correct after 20 checks! Any pointers would be appreciated, thanks in advance.

    Replies
    1. You can try running the spark-shell command from the directory where the Spark bin folder is stored on the disk. For the individual environment variables like JAVA_HOME, try opening a command prompt and running the "java" command without any arguments. If the variables are set correctly, you should get the list of options which need to be provided with the java command.
      Hope this helps.

  7. Were you able to get standalone Spark to work with multiple workers? I'm trying version 2.0 on Windows Server and setting SPARK_WORKER_INSTANCES doesn't seem to work.

    Replies
    1. I did not try running multiple worker instances. What is the error you are getting?

  8. The blog gave me the idea to configure standalone Spark. Thanks for sharing it.
    Hadoop Training in Chennai

  9. spark-shell is working fine but pyspark is not working. Can anyone tell me what the problem might be?

    Replies
    1. I have not yet tried the Python integration with Spark. Can you provide some more details about what you have tried so far and maybe the exact error that you are getting?

      My guess is that it might be related to PATH variables.

      Maybe some answers on Stack Overflow might help:
      https://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
      https://stackoverflow.com/questions/38798816/pyspark-command-not-recognised
