Sunday, 22 March 2020

How to install PySpark/Spark on Windows

Hi Friends,

Today I'll show you how to configure PySpark on a Windows machine. PySpark brings together Apache Spark and Python. In other words, it is the Python API for Spark, so you can use the simplicity of Python with the power of Apache Spark. If you want to write Spark applications in Python, you use PySpark. After configuring PySpark you can run your programs in local mode. In this tutorial I'm only going to cover configuring PySpark.

Software you need to run PySpark on your system:

  • Python
  • Java 8 or later (it must be pre-installed on the system; see the check below)
  • Hadoop winutils binary
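
Since Java must already be present, you can quickly confirm it from a command prompt:

E:\> java -version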


Now follow the steps below.

Step 1: You need to install Python if it is not already installed on your system. Download it from the official Python website (https://www.python.org/downloads/) according to your system configuration, i.e. 32-bit or 64-bit.




Step 2: After downloading, install Python. Double-click the downloaded file and select the options you need. Please make sure the pip option is checked, and it is also a good idea to check "Add Python to PATH" so the python command works from the command prompt. Then click Next and Finish.



Step 3: After installing Python, check that it is installed correctly. Open a command prompt, type python and press Enter. If the installation succeeded, you will see something like this:

Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 14 2019, 23:09:19) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
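
Since step 2 relied on the pip option being checked, you can also confirm that pip is available (type exit() first to leave the Python prompt):

E:\> pip --version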


Step 4: Now we need to install PySpark. To install it, open a command prompt, enter the command below and press Enter.

E:\> pip install pyspark

This command will install PySpark on your system. You can watch the progress in the command prompt; wait for the download to finish.


Step 5: You don't need to set any path for this; PySpark is now installed on your system. You can verify it with the command below.

E:\> pyspark --version
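
If the pyspark command is not recognized for some reason, an alternative check (assuming the pip installation above succeeded) is to print the package version directly:

E:\> python -c "import pyspark; print(pyspark.__version__)"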


Step 6: Now you need to download the Hadoop winutils binary and add it to the Windows path. winutils is part of the Hadoop ecosystem and is not included in Spark/PySpark. If you run pyspark now, it will throw the exception below:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Your application may actually run correctly even after this exception is thrown, but it is better to have winutils in place to avoid unnecessary problems. To avoid the error, download the winutils binary; a commonly used source is the steveloughran/winutils repository on GitHub (pick the folder matching your Hadoop version).

After downloading, extract the downloaded file, set the HADOOP_HOME environment variable to the extracted folder, and add its bin folder to the Windows path.
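
For example, assuming you extracted winutils to C:\hadoop (so that winutils.exe sits at C:\hadoop\bin\winutils.exe; the path here is just an example), you can set the variables from a command prompt like this:

E:\> setx HADOOP_HOME "C:\hadoop"
E:\> setx PATH "%PATH%;C:\hadoop\bin"

Open a new command prompt afterwards so the changes take effect. If your PATH is long, prefer editing it through System Properties > Environment Variables instead, because setx truncates values longer than 1024 characters.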

Now all the configuration is complete. Let's start Spark.

Open a command prompt, type pyspark and press Enter.


PySpark is now installed and configured properly on your system. You can write your Spark applications and run them in local mode. While an application (or the pyspark shell) is running, you can open the Spark web UI at http://localhost:4040.
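
As a quick end-to-end check, here is a minimal sketch of a PySpark application that runs in local mode (the file name install_check.py and the sample data are just examples):

# install_check.py - a minimal local-mode PySpark application
from pyspark.sql import SparkSession

# Start a local Spark session using all available CPU cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("InstallCheck") \
    .getOrCreate()

# Build a tiny DataFrame and print it to prove the setup works
df = spark.createDataFrame([(1, "spark"), (2, "python")], ["id", "name"])
df.show()

spark.stop()

Run it with python install_check.py from the command prompt.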



Enjoy :)