
Spark With Python (PySpark)

shaifali agarwal · Jun 26, 2022 · 2 min read

PySpark is the Python API for Apache Spark: a Spark library that lets you run Python applications using Apache Spark's capabilities. With PySpark, applications run in parallel on a distributed cluster (multiple nodes). Apache Spark itself is an analytical processing engine for large-scale distributed data processing and machine learning applications.
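As a quick taste of what that looks like in practice, here is a minimal sketch of a PySpark application (it assumes the installation steps below are already complete):

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point to PySpark.
spark = SparkSession.builder.appName("HelloPySpark").getOrCreate()

# Distribute a small dataset and process it in parallel.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).sum())  # 55
```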

PySpark Installation

Install Jupyter Notebook: `pip install jupyter notebook`

Install Java 8

To run PySpark applications, you need Java 8 or a later version, so download Java from Oracle and install it on your system.


Post-installation, set the JAVA_HOME and PATH variables.
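On Windows this is usually done through the Environment Variables dialog, but as a per-session alternative you can set them from Python; a minimal sketch, using a hypothetical Java install path:

```python
import os

# Hypothetical install location; replace with your actual Java path.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_321"
os.environ["PATH"] = os.environ["JAVA_HOME"] + r"\bin;" + os.environ["PATH"]
```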


Install Apache Spark

Download Apache Spark from the downloads page at spark.apache.org/downloads.html.


After downloading, untar the binary using 7-Zip and copy the extracted folder spark-3.2.1-bin-hadoop2.7 to C:\apps.

Setup winutils.exe

Download the winutils.exe file and copy it to the %SPARK_HOME%\bin folder. winutils.exe is different for each Hadoop version, so download the right version from github.com/steveloughran/winutils.
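Likewise, SPARK_HOME should point at the folder copied to C:\apps, and HADOOP_HOME (the location Hadoop searches for winutils.exe under) can point at the same place, since winutils.exe sits in %SPARK_HOME%\bin. A per-session sketch in Python:

```python
import os

# Spark was extracted to C:\apps in the previous step.
os.environ["SPARK_HOME"] = r"C:\apps\spark-3.2.1-bin-hadoop2.7"
# winutils.exe lives in %SPARK_HOME%\bin, so HADOOP_HOME matches SPARK_HOME.
os.environ["HADOOP_HOME"] = os.environ["SPARK_HOME"]
os.environ["PATH"] = os.environ["SPARK_HOME"] + r"\bin;" + os.environ["PATH"]
```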

PySpark shell

Now open the command prompt and type the pyspark command to run the PySpark shell.
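The shell starts with a SparkSession already available as spark (and a SparkContext as sc), so you can try a quick command right away:

```python
# `spark` (SparkSession) and `sc` (SparkContext) are created by the shell.
spark.range(5).show()
```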


The PySpark shell also creates a Spark context web UI, which by default can be accessed at localhost:4040.


Jupyter Notebook

To write PySpark applications, you need an IDE; there are dozens of IDEs to work with, and I chose Jupyter Notebook.


Now, set the following environment variables.
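Assuming the goal is the standard configuration that makes the pyspark command launch Jupyter as its driver front end, the variables are:

```
PYSPARK_DRIVER_PYTHON=jupyter
PYSPARK_DRIVER_PYTHON_OPTS=notebook
```

Set these as system environment variables, the same way as JAVA_HOME above; with them in place, typing pyspark opens a notebook server instead of the plain shell.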


Now let’s start the Jupyter Notebook.
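In a notebook cell, a SparkSession is the entry point for all DataFrame work; a minimal sketch (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession running locally on all cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("PySparkNotebook")
         .getOrCreate())

print(spark.version)
```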

DataFrame from external data sources

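Spark's DataFrameReader can load CSV, JSON, Parquet, and many other sources directly into a DataFrame. A sketch, using a hypothetical CSV file path:

```python
# The file path is a placeholder; point it at a real CSV on your machine.
df = spark.read.csv(r"C:\data\people.csv", header=True, inferSchema=True)

df.printSchema()
df.show(5)
```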