Accessing PySpark from a Jupyter Notebook

2017-07-04 Jupyter Spark Andrew B. Collier

It’d be great to interact with PySpark from a Jupyter Notebook. This post describes how to get that set up. It assumes that you’ve already got Spark installed.

  1. Install the findspark package.
    $ pip3 install findspark
  2. Make sure that the SPARK_HOME environment variable is defined (see the snippet just after this list).
  3. Launch a Jupyter Notebook.
    $ jupyter notebook
  4. Import the findspark package, call findspark.init() to locate the Spark installation, and then load the pyspark module. See below for a simple example.
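
If SPARK_HOME isn’t already set, you can export it before launching the notebook. The path below is only an illustration; point it at wherever Spark actually lives on your machine.

    $ export SPARK_HOME=/usr/local/spark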
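
And here’s a minimal sketch of the final step. It assumes that SPARK_HOME is set (alternatively, pass the installation path directly to findspark.init()); the application name is arbitrary.

    import findspark
    findspark.init()                      # locate Spark via SPARK_HOME
    # findspark.init("/usr/local/spark")  # or point at the installation explicitly

    import pyspark
    sc = pyspark.SparkContext(appName="jupyter")

    # Quick sanity check: sum the numbers 0 to 9.
    print(sc.parallelize(range(10)).sum())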
