I recently put together a short training course on Spark. One of the initial components of the course involved deploying a Spark cluster on AWS. I wanted to have Jupyter Notebook and RStudio servers available on the master node too and the easiest way to make that happen was to install Docker and then run appropriate images.
There’s already a jupyter/pyspark-notebook image which includes Spark and Jupyter. It’s a simple matter to extend the rocker/verse image (which already includes RStudio server, the tidyverse, devtools and some publishing utilities) to include the sparklyr package.
Flintrock is a tool for launching a Spark cluster on AWS. To get it working initially I needed an IAM (Identity and Access Management) user with the following policies:
AmazonEC2FullAccess and IAMFullAccess. Without these I got errors like
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetInstanceProfile operation: User: arn:aws:iam::690534650866:user/datawookie is not authorized to perform: iam:GetInstanceProfile on resource: instance profile EMR_EC2_DefaultRole
botocore.exceptions.ClientError: An error occurred (UnauthorizedOperation) when calling the DescribeVpcs operation: You are not authorized to perform this operation.
It’d be great to interact with PySpark from a Jupyter Notebook. This post describes how to get that set up. It assumes that you’ve installed Spark like this.
Install the findspark package. $ pip3 install findspark Make sure that the SPARK_HOME environment variable is defined Launch a Jupyter Notebook. $ jupyter notebook Import the findspark package and then use findspark.init() to locate the Spark process and then load the pyspark module.
I’m busy experimenting with Spark. This is what I did to set up a local cluster on my Ubuntu machine. Before you embark on this you should first set up Hadoop.