# PySpark and Jupyter: Quick Local Setup with Docker

— outcastgeek

How to quickly set up a local PySpark node to run with Jupyter using Docker…
Aspiring Data Scientists and Data Analysts looking to get started quickly with PySpark and Jupyter: here is a quick write-up showing how to spin up a local workspace using Docker.
First, make sure you have Docker, docker-machine, and docker-compose installed on your machine.
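If you want to confirm all three tools are on your PATH before starting, a quick check like the following should do (version numbers will vary):

```bash
docker --version
docker-machine --version
docker-compose --version
```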
## Create a new Docker machine
In your terminal, run the following commands:
```bash
cd /to/your/workspace
mkdir learning_pyspark && cd learning_pyspark
mkdir -p code data notebooks
docker-machine create -d virtualbox SciMachine
eval `docker-machine env SciMachine`
```
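The last line is worth a note: `docker-machine env SciMachine` prints the export statements your Docker client needs in order to talk to the new VM, and the backtick eval applies them to the current shell. The output looks roughly like this (IP and paths will differ on your machine):

```bash
docker-machine env SciMachine
# Typical output (values vary):
#   export DOCKER_TLS_VERIFY="1"
#   export DOCKER_HOST="tcp://192.168.99.100:2376"
#   export DOCKER_CERT_PATH="/Users/you/.docker/machine/machines/SciMachine"
#   export DOCKER_MACHINE_NAME="SciMachine"
```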
## Create your Docker configuration files and scripts

In your `learning_pyspark` folder:

```bash
touch Dockerfile
touch docker-compose.yml
```

The content of the Dockerfile and docker-compose.yml is below:
### Dockerfile
```dockerfile
FROM jupyter/pyspark-notebook

MAINTAINER outcastgeek <outcastgeek+docker@gmail.com>

WORKDIR /workspace/notebooks

CMD ["/workspace/start-notebook.sh", "--NotebookApp.base_url=/workspace"]
```

### docker-compose.yml
```yaml
learning_pyspark:
  build: .
  restart: always
  ports:
    - "4040:4040"
    - "8888:8888"
  volumes:
    - .:/workspace
```

Port 8888 exposes the Jupyter notebook server and port 4040 the Spark UI; mounting the project directory at `/workspace` means the notebooks you create inside the container persist in your local `notebooks` folder.
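If you want to sanity-check the file before building anything, `docker-compose config` validates it and echoes back the resolved configuration:

```bash
docker-compose config
```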
## Create your startup script

Still inside your `learning_pyspark` folder:
```bash
touch start-notebook.sh
chmod +x start-notebook.sh
```

Again, the content of `start-notebook.sh` is below (the `chmod` makes the script executable, since the Dockerfile's CMD runs it directly):
### start-notebook.sh
```bash
#!/bin/bash

# Change UID of NB_USER to NB_UID if it does not match
if [ "$NB_UID" != $(id -u $NB_USER) ] ; then
    usermod -u $NB_UID $NB_USER
    chown -R $NB_UID $CONDA_DIR
fi

# Enable sudo if requested
if [ ! -z "$GRANT_SUDO" ]; then
    echo "$NB_USER ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/notebook
fi

# Start the notebook server
exec su $NB_USER -c "env PATH=$PATH jupyter notebook $*"
```
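A note on the environment variables: `NB_USER`, `NB_UID`, and `CONDA_DIR` are set by the `jupyter/pyspark-notebook` base image, so the script can rely on them. `GRANT_SUDO` is opt-in; here is a sketch of how you might pass it through docker-compose if you want sudo inside the container (purely optional, not part of the setup above):

```yaml
# Hypothetical variant of docker-compose.yml: an environment block
# triggers the GRANT_SUDO branch of start-notebook.sh.
learning_pyspark:
  build: .
  environment:
    - GRANT_SUDO=yes
```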
## Run your environment

From within your `learning_pyspark` folder:
Run your container:
```bash
docker-compose up
```

Obtain the IP address of the Docker machine running your container:
```bash
docker-machine ip SciMachine
```
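On a Mac you can fold the two steps together and jump straight to the workspace; a small convenience, assuming the stock `open` command (use `xdg-open` on Linux):

```bash
open "http://$(docker-machine ip SciMachine):8888/workspace"
```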
## Now get to work

Your Jupyter workspace is available here:
`http://${SciMachine IP Address}:8888/workspace`

Create a notebook and run some PySpark workload in it; your Spark UI will then be available here:
`http://${SciMachine IP Address}:4040`
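If you need a first workload to try, here is a minimal sketch (the app name and the sample sentences are just illustrative); creating the `SparkContext` is what brings the Spark UI up on port 4040:

```python
import pyspark

# The jupyter/pyspark-notebook image ships with PySpark on the Python path.
sc = pyspark.SparkContext(appName="learning_pyspark")

# A tiny word count over an in-memory dataset
lines = sc.parallelize([
    "spark makes big data simple",
    "jupyter makes spark interactive",
])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
```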
Feel free to clone https://github.com/outcastgeek/docker_pyspark.git and play around.
Any questions, feedback, or comments?