Configuring PyCharm 2016.2 for PySpark

Today I was setting up my old Mac to do some analytics work using Spark 2.0.0 and PyCharm 2016.2. I had trouble getting even the simplest of PySpark tests to run, because I kept hitting this error...

No module named pyspark.
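
For context, the failing test was nothing exotic. Here's a rough sketch of the kind of script I was trying to run against Spark 2.0.0 (the app name and numbers are just placeholders of mine):

    # Minimal PySpark smoke test. The import below is what fails with
    # "No module named pyspark" until PyCharm knows where Spark lives.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "pycharm-pyspark-test")
    evens = sc.parallelize(range(100)).filter(lambda x: x % 2 == 0)
    print(evens.count())  # expect 50
    sc.stop()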

I figured there was something wrong with my classpath, so I researched how others had solved this issue. Most of the solutions led me to two changes...

  1. Adding SPARK_HOME and PYTHONPATH to the run configuration's environment variables

  2. Adding Spark's python folder and py4j-[version]-src.zip to the project interpreter's classpath

The first of these was easy for me to follow, so I suggest starting with the instructions for that part here. However, I couldn't find proper instructions for adding files to the project interpreter's classpath in the latest PyCharm, so I poked around PyCharm for ten minutes in the dark, looking at my options. The following is what I found; hopefully you find it before wasting any more time.
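
If you just want the short version of that first part: open Run > Edit Configurations... in PyCharm and add two entries to the Environment variables field of your run configuration. Mine ended up looking roughly like this, where [SPARK_HOME] is the absolute path to your local Spark installation (the same placeholder I use below) and the py4j version matches whatever ships with your Spark:

    SPARK_HOME=[SPARK_HOME]
    PYTHONPATH=[SPARK_HOME]/python:[SPARK_HOME]/python/lib/py4j-[version]-src.zip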

The latest PyCharm no longer provides direct access to the project interpreter's classpath. Instead, you'll need to add content roots. Here's what you do; assume [SPARK_HOME] is the absolute path to your local Spark installation.

  1. Go to Preferences > Project: [Project Name] > Project Structure

  2. Click 'Add Content Root' (or press Alt-C on Mac OS X)

  3. Add [SPARK_HOME]/python

  4. Click 'Add Content Root'

  5. Add [SPARK_HOME]/python/lib/py4j-[version]-src.zip

  6. Open a beer, or have a coffee if it's too early ;)
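
To sanity-check that the content roots took effect, open PyCharm's Python console (or just re-run your test) and confirm the import now resolves. The version string will of course be whatever Spark you pointed the roots at; mine reports 2.0.0:

    >>> import pyspark
    >>> pyspark.__version__
    '2.0.0'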

That's it. I hope this helps someone out there.

Mike