Configuring PyCharm 2016.2 for PySpark
Today I was setting up my old Mac to do some analytics work using Spark 2.0.0 and PyCharm 2016.2. I had trouble getting even the simplest of PySpark tests to run because I kept hitting this error...
No module named pyspark.
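For reference, the failing test was about as small as a PySpark program gets, something like the script below (the app name and the numbers are arbitrary; any script that imports pyspark trips the same error):

from pyspark import SparkContext

# Run Spark locally with two threads; the app name is just a label.
sc = SparkContext("local[2]", "pycharm-pyspark-test")
print(sc.parallelize(range(100)).sum())  # should print 4950
sc.stop()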
I figured there was something wrong with my classpath, so I researched how others had solved this issue. Most of the solutions led me to two changes...
Adding SPARK_HOME and PYTHONPATH to the run configuration's environment variables (example values below)
Adding Spark's python folder and py4j-[version]-src.zip to the project interpreter's classpath
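For the first change, the environment variables in my run configuration ended up looking roughly like this. The Spark path is only an example, so substitute the absolute path to your own installation and the real py4j version in place of [version]; I spelled the full path out in both values to be safe.

SPARK_HOME=/usr/local/spark
PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-[version]-src.zip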
The first of these was easy to follow, so I suggest starting with the instructions for that part here. However, I couldn't find proper instructions for adding files to the project interpreter's classpath in the latest PyCharm, so I poked around PyCharm in the dark for 10 minutes looking at my options. The following is what I found; hopefully you find it before wasting any more time.
The latest PyCharm no longer provides direct access to the project interpreter's classpath. Instead, you'll need to add content roots. Here's what you do, assuming [SPARK_HOME] is the absolute path to your local Spark installation.
Go to Preferences > Project: [Project Name] > Project Structure
Click 'Add Content Root' (or alt-c on Mac OS X)
Add [SPARK_HOME]/python
Click 'Add Content Root'
Add [SPARK_HOME]/python/lib/py4j-[version]-src.zip
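With the environment variables and both content roots in place, a tiny smoke test should now run straight from PyCharm without the import error. Here's the kind of script I use to check, written against Spark 2.0's SparkSession API (the app name is arbitrary):

from pyspark.sql import SparkSession

# Build a local session; if the content roots are set up, this import and
# the py4j bridge underneath it both resolve.
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("pycharm-classpath-check") \
    .getOrCreate()
print(spark.range(10).count())  # should print 10
spark.stop()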
Open a beer, or have a coffee if it's too early ;)
That's it. I hope this helps someone out there.