From Brian Wylie <>
Subject PySpark, Structured Streaming and Kafka
Date Wed, 23 Aug 2017 20:41:28 GMT
Hi All,

I'm trying the new hotness of using Kafka and Structured Streaming.

Resources that I've looked at

My setup is a bit weird (yes.. yes.. I know...)
- Eventually I'll just use a Databricks cluster and life will be bliss :)
- But for now I want to test/try stuff out on my little Mac Laptop

The newest version of PySpark will install a local Spark server with a
$ pip install pyspark

This is very nice. I've put together a little notebook using that kewl

So the next step was to set up and use a Kafka message queue, and that went
well and works fine.

$ kafka-console-consumer --bootstrap-server localhost:9092 --topic dns

*I get messages spitting out....*


Okay, finally getting to my question:
- Local spark server (good)
- Local kafka server and messages getting produced (good)
- Trying to run this line of PySpark code (not good)

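# Set up connection to the Kafka stream
# ('spark' is the SparkSession already created in the notebook)
dns_events = spark.readStream.format('kafka')\
  .option('kafka.bootstrap.servers', 'localhost:9092')\
  .option('subscribe', 'dns')\
  .option('startingOffsets', 'latest')\
  .load()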

fails with:
java.lang.ClassNotFoundException: Failed to find data source: kafka. Please
find packages at

I've looked at the URL listed... and poking around I can see that maybe I
need the Kafka jar file as part of my local server.

I lamely tried this:
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0

Exception in thread "main" java.lang.IllegalArgumentException: Missing
application resource. at
at org.apache.spark.launcher.Main.main(
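
The "Missing application resource" message suggests that spark-submit wants
a script to run in addition to the --packages flag. For the notebook case, a
minimal sketch of one alternative is below. It assumes the local pip-installed
PySpark reads the PYSPARK_SUBMIT_ARGS environment variable when the first
session starts, and that the Scala/Spark versions match the ones in my
spark-submit attempt. The two helper functions are hypothetical names made up
for illustration, and I have not verified this against my setup.

```python
import os


def kafka_package(scala_version="2.11", spark_version="2.2.0"):
    """Hypothetical helper: build the Maven coordinate of the Kafka source.

    The defaults mirror the coordinate used in the spark-submit attempt;
    they are assumptions and must match the installed Spark build.
    """
    return "org.apache.spark:spark-sql-kafka-0-10_{}:{}".format(
        scala_version, spark_version
    )


def kafka_submit_args(scala_version="2.11", spark_version="2.2.0"):
    """Hypothetical helper: build the value for PYSPARK_SUBMIT_ARGS.

    The trailing 'pyspark-shell' token is what PySpark expects as the
    application resource when it launches the JVM itself.
    """
    return "--packages {} pyspark-shell".format(
        kafka_package(scala_version, spark_version)
    )


# This has to happen before the first SparkSession/SparkContext is created,
# otherwise the already-running JVM will not pick up the extra package.
os.environ["PYSPARK_SUBMIT_ARGS"] = kafka_submit_args()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

With that in place at the top of the notebook, the SparkSession would be
created as usual and the readStream call retried; the package is resolved
from Maven on first start, so it needs network access.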

Anyway, all my code/versions/etc are in this notebook:

I'd be tremendously appreciative of some super nice, smart person if they
could point me in the right direction :)

-Brian Wylie

