spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brian Wylie <briford.wy...@gmail.com>
Subject PySpark, Structured Streaming and Kafka
Date Wed, 23 Aug 2017 20:41:28 GMT
Hi All,

I'm trying the new hotness of using Kafka and Structured Streaming.

Resources that I've looked at
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streaming-
in-apache-spark.html
- https://spark.apache.org/docs/latest/streaming-custom-receivers.html
- http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Stru
ctured%20Streaming%20using%20Python%20DataFrames%20API.html

My setup is a bit weird (yes.. yes.. I know...)
- Eventually I'll just use a DataBricks cluster and life will be bliss :)
- But for now I want to test/try stuff out on my little Mac Laptop

The newest version of PySpark will install a local Spark server with a
simple:
$ pip install pyspark

This is very nice. I've put together a little notebook using that kewl
feature:
-
https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_to_Spark_Cheesy.ipynb

So the next step is the setup/use a Kafka message queue and that went
well/works fine.

$ kafka-console-consumer --bootstrap-server localhost:9092 --topic dns

*I get messages spitting out....*

{"ts":1503513688.232274,"uid":"CdA64S2Z6Xh555","id.orig_h":"192.168.1.7","id.orig_p":58528,"id.resp_h":"192.168.1.1","id.resp_p":53,"proto":"udp","trans_id":43933,"rtt":0.02226,"query":"brian.wylie.is.awesome.tk","qclass":1,"qclass_name":"C_INTERNET","qtype":1,"qtype_name":"A","rcode":0,"rcode_name":"NOERROR","AA":false,"TC":false,"RD":true,"RA":true,"Z":0,"answers":["17.188.137.55","17.188.142.54","17.188.138.55","17.188.141.184","17.188.129.50","17.188.128.178","17.188.129.178","17.188.141.56"],"TTLs":[25.0,25.0,25.0,25.0,25.0,25.0,25.0,25.0],"rejected":false}


Okay, finally getting to my question:
- Local spark server (good)
- Local kafka server and messages getting produced (good)
- Trying to this line of PySpark code (not good)

# Setup connection to Kafka Stream dns_events =
spark.readStream.format('kafka')\
  .option('kafka.bootstrap.servers', 'localhost:9092')\
  .option('subscribe', 'dns')\
  .option('startingOffsets', 'latest')\
  .load()


fails with:
java.lang.ClassNotFoundException: Failed to find data source: kafka. Please
find packages at http://spark.apache.org/third-party-projects.html

I've looked that the URL listed... and poking around I can see that maybe I
need the kafka jar file as part of my local server.

I lamely tried this:
$ spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0

Exception in thread "main" java.lang.IllegalArgumentException: Missing
application resource. at
org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
at
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:160)
at
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:274)
at
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:151)
at org.apache.spark.launcher.Main.main(Main.java:86)


Anyway, all my code/versions/etc are in this notebook:
-
https://github.com/Kitware/BroThon/blob/master/notebooks/Bro_to_Spark.ipynb

I'd be tremendously appreciative of some super nice, smart person if they
could point me in the right direction :)

-Brian Wylie

Mime
View raw message