spark-user mailing list archives

From Brian Wylie <>
Subject Re: Question about 'Structured Streaming'
Date Tue, 08 Aug 2017 20:30:49 GMT
I can see your point that you don't really want an external process being
used for the streaming data source. Okay, so on the CSV/TSV front, I have
two follow-up questions:

1) Parsing data/schema creation: The Bro IDS logs have an 8-line header that
contains the 'schema' for the data; each log type (http/dns/etc.) has
different columns with different data types. So would I create a specific
CSV reader inherited from the general one? Also, I'm assuming this would
need to be in Scala/Java? (I suck at both of those :)
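For what it's worth, the 8-line header could be pre-parsed outside of Spark to build the schema before handing the rows to a CSV reader. A minimal Python sketch, where the `parse_bro_header` helper and the Bro-to-Spark type mapping are my own guesses rather than anything from Spark or Bro:

```python
# Assumed mapping from Bro column types to Spark SQL type names;
# not exhaustive, unknown types fall back to 'string'.
BRO_TO_SPARK = {
    'time': 'timestamp',
    'string': 'string',
    'addr': 'string',
    'port': 'integer',
    'count': 'long',
    'interval': 'double',
    'bool': 'boolean',
}

def parse_bro_header(lines):
    """Pull the #fields and #types lines out of a Bro log header and
    return a list of (column_name, spark_type) pairs."""
    fields, types = [], []
    for line in lines:
        if not line.startswith('#'):
            break  # header finished; data rows follow
        key, _, rest = line.rstrip('\n').partition('\t')
        if key == '#fields':
            fields = rest.split('\t')
        elif key == '#types':
            types = rest.split('\t')
    return [(f, BRO_TO_SPARK.get(t, 'string'))
            for f, t in zip(fields, types)]
```

The resulting pairs could then be turned into a `StructType` on the Spark side, so the generic CSV reader gets an explicit schema per log type instead of needing a subclassed reader.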

2) Dynamic tailing: Do the CSV/TSV data sources support dynamic tailing
and handle log rotations?

Thanks and BTW your Spark Summit talks are really well done and
informative. You're an excellent speaker.


On Tue, Aug 8, 2017 at 2:09 PM, Michael Armbrust <>

> Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to
> read Bro logs, rather than a Python library.  This is likely to have much
> better performance, since we can do all of the parsing on the JVM without
> having to flow it through an external Python process.
> On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie <>
> wrote:
>> Hi All,
>> I've read the new information about Structured Streaming in Spark; it
>> looks super great.
>> Resources that I've looked at
>> -
>> -
>> in-apache-spark.html
>> -
>> -
>> ctured%20Streaming%20using%20Python%20DataFrames%20API.html
>> + YouTube videos from Spark Summit 2016/2017
>> So finally getting to my question:
>> I have Python code that yields a Python generator... this is a great
>> streaming approach within Python. I've used it for network packet
>> processing and a bunch of other stuff. I'd love to simply hook up this
>> generator (which yields Python dictionaries), along with a schema
>> definition, to create an 'unbounded DataFrame' as discussed in
>> in-apache-spark.html
>> Possible approaches:
>> - Make a custom receiver in Python: https://spark.apache.org/docs/latest/streaming-custom-receivers.html
>> - Use Kafka (this is definitely possible and good but overkill for my use
>> case)
>> - Send data out a socket and use socketTextStream to pull back in (seems
>> a bit silly to me)
>> - Other???
>> Since Python generators fit so naturally into streaming pipelines, I'd
>> think it would be straightforward to 'couple' a Python generator into a
>> Spark structured streaming pipeline.
>> I've put together a small notebook just to give a concrete example
>> (streaming Bro IDS network data)
>> re/BroThon/blob/master/notebooks/Bro_IDS_to_Spark.ipynb
>> Any thoughts/suggestions/pointers are greatly appreciated.
>> -Brian
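One way to get a generator into structured streaming without Kafka is the socket route mentioned in the quoted list above: push each yielded dict over TCP as a JSON line, then read it with Spark's socket source. A rough single-client sketch; the `serve_generator` helper is made up for illustration:

```python
import json
import socket
import threading

def serve_generator(gen, host='127.0.0.1', port=0):
    """Serve each dict yielded by `gen` as one JSON line over TCP so
    a socket-based stream reader can consume it.

    Accepts a single connection, streams the generator to it, then
    closes. Returns the bound (host, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))  # port=0 lets the OS pick a free port
    srv.listen(1)

    def pump():
        conn, _ = srv.accept()
        with conn:
            for record in gen:
                conn.sendall((json.dumps(record) + '\n').encode())
        srv.close()

    threading.Thread(target=pump, daemon=True).start()
    return srv.getsockname()
```

On the Spark side this would pair with something like `spark.readStream.format("socket").option("host", host).option("port", port).load()` plus `from_json` to apply the schema; that part is not shown or tested here.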
