spark-user mailing list archives

From Paul Snively <>
Subject Re: Streaming JSON From S3?
Date Thu, 22 Aug 2013 01:11:15 GMT
Hi Matei!

On Aug 20, 2013, at 11:41 AM, Matei Zaharia wrote:

> Just to understand, are you trying to do a real-time application (which is what the streaming
> in Spark Streaming is for), or just to read an input file into a batch job?

Well, it's an interesting case. I'm trying to take advantage of Spark Streaming's scanning
of sources to automatically process new content, and possibly its sliding window support,
e.g. "do something with every 5 RDDs in the stream." So it's not so much that the requirements
are real time—on the contrary, the processing "in the middle" will be pretty heavyweight—but
rather that streaming offers a couple of desirable ancillary features.
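(For precision: Spark Streaming's windows are defined in time rather than in RDD counts, but since each batch interval yields exactly one RDD, "every 5 RDDs" corresponds to a window spanning 5 batch intervals. A plain-Python sketch of that grouping; the function name and defaults here are illustrative, not Spark API:)

```python
def sliding_windows(batches, window_len=5, slide=5):
    """Group a sequence of per-interval batches into windows of
    window_len consecutive batches, advancing by slide batches each
    time; this mirrors DStream.window(windowDuration, slideDuration)
    when one batch interval produces one RDD."""
    return [batches[i:i + window_len]
            for i in range(0, len(batches) - window_len + 1, slide)]
```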

> For the latter, you can pass an s3n:// URL to any of Spark's file input methods (e.g.
> SparkContext.textFile). The easiest thing is if your JSON is configured to write one object
> per line -- in that case, you'd be able to use textFile to get an RDD of strings (one per
> line), and parse them in a map() function with your favorite JSON parser. If your objects
> span multiple lines, you probably need a custom Hadoop InputFormat class for JSON that you'd
> pass to SparkContext.hadoopFile. There might be formats like that out there. Basically look
> into how one would read these files in a Hadoop job.
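Concretely, I'd sketch the batch path something like this (a PySpark sketch of my own, not from your message; the helper names are hypothetical, and it assumes one JSON object per line with S3 credentials already configured):

```python
import json

def parse_line(line):
    """Parse one line as a JSON object; return None for malformed
    input so a downstream filter can drop it."""
    try:
        return json.loads(line)
    except ValueError:
        return None

def count_records(sc, path):
    """Read line-oriented JSON from path (e.g. an s3n:// URL) and
    count the well-formed records."""
    lines = sc.textFile(path)  # RDD of strings, one per line
    records = lines.map(parse_line).filter(lambda r: r is not None)
    return records.count()
```

With a SparkContext in hand, `count_records(sc, "s3n://my-bucket/events/")` would return the number of parseable records (the bucket path is made up).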
> If you are going for real-time on the other hand, you'd have to use StreamingContext.textFileStream,
> which watches a directory and keeps updating the stream as files are added. You can then call
> a map() on the stream of Strings you get to turn it into a stream of something else.

That's essentially what I expected. When you say "stream of Strings," is each String the entire
contents of a file? If so, that would be perfectly suitable.
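In that case I'd expect the streaming side to look roughly like this (again a PySpark sketch of my own, with hypothetical names; it assumes the StreamingContext already exists and the directory is either a local path or an s3n:// URL):

```python
import json

def to_record(line):
    """Turn one line of the stream into a parsed record, or None
    if the line is not valid JSON."""
    try:
        return json.loads(line)
    except ValueError:
        return None

def build_parsed_stream(ssc, directory):
    """Watch a directory for new files of line-oriented JSON and
    return a DStream of parsed records."""
    lines = ssc.textFileStream(directory)  # DStream of strings
    return lines.map(to_record).filter(lambda r: r is not None)
```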

> Matei

Thanks, and I look forward to meeting you at the boot camp!
