spark-user mailing list archives

From Matei Zaharia
Subject Re: Streaming JSON From S3?
Date Thu, 22 Aug 2013 01:56:48 GMT
Hi Paul,

On Aug 21, 2013, at 6:11 PM, Paul Snively wrote:

>> Just to understand, are you trying to do a real-time application (which is what the
>> streaming in Spark Streaming is for), or just to read an input file into a batch job?
> Well, it's an interesting case. I'm trying to take advantage of Spark Streaming's scanning
> of sources to automatically process new content, and possibly its sliding window support,
> e.g. "do something with every 5 RDDs in the stream." So it's not so much that the requirements
> are real time—on the contrary, the processing "in the middle" will be pretty heavyweight—but
> rather that streaming offers a couple of desirable ancillary features.

Got it; that's fine as a use case for Spark Streaming.
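
To make the "every 5 RDDs" idea concrete, here is a rough plain-Python analogy. It is not Spark code: Spark Streaming defines windows in terms of time over its batches, and the name `sliding_windows` and its `size` and `slide` parameters are made up for this sketch. Each "batch" stands in for one RDD in the stream.

```python
from collections import deque


def sliding_windows(batches, size=5, slide=1):
    """Yield lists of the last `size` batches, advancing `slide` batches at a time.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    window = deque(maxlen=size)  # oldest batch drops out automatically
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        # Emit once the window is full, then again after every `slide` batches.
        if len(window) == size and (i - size) % slide == 0:
            yield list(window)


# slide=1 gives overlapping windows; slide=5 gives one result per 5 batches.
overlapping = list(sliding_windows(range(7), size=5, slide=1))
non_overlapping = list(sliding_windows(range(10), size=5, slide=5))
```

With `slide` equal to `size`, each batch is processed exactly once, which is probably the closest match to "do something with every 5 RDDs" when the work in the middle is heavyweight.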

> That's essentially what I expected. When you say "stream of Strings," is each String
> the entire contents of a file? If so, that would be perfectly suitable.

No, unfortunately each String is one line of text. You'd have to create a Hadoop InputFormat
that returns one record per file if you wanted that. Maybe we should add that as a feature
in Spark by default, because it does seem like a useful way to run it.
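
To show why the distinction matters for JSON, here is a small plain-Python sketch of the two record shapes. It uses neither Spark nor Hadoop, and the function names `records_per_line` and `records_per_file` are invented for illustration. A pretty-printed JSON document only parses when it arrives as a single record.

```python
import json
import os
import tempfile


def records_per_line(paths):
    """Mimic a line-oriented text input: one record per line of each file."""
    for path in paths:
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")


def records_per_file(paths):
    """Mimic a whole-file input: one record per file, holding its full contents."""
    for path in paths:
        with open(path) as f:
            yield f.read()


def demo():
    # A pretty-printed JSON document spans several lines on disk.
    doc = {"id": 1, "tags": ["a", "b"]}
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "doc.json")
        with open(path, "w") as f:
            json.dump(doc, f, indent=2)
        lines = list(records_per_line([path]))
        whole = list(records_per_file([path]))
    return doc, lines, whole


doc, lines, whole = demo()
parsed = json.loads(whole[0])  # the whole-file record parses as one document
```

In the line-oriented case each record is only a fragment such as `{`, so parsing has to wait until the pieces are reassembled. If every JSON document is written on a single line, the per-line records are already usable as they are.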

