spark-user mailing list archives

From Evan Sparks <>
Subject Re: Streaming JSON From S3?
Date Thu, 22 Aug 2013 02:24:39 GMT
You can always use a non-splittable file format (e.g. gzip) and then
a binary input format to get the "file at a time" behavior you're
looking for.
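[Editor's note: to make the suggestion above concrete, a gzip stream must be decompressed from the start, so each .gz file maps to exactly one input split and is processed whole. A minimal plain-Java sketch of reading one such file as a single String, with no Spark or Hadoop dependency; `GzipWhole` and `readWhole` are names invented for this example:]

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipWhole {
    // Read a gzip file and return its entire decompressed contents as one
    // String -- the "file at a time" granularity a non-splittable format
    // gives you, since a gzip stream cannot be split mid-file across tasks.
    public static String readWhole(String path) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        }
    }
}
```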

On Aug 21, 2013, at 9:57 PM, Matei Zaharia <> wrote:

> Hi Paul,
> On Aug 21, 2013, at 6:11 PM, Paul Snively <> wrote:
>>> Just to understand, are you trying to do a real-time application (which is what
>>> the streaming in Spark Streaming is for), or just to read an input file into a batch job?
>> Well, it's an interesting case. I'm trying to take advantage of Spark Streaming's
>> scanning of sources to automatically process new content, and possibly its sliding window
>> support, e.g. "do something with every 5 RDDs in the stream." So it's not so much that the
>> requirements are real time—on the contrary, the processing "in the middle" will be pretty
>> heavyweight—but rather that streaming offers a couple of desirable ancillary features.
> Got it; that's fine as a use case for Spark Streaming.
>> That's essentially what I expected. When you say "stream of Strings," is each String
>> the entire contents of a file? If so, that would be perfectly suitable.
> No, unfortunately each String is one line of text. You'd have to create a Hadoop InputFormat
> that returns one record per file if you wanted that. Maybe we should add that as a feature
> in Spark by default, because it does seem like a useful way to run it.
> Matei
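[Editor's note: a minimal sketch of the record shape Matei describes, in plain Java rather than as an actual Hadoop InputFormat (a real one would supply a RecordReader that emits a single record spanning the whole file and marks files non-splittable). `WholeFileRecords` and `wholeTextFiles` are hypothetical names for this example:]

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class WholeFileRecords {
    // One (path, entire-contents) pair per file, instead of one String per
    // line -- the granularity Paul wants for his streamed JSON documents.
    public static List<String[]> wholeTextFiles(String dir) throws IOException {
        List<String[]> records = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(dir))) {
            for (Path p : files) {
                if (Files.isRegularFile(p)) {
                    String contents =
                        new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                    records.add(new String[] { p.toString(), contents });
                }
            }
        }
        return records;
    }
}
```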
