spark-user mailing list archives

From Michael Allman <>
Subject Re: window every n elements instead of time based
Date Wed, 08 Oct 2014 05:19:03 GMT
Hi Andrew,

The use case I have in mind is batch data serialization to HDFS, where it's desirable to size output files to the HDFS block size. In my particular case, I want to process 10GB batches of data at a time. I'm not sure this is a sensible use case for Spark Streaming, and I was trying to test it. However, I had trouble getting it working, and in the end I decided it was more trouble than it was worth. So I split my task in two: a streaming job that writes small, time-defined batches of data, and a traditional Spark job that aggregates the smaller files into a larger whole. In retrospect, I think this is the right way to go even if a count-based window specification were possible. So I can't really offer my use case as motivation for count-based windows.
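For the compaction step, the main sizing decision is how many output files to write so each one lands near one HDFS block. A minimal sketch of that arithmetic in Scala (the 128 MB default block size and the object/method names here are my own assumptions for illustration, not anything from Spark itself):

```scala
object Compaction {
  // Number of output files needed so each file is close to one HDFS block.
  // 128 MB is a common default block size, but clusters vary, so it is a parameter.
  def numOutputFiles(totalBytes: Long, blockBytes: Long = 128L * 1024 * 1024): Int =
    math.max(1, math.ceil(totalBytes.toDouble / blockBytes).toInt)
}
```

In the aggregation job, the result could then drive something like `coalesce(numOutputFiles(totalBytes))` on the dataset before writing, so the job emits roughly block-sized files rather than many small ones.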



On Oct 5, 2014, at 4:03 PM, Andrew Ash <> wrote:

> Hi Michael,
> I couldn't find anything in Jira for it --
> Could you or Adrian please file a Jira ticket explaining the functionality and maybe
> a proposed API?  This will help people interested in count-based windowing to understand the
> state of the feature in Spark Streaming.
> Thanks!
> Andrew
> On Fri, Oct 3, 2014 at 4:09 PM, Michael Allman <> wrote:
> Hi,
> I also have a use for count-based windowing. I'd like to process data
> batches by size as opposed to time. Is this feature on the development
> roadmap? Is there a JIRA ticket for it?
> Thank you,
> Michael
