spark-user mailing list archives

From Jayant Shekhar <jay...@cloudera.com>
Subject Re: window every n elements instead of time based
Date Wed, 08 Oct 2014 06:13:34 GMT
Hi Michael,

I think you mean the batch interval rather than windowing. Sizing batches
by count can help when you do not want to process very small batches.

The HDFS sink in Flume can roll files based on time, number of events, or
size:
https://flume.apache.org/FlumeUserGuide.html#hdfs-sink

The same could be applied to Spark if the use cases demand it. The main
catch is that it would break Spark Streaming's window operations, which are
defined in terms of time (multiples of the batch interval).
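
To make the idea concrete, here is a rough sketch in plain Python of rolling
batches by event count or byte size instead of by time. It is not Spark or
Flume code; the function name roll_batches and the parameters max_events and
max_bytes are made up for illustration.

```python
from typing import Iterable, Iterator, List


def roll_batches(
    events: Iterable[bytes], max_events: int, max_bytes: int
) -> Iterator[List[bytes]]:
    """Group events into batches, closing a batch as soon as it holds
    max_events events or at least max_bytes bytes (hypothetical limits)."""
    batch: List[bytes] = []
    size = 0
    for event in events:
        batch.append(event)
        size += len(event)
        if len(batch) >= max_events or size >= max_bytes:
            yield batch
            batch, size = [], 0
    if batch:
        # Flush whatever is left when the input ends.
        yield batch


if __name__ == "__main__":
    stream = [b"aa", b"bb", b"cc", b"dd", b"ee"]
    for i, b in enumerate(roll_batches(stream, max_events=2, max_bytes=100)):
        print(i, b)
```

Note that batches here close on data volume, so how long each batch takes to
fill depends on the arrival rate. That is why a time-based window no longer
maps to a fixed number of batches.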

Thanks,
Jayant

On Tue, Oct 7, 2014 at 10:19 PM, Michael Allman <michael@videoamp.com>
wrote:

> Hi Andrew,
>
> The use case I have in mind is batch data serialization to HDFS, where
> sizing files to a certain HDFS block size is desired. In my particular use
> case, I want to process 10GB batches of data at a time. I'm not sure this
> is a sensible use case for spark streaming, and I was trying to test it.
> However, I had trouble getting it working and in the end I decided it was
> more trouble than it was worth. So I decided to split my task into two: one
> streaming job on small, time-defined batches of data, and a traditional
> Spark job aggregating the smaller files into a larger whole. In retrospect,
> I think this is the right way to go, even if a count-based window
> specification were possible. So I can't offer my use case as motivation
> for a count-based window size.
>
> Cheers,
>
> Michael
>
> On Oct 5, 2014, at 4:03 PM, Andrew Ash <andrew@andrewash.com> wrote:
>
> Hi Michael,
>
> I couldn't find anything in Jira for it --
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming
>
> Could you or Adrian please file a Jira ticket explaining the functionality
> and maybe a proposed API?  This will help people interested in count-based
> windowing to understand the state of the feature in Spark Streaming.
>
> Thanks!
> Andrew
>
> On Fri, Oct 3, 2014 at 4:09 PM, Michael Allman <michael@videoamp.com>
> wrote:
>
>> Hi,
>>
>> I also have a use for count-based windowing. I'd like to process data
>> batches by size as opposed to time. Is this feature on the development
>> roadmap? Is there a JIRA ticket for it?
>>
>> Thank you,
>>
>> Michael
>>
