spark-user mailing list archives

From ayan guha <>
Subject Re: Grouping and storing unordered time series data stream to HDFS
Date Fri, 15 May 2015 13:59:10 GMT

Do you have a cut-off time, i.e. a bound on how "late" an event can be? If not, you may
consider a different persistent store like Cassandra/HBase and delegate the
"update" part to them.
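A minimal sketch of the cut-off idea (plain Python, no Spark; the six-hour bound and function name are hypothetical): split each batch into on-time events, which go to HDFS as usual, and late ones, which would be routed to a store that supports updates.

```python
from datetime import datetime, timedelta

CUTOFF = timedelta(hours=6)  # hypothetical lateness bound

def split_late(batch, now):
    """Partition a batch of (event_time, record) pairs into on-time
    events (within the cut-off) and late ones to route elsewhere."""
    on_time, late = [], []
    for event_time, record in batch:
        (on_time if now - event_time <= CUTOFF else late).append((event_time, record))
    return on_time, late

now = datetime(2015, 5, 15, 12, 0)
batch = [(datetime(2015, 5, 15, 9, 0), "x"),   # 3 hours old: on time
         (datetime(2015, 5, 14, 9, 0), "y")]   # 27 hours old: late
on_time, late = split_late(batch, now)
```

Anything landing in `late` could then be written to Cassandra/HBase, whose row-level upserts avoid the append-to-existing-directory problem on HDFS.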

On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati <> wrote:

> Hi all,
> I have a stream of data from Kafka that I want to process and store in
> HDFS using Spark Streaming.
> Each record has a date/time dimension, and I want to write records with the
> same time dimension to the same HDFS directory. The stream might arrive
> unordered (by time dimension).
> I'm wondering what the best practices are for grouping/storing time series
> data streams using Spark Streaming.
> I'm considering grouping each Spark Streaming batch by time dimension and
> then saving each group to a different HDFS directory. However, since data
> with the same time dimension can span multiple batches, I would need to
> handle an "update" in case the HDFS directory already exists.
> Is this a common approach? Are there any other approaches that I can try?
> Thank you!
> Nisrina.
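The per-batch grouping described in the quoted mail can be sketched without Spark (plain Python; the directory layout and function name are hypothetical): group each unordered batch by its date key and append to a per-date directory, creating it if needed, so a date that spans multiple batches is handled by appending rather than overwriting.

```python
import os
import tempfile
from collections import defaultdict

def write_batch_by_date(batch, base_dir):
    """Group one (unordered) batch of (date, record) pairs by date and
    append each group to its own directory, creating it if needed."""
    groups = defaultdict(list)
    for date, record in batch:
        groups[date].append(record)
    for date, records in groups.items():
        out_dir = os.path.join(base_dir, "date=" + date)
        os.makedirs(out_dir, exist_ok=True)  # the "update" case: directory may already exist
        with open(os.path.join(out_dir, "part-0000"), "a") as f:
            for r in records:
                f.write(r + "\n")

# The date 2015-05-15 appears in both batches (out-of-order arrival).
base = tempfile.mkdtemp()
write_batch_by_date([("2015-05-15", "a"), ("2015-05-14", "b")], base)
write_batch_by_date([("2015-05-15", "c")], base)
```

In actual Spark Streaming this grouping would happen inside a foreachRDD output operation; the append step is the part that HDFS makes awkward and that a store like Cassandra/HBase handles natively.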

Best Regards,
Ayan Guha
