spark-user mailing list archives

From Nisrina Luthfiyati <nisrina.luthfiy...@gmail.com>
Subject Grouping and storing unordered time series data stream to HDFS
Date Fri, 15 May 2015 10:10:54 GMT
Hi all,
I have a stream of data from Kafka that I want to process and store in HDFS
using Spark Streaming.
Each record has a date/time dimension, and I want to write records with the
same time dimension to the same HDFS directory. The stream might be
unordered with respect to the time dimension.

What are the best practices for grouping and storing a time series data
stream using Spark Streaming?

I'm considering grouping each batch of data in Spark Streaming by time
dimension and then saving each group to a different HDFS directory. However,
since records with the same time dimension can arrive in different batches,
I would need to handle an "update" when the HDFS directory already exists.
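
The per-batch grouping described above could be sketched roughly as
follows. This is plain Python, not Spark code, and a local directory stands
in for HDFS. The record layout, the hourly bucket, and the names bucket_of,
write_batch and batch_id are all assumptions made for illustration. The idea
is that each batch writes a new, uniquely named file into the bucket
directory, so an existing directory never has to be rewritten.

```python
import os
from collections import defaultdict
from datetime import datetime, timezone


def bucket_of(record):
    # Assumed record layout: {"ts": epoch seconds, "value": ...}.
    # Assumed bucket: one directory per UTC day and hour.
    moment = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return moment.strftime("%Y-%m-%d/%H")


def write_batch(records, batch_id, root):
    """Group one batch by time bucket and add one new file per bucket."""
    groups = defaultdict(list)
    for rec in records:
        groups[bucket_of(rec)].append(rec)

    written = []
    for bucket, recs in groups.items():
        directory = os.path.join(root, bucket)
        # Reuse the directory if an earlier batch already created it.
        os.makedirs(directory, exist_ok=True)
        # The batch id keeps file names unique across batches.
        path = os.path.join(directory, f"batch-{batch_id}.txt")
        with open(path, "w") as out:
            for rec in recs:
                out.write(f"{rec['ts']}\t{rec['value']}\n")
        written.append(path)
    return sorted(written)
```

In Spark Streaming, the batch time that foreachRDD can pass to its function
might serve as batch_id. Late records then simply show up as extra files in
an older directory, at the cost of many small files per directory.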

Is this a common approach? Are there any other approaches that I can try?

Thank you!
Nisrina.
