spark-user mailing list archives

From Helena Edelson <helena.edel...@datastax.com>
Subject Re: Grouping and storing unordered time series data stream to HDFS
Date Sat, 16 May 2015 12:26:50 GMT
Consider using Cassandra with Spark Streaming for time series; Cassandra has been handling
time series workloads for years.
Here are some snippets that use Kafka streaming and write the data to Cassandra and read it back:

https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64

Or write in the stream and read back:
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61

Or, for more detailed reads back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69



A CassandraInputDStream is coming; I'm working on it now.

Helena
@helenaedelson

> On May 15, 2015, at 9:59 AM, ayan guha <guha.ayan@gmail.com> wrote:
> 
> Hi
> 
> Do you have a cut-off time, i.e. how "late" an event can be? If not, you may want to consider
> a different persistent store such as Cassandra/HBase and delegate the "update" part to it.
> 
> On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati <nisrina.luthfiyati@gmail.com> wrote:
> 
> Hi all,
> I have a stream of data from Kafka that I want to process and store in HDFS using Spark
> Streaming.
> Each record has a date/time dimension, and I want to write records with the same time
> dimension to the same HDFS directory. The data stream might be unordered (by time dimension).
> 
> I'm wondering what the best practices are for grouping/storing a time series data stream
> using Spark Streaming.
> 
> I'm considering grouping each batch of data in Spark Streaming per time dimension and
> then saving each group to a different HDFS directory. However, since it is possible for data
> with the same time dimension to be in different batches, I would need to handle "update" in
> case the HDFS directory already exists.
> 
> Is this a common approach? Are there any other approaches that I can try?
> 
> Thank you!
> Nisrina.
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
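
For reference, the approach described in the original question (group each batch by time dimension, one directory per group, with later batches possibly landing in a directory that already exists) can be sketched roughly as below. This is only a minimal illustration, not code from the thread: it uses the local filesystem as a stand-in for HDFS, assumes an hourly time dimension, and the names bucket_of and write_batch are hypothetical. The idea is that each batch adds its own part file to the hour directory instead of rewriting it, so a late record does not require an in-place "update".

```python
import os
import tempfile
from collections import defaultdict
from datetime import datetime, timezone


def bucket_of(event_time: datetime) -> str:
    """Map a timezone-aware event time to an hourly bucket such as 2015/05/15/08.

    The hourly granularity is an assumption made for this sketch.
    """
    return event_time.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")


def write_batch(root: str, batch_id: int, records) -> list:
    """Group one micro-batch by hour and add one new part file per hour directory.

    records is an iterable of (event_time, payload) pairs in any order.
    Files written by earlier batches are never touched, so late records
    simply add another part file to an existing directory.
    """
    groups = defaultdict(list)
    for event_time, payload in records:
        groups[bucket_of(event_time)].append(payload)

    written = []
    for bucket, payloads in sorted(groups.items()):
        directory = os.path.join(root, *bucket.split("/"))
        # The directory may already exist from an earlier batch.
        os.makedirs(directory, exist_ok=True)
        # The batch id makes the file name unique per batch; re-running the
        # same batch overwrites only its own file.
        path = os.path.join(directory, f"part-{batch_id:05d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(payloads) + "\n")
        written.append(path)
    return written


if __name__ == "__main__":
    root = tempfile.mkdtemp()  # local stand-in for an HDFS base path

    def at(hour, minute):
        return datetime(2015, 5, 15, hour, minute, tzinfo=timezone.utc)

    write_batch(root, 0, [(at(8, 5), "a"), (at(9, 1), "b")])
    # The second batch carries a late record for the 08:00 hour.
    write_batch(root, 1, [(at(8, 59), "late"), (at(9, 2), "c")])
    print(sorted(os.listdir(os.path.join(root, "2015", "05", "15", "08"))))
```

A trade-off of this layout is that directories receiving many late batches accumulate many small files, which is one reason a store that handles updates itself, as suggested in the replies, may be preferable.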

