spark-user mailing list archives

From Nisrina Luthfiyati <nisrina.luthfiy...@gmail.com>
Subject Re: Grouping and storing unordered time series data stream to HDFS
Date Sat, 16 May 2015 18:00:22 GMT
Hi Ayan and Helena,

I've considered using Cassandra/HBase but ended up opting to save to HDFS on
the workers because I want to take advantage of data locality, since the data
will often be loaded into Spark for further processing. I was also under the
impression that saving to a filesystem (instead of a db) is the better option
for intermediate data. Definitely going to read up some more and reconsider
due to the time series nature of the data though.

This might be a bit off topic, but in your experience is it common to store
intermediate data in Cassandra when it will be loaded into Spark many times
in the future?

Regarding how late a data point can be, I might be able to set a limit.
Would you know if it's possible to combine RDDs from different intervals in
Spark Streaming? Or would I need to write to file first and then group the
data by time dimension in a separate batch process?
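(From the Spark Streaming programming guide it looks like a sliding window can
regroup RDDs across intervals; a rough sketch, with made-up names and
durations, not code from this thread:)

```scala
import org.apache.spark.streaming.Minutes

// stream: DStream[(String, String)] of (timeDimension, record) pairs
// (hypothetical). window() merges the RDDs of the last 30 minutes into
// one RDD every 5 minutes, so records with the same time dimension that
// arrived in different batches end up grouped together, as long as they
// are no later than the window length.
val regrouped = stream
  .window(Minutes(30), Minutes(5))
  .transform(rdd => rdd.groupByKey())
```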

Thanks in advance!
Nisrina.
 On May 16, 2015 7:26 PM, "Helena Edelson" <helena.edelson@datastax.com>
wrote:

> Consider using Cassandra with Spark Streaming for time series; Cassandra
> has been doing time series for years.
> Here are some snippets with Kafka streaming and writing/reading the data
> back:
>
>
> https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64
>
> or write in the stream, read back
>
> https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61
>
> or more detailed reads back
>
> https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69
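A minimal sketch of the write-in-the-stream pattern those snippets use
(keyspace, table, and the parsing step are placeholders, not the exact
KillrWeather code):

```scala
import com.datastax.spark.connector.streaming._

// kafkaStream: DStream[String] of raw Kafka messages (hypothetical).
// saveToCassandra comes from the spark-cassandra-connector and writes
// each batch's RDD to the given keyspace/table; the fields of the mapped
// case class must match the table's columns.
kafkaStream
  .map(line => WeatherReading.parse(line))   // hypothetical parser
  .saveToCassandra("my_keyspace", "raw_weather_data")
```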
>
>
>
> A CassandraInputDStream is coming; I'm working on it now.
>
> Helena
> @helenaedelson
>
> On May 15, 2015, at 9:59 AM, ayan guha <guha.ayan@gmail.com> wrote:
>
> Hi
>
> Do you have a cut-off time, i.e. how "late" an event can be? If not, you
> may consider a different persistent storage like Cassandra/HBase and
> delegate the "update" part to them.
>
> On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati <
> nisrina.luthfiyati@gmail.com> wrote:
>
>>
>> Hi all,
>> I have a stream of data from Kafka that I want to process and store in
>> HDFS using Spark Streaming.
>> Each record has a date/time dimension and I want to write records within
>> the same time dimension to the same HDFS directory. The data stream might
>> be unordered (by time dimension).
>>
>> I'm wondering what are the best practices in grouping/storing time series
>> data stream using Spark Streaming?
>>
>> I'm considering grouping each batch of data in Spark Streaming by time
>> dimension and then saving each group to a different HDFS directory.
>> However, since it is possible for data with the same time dimension to be
>> in different batches, I would need to handle "update" in case the hdfs
>> directory already exists.
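The per-batch grouping described above might look roughly like this (the
paths, pair layout, and parsing are hypothetical); writing each batch to a
unique per-batch subdirectory under its time dimension sidesteps the
"directory already exists" failure of saveAsTextFile, at the cost of many
small files:

```scala
// stream: DStream[(String, String)] of (timeDimension, record) pairs.
stream.foreachRDD { (rdd, batchTime) =>
  // Find the time dimensions present in this batch.
  val timeDims = rdd.map { case (timeDim, _) => timeDim }.distinct().collect()
  timeDims.foreach { dim =>
    rdd.filter { case (timeDim, _) => timeDim == dim }
       .map { case (_, record) => record }
       // Unique subpath per batch, so a later batch with the same time
       // dimension never collides with an existing output directory.
       .saveAsTextFile(
         s"hdfs:///data/dim=$dim/batch=${batchTime.milliseconds}")
  }
}
```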
>>
>> Is this a common approach? Are there any other approaches that I can try?
>>
>> Thank you!
>> Nisrina.
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>
