spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Reading the most recent text files created by Spark streaming
Date Wed, 14 Sep 2016 17:36:19 GMT
Hi,

An alternative to Spark could be Flume to store data from Kafka to HDFS. It also provides
some reliability mechanisms and has been explicitly designed (and tested) for this kind of import/export.
I am not sure I would go for Spark Streaming if the use case is only storing the data, but I do not have
the full picture of your use case.

Anyway, what you could do is create a directory per hour / day etc. (whatever you need) and put
the corresponding files there. If there are a lot of small files you can pack them into a Hadoop
Archive (HAR) to reduce the load on the NameNode.
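
For example (a sketch only, not from the original thread; hourlyOutputPath is a hypothetical
helper and the yyyy-MM-dd/HH layout is just one possible convention), the job quoted below
could write into an hour directory instead of a flat prices_<millis> directory:

    import java.text.SimpleDateFormat
    import java.util.Date

    // Hypothetical helper: build an hour-partitioned output path such as
    // /data/prices/2016-09-14/15/prices_1473862284010, so that one hour of data
    // can later be read back from a single directory instead of scanning every
    // prices_* directory.
    def hourlyOutputPath(base: String): String = {
      val hourDir = new SimpleDateFormat("yyyy-MM-dd/HH").format(new Date)
      s"$base/$hourDir/prices_${System.currentTimeMillis}"
    }

    // in foreachRDD (names follow the quoted code below):
    // cachedRDD.saveAsTextFile(hourlyOutputPath("/data/prices"))

Reading one hour back is then a single glob, e.g. sc.textFile("/data/prices/2016-09-14/15/*"),
and a whole old day can be packed with the hadoop archive tool, e.g.
hadoop archive -archiveName prices-2016-09-13.har -p /data/prices 2016-09-13 /data/archive
(check the exact options against your Hadoop version).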

Best regards

> On 14 Sep 2016, at 17:28, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Hi,
> 
> I have a Spark Streaming job that reads messages/prices from Kafka and writes them as text
> files to HDFS.
> 
> This is pretty efficient. Its only function is to persist the incoming messages to HDFS.
> 
> This is what it does
>      dstream.foreachRDD { pricesRDD =>
>        val x = pricesRDD.count
>        // check whether this batch contains any messages
>        if (x > 0)
>        {
>          // coalesce the results into a single partition so one file is written
>          val cachedRDD = pricesRDD.repartition(1).cache
>          cachedRDD.saveAsTextFile("/data/prices/prices_" + System.currentTimeMillis.toString)
> ....
> 
> So these are the files on HDFS directory
> 
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862284010
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862288010
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862290010
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862294010
> 
> Now I present these prices in Zeppelin. These files are produced every 2 seconds. However,
> when I get to plot them I am only interested in, say, one hour's worth of data.
> I cater for this by using a filter on the prices (each one has a TIMECREATED).
> 
> I don't think this is efficient, as I don't want to load all these files. I just want to
> read the prices created in the past hour or so.
> 
> One thing I considered was to load all prices by converting System.currentTimeMillis into
> today's date and fetching the most recent ones. However, this is looking cumbersome. I can
> create these files with any timestamp extension when persisting, but System.currentTimeMillis
> seems to be the most efficient.
> 
> Any alternatives you can think of?
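
A minimal sketch of one alternative, assuming the prices_<epochMillis> naming above
(loadLastHour is just an illustrative helper): list the output directories first, keep only
those whose millisecond suffix falls within the last hour, and pass just those paths to
sc.textFile, instead of loading everything and filtering on TIMECREATED.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // Hypothetical helper: load only the prices_<epochMillis> directories
    // written in the last hour, rather than every file under /data/prices.
    def loadLastHour(sc: SparkContext, base: String = "/data/prices") = {
      val cutoff = System.currentTimeMillis - 3600 * 1000L
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val recentDirs = fs.listStatus(new Path(base))
        .map(_.getPath)
        .filter(_.getName.startsWith("prices_"))
        .filter(_.getName.stripPrefix("prices_").toLong >= cutoff)
      // textFile accepts a comma-separated list of paths; guard against an
      // empty list in real code
      sc.textFile(recentDirs.mkString(","))
    }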
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
> or destruction of data or any other property which may arise from relying on this email's
> technical content is explicitly disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>  
