spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: RE : Re: HDFS small file generation problem
Date Sat, 03 Oct 2015 11:56:43 GMT
Yes, with the most recent version of Hive. Alternatively, you can use
Phoenix on top of HBase. I recommend trying both to see which one suits you
best.

On Sat, Oct 3, 2015 at 13:13, nibiau <nibiau@free.fr> wrote:

> Hello,
> Thanks. If I understand correctly, Hive is usable in my context?
>
> Nicolas
>
>
>
>
> Sent from my Samsung mobile device
>
> Jörn Franke <jornfranke@gmail.com> wrote:
>
> If you use transactional tables in Hive together with INSERT, UPDATE, and
> DELETE, Hive does the "concatenate" for you automatically at regular
> intervals. Currently this works only with tables in ORC format (STORED AS
> ORC).
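
A rough sketch of what such a transactional table might look like, written as Python helpers that only build HiveQL strings. The table name product_state, the columns product_id and payload, and the bucket count are hypothetical. The CLUSTERED BY clause and the 'transactional' table property reflect what Hive ACID tables required in releases of that period, so check them against the Hive version in use.

```python
def transactional_table_ddl(table, key_col, value_col, buckets=8):
    """Build a CREATE TABLE statement for a Hive transactional ORC table.

    Every identifier and the bucket count are placeholders for illustration.
    """
    return (
        f"CREATE TABLE {table} ({key_col} STRING, {value_col} STRING) "
        f"CLUSTERED BY ({key_col}) INTO {buckets} BUCKETS "
        "STORED AS ORC "
        "TBLPROPERTIES ('transactional'='true')"
    )


def update_statement(table, key_col, value_col, key, value):
    """Build an UPDATE that overwrites the stored value for one key.

    Literals are inlined without escaping, so values containing quotes or
    backslashes are rejected rather than handled.
    """
    for literal in (key, value):
        if "'" in literal or "\\" in literal:
            raise ValueError("quotes and backslashes are not supported here")
    return (
        f"UPDATE {table} SET {value_col} = '{value}' "
        f"WHERE {key_col} = '{key}'"
    )
```

The update targets the value column only, because the column used for bucketing cannot be updated. Transactions also have to be enabled on the Hive side before these statements are accepted.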
>
> On Sat, Oct 3, 2015 at 11:45, <nibiau@free.fr> wrote:
>
>> Hello,
>> So, is Hive a solution for my need?
>> - I receive small messages (10 KB), each identified by an ID (a product
>> ID, for example)
>> - Each message I receive is the latest snapshot of its product ID, so I
>> basically just want to store the latest snapshot of each product in HDFS
>> in order to run batch processing on it later.
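
The "keep only the most recent message per ID" requirement can be sketched in plain Python, independent of any storage engine. The receive_time field is an assumption: the thread does not say how message order is determined, so an integer timestamp stands in for it here.

```python
from typing import Dict, Iterable, Tuple

# (product_id, receive_time, payload); receive_time is an assumed field.
Message = Tuple[str, int, bytes]


def latest_by_id(messages: Iterable[Message]) -> Dict[str, Message]:
    """Return, for each product ID, the message with the greatest receive_time."""
    latest: Dict[str, Message] = {}
    for msg in messages:
        product_id, received, _payload = msg
        current = latest.get(product_id)
        # Keep the newer message; on a tie the later-seen message wins.
        if current is None or received >= current[1]:
            latest[product_id] = msg
    return latest
```

Applied to one batch of received messages, this leaves one record per product ID, which is the set that would then be written or updated in the store.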
>>
>> If I use Hive, I suppose I have to INSERT and UPDATE records and
>> periodically CONCATENATE.
>> I suppose the records are still updatable after a CONCATENATE.
>>
>> Please confirm whether this can be a solution for my use case, or suggest
>> any other ideas.
>>
>> Thanks a lot !
>> Nicolas
>>
>>
>> ----- Original message -----
>> From: "Jörn Franke" <jornfranke@gmail.com>
>> To: nibiau@free.fr, "Brett Antonides" <bantonid@gmail.com>
>> Cc: user@spark.apache.org
>> Sent: Saturday, October 3, 2015 11:17:51
>> Subject: Re: HDFS small file generation problem
>>
>>
>>
>> You can update data in Hive if you use the ORC format.
>>
>>
>>
>> On Sat, Oct 3, 2015 at 10:42, < nibiau@free.fr > wrote:
>>
>>
>> Hello,
>> In the end, Hive is not a solution, as I cannot update the data.
>> I think archive files would have the same issue.
>> Any other solutions?
>>
>> Nicolas
>>
>> ----- Original message -----
>> From: nibiau@free.fr
>> To: "Brett Antonides" < bantonid@gmail.com >
>> Cc: user@spark.apache.org
>> Sent: Friday, October 2, 2015 18:37:22
>> Subject: Re: HDFS small file generation problem
>>
>> OK, thanks, but can I also update data instead of inserting it?
>>
>> ----- Original message -----
>> From: "Brett Antonides" < bantonid@gmail.com >
>> To: user@spark.apache.org
>> Sent: Friday, October 2, 2015 18:18:18
>> Subject: Re: HDFS small file generation problem
>>
>> I had a very similar problem and solved it with Hive and ORC files using
>> the Spark SQLContext.
>> * Create a table in Hive stored as ORC (I recommend using partitioning
>> too)
>> * Use SQLContext.sql to insert data into the table
>> * Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to
>> merge your many small files into larger files optimized for your HDFS block
>> size
>> * Since the CONCATENATE command operates on files in place, it is
>> transparent to any downstream processing
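
The table creation and compaction steps above might be sketched as follows. The helpers take a run_sql callable, which is assumed to be something like the sql method of a Hive-enabled Spark SQL context; whether a given Spark version passes ALTER TABLE ... CONCATENATE through to Hive should be verified. The column names id and payload and the partition column are hypothetical, and the insert step is left out.

```python
def create_orc_table(run_sql, table, partition_col):
    """Issue the DDL for a partitioned table stored as ORC.

    run_sql is any callable that executes one HiveQL statement.
    The id/payload columns are placeholders for illustration.
    """
    run_sql(
        f"CREATE TABLE IF NOT EXISTS {table} (id STRING, payload STRING) "
        f"PARTITIONED BY ({partition_col} STRING) STORED AS ORC"
    )


def compact_partition(run_sql, table, partition_col, partition_val):
    """Ask Hive to merge the small ORC files of one partition."""
    run_sql(
        f"ALTER TABLE {table} "
        f"PARTITION ({partition_col}='{partition_val}') CONCATENATE"
    )
```

Passing the executor in as a parameter keeps the sketch runnable without a cluster: handing it list.append, for instance, simply records the statements that would be sent.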
>>
>> Cheers,
>> Brett
>>
>> On Fri, Oct 2, 2015 at 3:48 PM, < nibiau@free.fr > wrote:
>>
>>
>> Hello,
>> Yes, but:
>> - I can't find an API in the Java API to create an HDFS archive
>> - As soon as I receive a message (with a messageID), I need to replace the
>> existing file with the new one (the file name being the messageID). Is
>> that possible with an archive?
>>
>> Thanks
>> Nicolas
>>
>> ----- Original message -----
>> From: "Jörn Franke" < jornfranke@gmail.com >
>> To: nibiau@free.fr , "user" < user@spark.apache.org >
>> Sent: Monday, September 28, 2015 23:53:56
>> Subject: Re: HDFS small file generation problem
>>
>>
>>
>>
>>
>> Use a Hadoop archive.
>>
>>
>>
>> On Sun, Sep 27, 2015 at 15:36, < nibiau@free.fr > wrote:
>>
>>
>> Hello,
>> I'm still investigating the small-file problem caused by my Spark
>> Streaming jobs.
>> My Spark Streaming jobs receive a lot of small events (10 KB on
>> average), and I have to store them in HDFS so that Pig jobs can process
>> them on demand.
>> The problem is that this generates a very large number of small files in
>> HDFS (several million), which can be problematic.
>> I looked into using HBase or archive files, but in the end I don't want
>> to go that way.
>> So, what about this solution:
>> - Spark Streaming generates several million small files in HDFS on the
>> fly
>> - Each night I merge them into one big daily file
>> - I launch my Pig jobs on this big file?
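
The nightly merge step could be sketched like this, using the local filesystem as a stand-in for HDFS. It assumes each small file holds self-delimiting records (for example newline-terminated), so that plain concatenation yields a valid daily file; the function name and paths are hypothetical.

```python
import os
import shutil


def merge_small_files(src_dir, dest_path):
    """Concatenate every regular file in src_dir into dest_path, in name order.

    Returns the number of files merged. Source files are left in place.
    """
    src_abs = os.path.abspath(src_dir)
    if os.path.dirname(os.path.abspath(dest_path)) == src_abs:
        # Writing the output next to its inputs would feed it back into
        # the next merge.
        raise ValueError("dest_path must be outside src_dir")
    names = sorted(
        n for n in os.listdir(src_abs)
        if os.path.isfile(os.path.join(src_abs, n))
    )
    with open(dest_path, "wb") as out:
        for name in names:
            with open(os.path.join(src_abs, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return len(names)
```

Deleting the small files after a successful merge is left to the caller, so that a failed run does not lose data.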
>>
>> Another question:
>> - Is it possible to append my events on the fly to a big (daily) file?
>>
>> Thanks a lot
>> Nicolas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>>
