spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: RE : Re: HDFS small file generation problem
Date Sat, 03 Oct 2015 18:30:30 GMT
Hive was not originally designed for updates, because it was purely
warehouse-focused. The most recent version can do updates, deletes, etc. in a
transactional way.
However, you may also use HBase with Phoenix for that, depending on your
other functional and non-functional requirements.
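
For illustration only, the kind of Hive statements this refers to might look like the sketch below. The table name, the columns, and the bucket count are hypothetical and not taken from this thread. It assumes a Hive version with transaction support, where such tables must be stored as ORC, bucketed, and flagged with the `transactional` table property, and where the transaction manager is enabled on the Hive side. The Python helpers only build HiveQL text; the statements would then be submitted through a Hive client.

```python
# Sketch only: table and column names are hypothetical, not from the thread.

def quote(value: str) -> str:
    """Return value as a single-quoted HiveQL string literal (naive escaping)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def create_transactional_table(table: str, buckets: int = 8) -> str:
    """Build DDL for a bucketed ORC table with Hive transactions enabled."""
    return (
        f"CREATE TABLE {table} (product_id STRING, payload STRING) "
        f"CLUSTERED BY (product_id) INTO {buckets} BUCKETS "
        "STORED AS ORC "
        "TBLPROPERTIES ('transactional'='true')"
    )


def update_payload(table: str, product_id: str, payload: str) -> str:
    """Build an UPDATE that overwrites the payload stored for one product ID."""
    return (
        f"UPDATE {table} SET payload = {quote(payload)} "
        f"WHERE product_id = {quote(product_id)}"
    )


def delete_product(table: str, product_id: str) -> str:
    """Build a DELETE that removes the row stored for one product ID."""
    return f"DELETE FROM {table} WHERE product_id = {quote(product_id)}"


if __name__ == "__main__":
    print(create_transactional_table("product_snapshot"))
    print(update_payload("product_snapshot", "p1", "new state"))
    print(delete_product("product_snapshot", "p2"))
```

Note that the sketch updates only `payload`: the bucketing column (`product_id` here) is not a column Hive lets you update.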

On Sat, Oct 3, 2015 at 16:48, <nibiau@free.fr> wrote:

> Thanks a lot. Why did you say "the most recent version"?
>
> ----- Original message -----
> From: "Jörn Franke" <jornfranke@gmail.com>
> To: "nibiau" <nibiau@free.fr>
> Cc: bantonid@gmail.com, user@spark.apache.org
> Sent: Saturday, October 3, 2015 13:56:43
> Subject: Re: RE : Re: HDFS small file generation problem
>
>
>
> Yes, the most recent version, or you can use Phoenix on top of HBase. I
> recommend trying out both to see which one is the most suitable.
>
>
>
> On Sat, Oct 3, 2015 at 13:13, nibiau < nibiau@free.fr > wrote:
>
>
>
>
> Hello,
> Thanks. If I understand correctly, Hive could be usable in my context?
>
>
> Nicolas
>
> Jörn Franke < jornfranke@gmail.com > wrote:
>
>
>
> If you use transactional tables in Hive together with INSERT, UPDATE, and
> DELETE, then it does the "concatenate" for you automatically at regular
> intervals. Currently this works only with tables in ORC format (STORED AS
> ORC).
>
>
>
>
> On Sat, Oct 3, 2015 at 11:45, < nibiau@free.fr > wrote:
>
>
> Hello,
> So, is Hive a solution for my need?
> - I receive small messages (10 KB) identified by an ID (a product ID, for example).
> - Each message I receive is the latest snapshot of my product ID, so I
> basically just want to store the latest product snapshots in HDFS
> in order to run batch processing on them later.
>
> If I use Hive, I suppose I have to INSERT and UPDATE records and
> periodically CONCATENATE.
> After a CONCATENATE, I suppose the records are still updatable.
>
> Please confirm whether this can be a solution for my use case, or suggest any other idea.
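
The requirement described in this message (keep only the latest message per product ID) amounts to a last-write-wins upsert. A minimal Python sketch of that rule, independent of any storage engine; the `(product_id, payload)` tuple layout and the function name are assumptions made for illustration:

```python
from typing import Dict, Iterable, Tuple


def keep_latest(messages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return the last payload seen for each product ID (last write wins)."""
    latest: Dict[str, str] = {}
    for product_id, payload in messages:
        # A later message for the same ID replaces the earlier one.
        latest[product_id] = payload
    return latest


if __name__ == "__main__":
    stream = [("p1", "v1"), ("p2", "v1"), ("p1", "v2")]
    print(keep_latest(stream))
```

In a Hive transactional table the same rule would be expressed as an UPDATE when the ID already exists and an INSERT otherwise.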
>
> Thanks a lot !
> Nicolas
>
>
> ----- Original message -----
> From: "Jörn Franke" < jornfranke@gmail.com >
> To: nibiau@free.fr , "Brett Antonides" < bantonid@gmail.com >
> Cc: user@spark.apache.org
> Sent: Saturday, October 3, 2015 11:17:51
> Subject: Re: HDFS small file generation problem
>
>
>
> You can update data in Hive if you use the ORC format.
>
>
>
> On Sat, Oct 3, 2015 at 10:42, < nibiau@free.fr > wrote:
>
>
> Hello,
> In the end, Hive is not a solution, as I cannot update the data.
> And for archive files I think it would be the same issue.
> Any other solutions?
>
> Nicolas
>
> ----- Original message -----
> From: nibiau@free.fr
> To: "Brett Antonides" < bantonid@gmail.com >
> Cc: user@spark.apache.org
> Sent: Friday, October 2, 2015 18:37:22
> Subject: Re: HDFS small file generation problem
>
> OK, thanks, but can I also update data instead of inserting it?
>
> ----- Original message -----
> From: "Brett Antonides" < bantonid@gmail.com >
> To: user@spark.apache.org
> Sent: Friday, October 2, 2015 18:18:18
> Subject: Re: HDFS small file generation problem
>
>
>
>
>
>
>
>
> I had a very similar problem and solved it with Hive and ORC files using
> the Spark SQLContext.
> * Create a table in Hive stored as an ORC file (I recommend using
> partitioning too)
> * Use SQLContext.sql to Insert data into the table
> * Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to
> merge your many small files into larger files optimized for your HDFS block
> size
> * Since the CONCATENATE command operates on files in place, it is
> transparent to any downstream processing
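
The steps listed in this message could be sketched roughly as follows. This is only an illustration: the table layout, the `day` partition column, and the staging table are hypothetical, and `execute` stands for whatever callable submits one statement (the message describes using `SQLContext.sql` for this). The helpers only build HiveQL text and do no escaping, so they assume trusted table names and partition values.

```python
from typing import Callable, Iterable, List


def create_orc_table(table: str) -> str:
    """DDL for a partitioned table stored as ORC (columns are hypothetical)."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(message_id STRING, body STRING) "
        "PARTITIONED BY (day STRING) STORED AS ORC"
    )


def insert_from_staging(table: str, staging: str, day: str) -> str:
    """Append the rows of a (hypothetical) staging table to one partition."""
    return (
        f"INSERT INTO TABLE {table} PARTITION (day='{day}') "
        f"SELECT message_id, body FROM {staging}"
    )


def concatenate_partition(table: str, day: str) -> str:
    """Ask Hive to merge the small ORC files of one partition."""
    return f"ALTER TABLE {table} PARTITION (day='{day}') CONCATENATE"


def run_all(statements: Iterable[str], execute: Callable[[str], object]) -> List[str]:
    """Submit each statement through `execute` and return what was submitted."""
    submitted = []
    for statement in statements:
        execute(statement)
        submitted.append(statement)
    return submitted


if __name__ == "__main__":
    # `print` stands in for the real executor in this dry run.
    run_all(
        [
            create_orc_table("events"),
            insert_from_staging("events", "events_batch", "2015-10-02"),
            concatenate_partition("events", "2015-10-02"),
        ],
        print,
    )
```

Running the CONCATENATE step per partition, for example once a day on the previous day's partition, keeps each merge small.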
>
> Cheers,
> Brett
>
>
>
>
>
>
>
>
>
> On Fri, Oct 2, 2015 at 3:48 PM, < nibiau@free.fr > wrote:
>
>
> Hello,
> Yes, but:
> - In the Java API I can't find an API to create an HDFS archive.
> - As soon as I receive a message (with a messageID), I need to replace the
> old existing file with the new one (the file name being the messageID). Is that
> possible with an archive?
>
> Thanks
> Nicolas
>
> ----- Original message -----
> From: "Jörn Franke" < jornfranke@gmail.com >
> To: nibiau@free.fr , "user" < user@spark.apache.org >
> Sent: Monday, September 28, 2015 23:53:56
> Subject: Re: HDFS small file generation problem
>
>
>
>
>
> Use Hadoop archives.
>
>
>
> On Sun, Sep 27, 2015 at 15:36, < nibiau@free.fr > wrote:
>
>
> Hello,
> I'm still investigating the small-file problem generated by my
> Spark Streaming jobs.
> My Spark Streaming jobs receive a lot of small events (avg
> 10 KB), and I have to store them in HDFS so that Pig
> jobs can process them on demand.
> The problem is that I generate a lot of small files in HDFS
> (several million), which can be problematic.
> I looked into using HBase or archive files, but in the end I don't want to
> use them.
> So, what about this solution:
> - Spark Streaming generates several million small files on the fly in
> HDFS
> - Each night I merge them into one big daily file
> - I launch my Pig jobs on this big file?
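
The nightly merge step proposed in this message could look roughly like the sketch below. It uses the local filesystem as a stand-in for HDFS, and it assumes each small file holds one event that can be stored as a single line in the daily file; both are assumptions for illustration, as are the function and path names. On a cluster the same logic would go through an HDFS client or a batch job instead of `pathlib`.

```python
from pathlib import Path


def merge_small_files(src_dir: Path, daily_file: Path) -> int:
    """Concatenate every file in src_dir into daily_file; return the file count.

    Files are processed in name order, and a newline is appended to any file
    that does not already end with one, so each event stays on its own line.
    """
    merged = 0
    target = daily_file.resolve()
    with daily_file.open("wb") as out:
        for small in sorted(src_dir.iterdir()):
            # Skip subdirectories, and the output file if it lives in src_dir.
            if not small.is_file() or small.resolve() == target:
                continue
            data = small.read_bytes()
            out.write(data)
            if not data.endswith(b"\n"):
                out.write(b"\n")
            merged += 1
    return merged
```

The sketch leaves the small files in place; deleting them only after the daily file has been written and verified avoids losing events if the merge fails midway.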
>
> Another question I have:
> - Is it possible to append my events on the fly to a big (daily)
> file?
>
> Thanks a lot
> Nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
