spark-user mailing list archives

From Deenar Toraskar <>
Subject Re: HDFS small file generation problem
Date Mon, 28 Sep 2015 05:04:11 GMT
You could try a couple of things:

a) Use Kafka for stream processing: store the current incoming events and the
Spark Streaming job output in Kafka rather than on HDFS, and also dual-write
to HDFS in a micro-batched mode (every x minutes). Kafka is better suited to
processing lots of small events.
b) Coalesce small files on HDFS into one big hourly or daily file. Use HDFS
partitioning to ensure that your pig job reads the least amount of
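
To make (b) concrete, here is a minimal sketch of the compaction step. It
uses the local filesystem as a stand-in for HDFS, and it assumes a
hypothetical naming scheme in which every small file name starts with its
epoch timestamp (for example 1443362400-0001.evt). The function names and
the date=/hour= directory layout are illustrative, not something Spark or
HDFS prescribes.

```python
import os
import shutil
from collections import defaultdict
from datetime import datetime, timezone


def partition_for(epoch_seconds):
    """Return a partition path such as 'date=2015-09-27/hour=14' (UTC)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("date=%Y-%m-%d/hour=%H")


def compact(src_dir, dst_dir):
    """Merge the small files in src_dir into one file per hourly partition.

    Assumption: each file name starts with its epoch timestamp followed by
    a dash (hypothetical scheme). Returns {partition: merged file path}.
    """
    groups = defaultdict(list)
    for name in sorted(os.listdir(src_dir)):
        epoch = int(name.split("-", 1)[0])
        groups[partition_for(epoch)].append(os.path.join(src_dir, name))

    merged = {}
    for partition, paths in groups.items():
        out_dir = os.path.join(dst_dir, *partition.split("/"))
        os.makedirs(out_dir, exist_ok=True)
        out_path = os.path.join(out_dir, "part-00000")
        with open(out_path, "wb") as out:
            for path in paths:
                # Append each small file to the merged file in name order.
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[partition] = out_path
    return merged
```

With a layout like this, a downstream job that only needs one hour can read
a single directory instead of scanning everything.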

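The micro-batched write in (a) could look roughly like the sketch below:
events are held in memory and handed to a flush callback once a size or age
threshold is reached. The class name, the thresholds and the flush callback
are all assumptions for illustration; in a real job the callback would write
one file to HDFS, and the source of events would be Kafka or the streaming
job itself.

```python
import time


class MicroBatchBuffer:
    """Hold small events in memory and flush them as one larger batch."""

    def __init__(self, flush, max_bytes=64 * 1024 * 1024,
                 max_age_seconds=300.0, clock=time.monotonic):
        self._flush = flush            # callable taking the batch as bytes
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so it can be tested
        self._events = []
        self._size = 0
        self._started = 0.0

    def add(self, event):
        """Buffer one event (bytes); flush if a threshold is reached.

        Note: the age check only runs when an event arrives, so an idle
        buffer is not flushed until the next add() or an explicit flush().
        """
        if not self._events:
            self._started = self._clock()
        self._events.append(event)
        self._size += len(event)
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Emit everything buffered so far as a single batch, if any."""
        if self._events:
            self._flush(b"".join(self._events))
            self._events = []
            self._size = 0
```

With 10 KB events and a 64 MB threshold, each flush would carry several
thousand events, so the number of files written drops by the same factor.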

On 27 September 2015 at 14:47, ayan guha <> wrote:

> I would suggest not writing small files to HDFS. Rather, you can hold them
> in memory, maybe off-heap, and then flush them to HDFS using another
> job. Similar to (not sure if Spark
> already has something like it)
> On Sun, Sep 27, 2015 at 11:36 PM, <> wrote:
>> Hello,
>> I'm still investigating the small file problem caused by my
>> Spark Streaming jobs.
>> My Spark Streaming jobs receive a lot of small events (avg
>> 10 KB), and I have to store them in HDFS so that Pig jobs can process
>> them on demand.
>> The problem is that I generate a lot of small files in HDFS
>> (several million), which can be problematic.
>> I looked into using HBase or archive files, but in the end I don't want
>> to go that way.
>> So, what about this solution:
>> - Spark Streaming generates on the fly several million small files in
>> - Each night I merge them into one big daily file
>> - I launch my Pig jobs on this big file?
>> Another question I have:
>> - Is it possible to append to a big (daily) file by adding my events
>> on the fly?
>> Thanks a lot
>> Nicolas
> --
> Best Regards,
> Ayan Guha
