spark-user mailing list archives

Subject HDFS small file generation problem
Date Sun, 27 Sep 2015 13:36:29 GMT
I'm still investigating the small-file problem caused by my Spark Streaming jobs.
These jobs receive a lot of small events (about 10 KB on average), and I have
to store them in HDFS so that Pig jobs can process them on demand.
The problem is that this generates a lot of small files in HDFS (several million), which
can be problematic.
I looked into using HBase or archive files, but in the end I would rather not use either.
So, what about this solution:
- Spark Streaming generates several million small files in HDFS on the fly
- Each night I merge them into one big daily file
- I launch my Pig jobs on this big file?
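
To make the nightly merge step concrete, here is a minimal sketch of the idea. It is only an illustration under stated assumptions: a local directory stands in for HDFS, and the function name `merge_small_files` and the paths are hypothetical. On a real cluster the same logic would go through the Hadoop file system API or a Spark job rather than plain file I/O.

```python
import shutil
import tempfile
from pathlib import Path


def merge_small_files(src_dir: Path, dest_file: Path) -> int:
    """Concatenate every regular file in src_dir into dest_file, in name order.

    Assumption: dest_file lives outside src_dir. Returns the number of files merged.
    """
    parts = sorted(p for p in src_dir.iterdir() if p.is_file())
    # Write to a temporary name first so readers never see a half-written file.
    tmp_file = dest_file.with_name(dest_file.name + ".tmp")
    with tmp_file.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    tmp_file.replace(dest_file)
    # Delete the small files only after the merged file is in place.
    for part in parts:
        part.unlink()
    return len(parts)


if __name__ == "__main__":
    # Hypothetical layout: one directory of small files per day.
    root = Path(tempfile.mkdtemp())
    day_dir = root / "events" / "2015-09-27"
    day_dir.mkdir(parents=True)
    for i in range(3):
        (day_dir / f"part-{i:05d}").write_bytes(f"event {i}\n".encode())
    merged = root / "daily" / "2015-09-27.log"
    merged.parent.mkdir(parents=True)
    print(merge_small_files(day_dir, merged), "files merged into", merged)
```

The sketch removes the small files only once the merged file exists, so an interrupted run leaves the originals untouched and can simply be restarted.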

Another question:
- Is it possible to append my events on the fly to a big (daily) file?
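
To illustrate what appending on the fly could look like, here is a rough sketch that buffers events and appends them to one file per day in batches. It assumes a local directory in place of HDFS, and the class `DailyAppender`, its `batch_size` parameter, and the file naming are all hypothetical. Whether append behaves well on HDFS at this event rate is exactly the open question, so this shows only the shape of the approach.

```python
import tempfile
from pathlib import Path


class DailyAppender:
    """Buffer events in memory and append them to one file per day in batches."""

    def __init__(self, out_dir: Path, batch_size: int = 1000):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.buffers = {}  # day string -> list of pending event bytes

    def add(self, day: str, event: bytes) -> None:
        """Queue one event; write the batch once batch_size events are pending."""
        pending = self.buffers.setdefault(day, [])
        pending.append(event)
        if len(pending) >= self.batch_size:
            self.flush(day)

    def flush(self, day: str) -> None:
        """Append all pending events for the given day, one event per line."""
        pending = self.buffers.pop(day, [])
        if not pending:
            return
        # Mode "ab" appends to the file and creates it if it does not exist.
        with (self.out_dir / f"{day}.log").open("ab") as out:
            for event in pending:
                out.write(event + b"\n")


if __name__ == "__main__":
    appender = DailyAppender(Path(tempfile.mkdtemp()), batch_size=2)
    for payload in (b"event 0", b"event 1", b"event 2"):
        appender.add("2015-09-27", payload)
    appender.flush("2015-09-27")  # write the events still in the buffer
```

Batching keeps the number of append operations far below the number of events, at the cost of losing the buffered events if the process dies before a flush.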

Thanks a lot
