spark-user mailing list archives

From Diego García Valverde <>
Subject RE: Control default partition when load a RDD from HDFS
Date Wed, 17 Dec 2014 16:03:32 GMT
Why isn't it a good option to create an RDD per 200 MB file and then apply the pre-calculations
before merging them? I think the partitioning per RDD should be transparent to the pre-calculations,
rather than fixed just to optimize Spark's map/reduce stages.
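[Editor's note] The shape of this suggestion (one dataset per file, a per-file pre-calculation, then a merge) can be sketched locally; plain Scala collections stand in for RDDs here, and the word-count pre-calculation is a made-up example. In Spark the same shape would use one `sc.textFile` per file and `sc.union` (or `union` on the partial results) to merge:

```scala
// Local simulation of "one RDD per file, pre-calculate, then merge".
// Plain Scala collections stand in for RDDs; preCalc is a hypothetical
// per-file pre-calculation (word counts, for illustration).
object PerFileSketch {
  // Per-file pre-calculation: count words within one file.
  def preCalc(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (w, ws) => (w, ws.size) }

  // Merge the per-file partial results into one global result.
  def merge(partials: Seq[Map[String, Int]]): Map[String, Int] =
    partials.flatten
      .groupBy(_._1)
      .map { case (w, kvs) => (w, kvs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val fileA = Seq("spark hadoop", "spark") // stand-in for one 200 MB file
    val fileB = Seq("hadoop")                // stand-in for another file
    val result = merge(Seq(preCalc(fileA), preCalc(fileB)))
    println(result("spark"))  // 2
    println(result("hadoop")) // 2
  }
}
```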

From: Shuai Zheng []
Sent: Wednesday, December 17, 2014 4:01 PM
To: 'Sun, Rui';
Subject: RE: Control default partition when load a RDD from HDFS

Nice, that is the answer I wanted.

From: Sun, Rui []
Sent: Wednesday, December 17, 2014 1:30 AM
To: Shuai Zheng;
Subject: RE: Control default partition when load a RDD from HDFS

Hi, Shuai,

How did you turn off the file split in Hadoop? I guess you might have implemented a customized
FileInputFormat which overrides isSplitable() to return FALSE. If you do have such a FileInputFormat,
you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.
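[Editor's note] A minimal sketch of this approach, assuming the old `mapred` API and a Spark/Hadoop classpath (the class name and HDFS path are illustrative, and `sc` is an existing SparkContext; not runnable outside a Spark environment):

```scala
// Sketch: a TextInputFormat that refuses to split files, so each HDFS file
// becomes exactly one input split, and hence one Spark partition.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

class WholeFileTextInputFormat extends TextInputFormat {
  // Returning false means a file is never split across mappers/partitions.
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

// Usage (path is a placeholder): each file lands in its own partition,
// so a later mapPartitions sees one whole file at a time.
val rdd = sc.hadoopFile(
  "hdfs:///data/input",
  classOf[WholeFileTextInputFormat],
  classOf[LongWritable],
  classOf[Text])
```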

From: Shuai Zheng []
Sent: Wednesday, December 17, 2014 4:16 AM
Subject: Control default partition when load a RDD from HDFS

Hi All,

My application loads 1000 files, each from 200 MB to a few GB, and combines them with other
data to do calculations.
Some pre-calculation must be done at the file level; after that, the results need to be
combined for further calculation.
In Hadoop this is simple, because I can turn off file splitting in the input format (to enforce
that each file goes to the same mapper), then do the file-level calculation in the mapper and
pass the result to the reducer. But in Spark, how can I do it?
Basically, I want to make sure that after I load these files into an RDD, it is partitioned by
file (no splitting of files and also no merging), so I can call mapPartitions. Is there any way
I can control the default partitioning when I load the RDD?
This might be Spark's default behavior (partitioned by file when the RDD is first loaded), but
I can't find any documentation to support my guess. If it is not, can I enforce this kind of
partitioning? Because the total file size is large, I don't want to repartition in the code.
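[Editor's note] The contract being asked for here — mapPartitions seeing every record of one file at once — can be illustrated locally with plain Scala; the data and the line-count pre-calculation below are made up for illustration:

```scala
// Local illustration of the mapPartitions contract: with one file per
// partition, the per-partition function sees every line of a file at once.
object MapPartitionsSketch {
  // Partitions simulated as a Seq of Seqs, one inner Seq per file.
  def mapPartitions[A, B](partitions: Seq[Seq[A]])(f: Iterator[A] => Iterator[B]): Seq[B] =
    partitions.flatMap(p => f(p.iterator))

  def main(args: Array[String]): Unit = {
    val filePartitions = Seq(Seq("a", "b", "c"), Seq("d", "e")) // two "files"
    // Per-file pre-calculation: line count of each file.
    val counts = mapPartitions(filePartitions)(lines => Iterator(lines.size))
    println(counts) // List(3, 2)
  }
}
```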



