spark-user mailing list archives

From Richard Marscher <>
Subject Re: Repartition question
Date Tue, 04 Aug 2015 17:46:50 GMT

It is possible to control the number of partitions for the RDD without
calling repartition by setting the max split size for the Hadoop input
format used. Tracing through the code, XmlInputFormat extends
FileInputFormat, which determines the number of splits (which NewHadoopRDD
uses to determine the number of partitions) with a few configs:

public static final String SPLIT_MAXSIZE =
    "mapreduce.input.fileinputformat.split.maxsize";
public static final String SPLIT_MINSIZE =
    "mapreduce.input.fileinputformat.split.minsize";
If you are setting SparkConf fields, prefix the keys with spark.hadoop.
and they will be copied onto the Hadoop Configuration used for the above
values.
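A minimal sketch of that approach (the 16 MB cap and the app name are
illustrative values, not anything prescribed by the book's example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Keys prefixed with "spark.hadoop." are copied onto the Hadoop
// Configuration that NewHadoopRDD consults when computing input splits.
val conf = new SparkConf()
  .setAppName("wiki-parse")
  // Cap each split at 16 MB so a ~100 MB file yields roughly 7
  // partitions instead of the 4 you get with ~32 MB splits.
  .set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize",
       (16 * 1024 * 1024).toString)

val sc = new SparkContext(conf)
```

Equivalently, you can set the key without the prefix directly on
sc.hadoopConfiguration before creating the RDD.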

On Tue, Aug 4, 2015 at 12:31 AM, Naveen Madhire <>

> Hi All,
> I am running the WikiPedia parsing example present in the "Advanced
> Analytics with Spark" book.
> The partitions of the RDD returned by the readFile function (mentioned
> above) is of 32MB size. So if my file size is 100 MB, RDD is getting
> created with 4 partitions with approx 32MB  size.
> I am running this in standalone Spark cluster mode; everything is
> working fine, I am only a little confused about the number of partitions
> and their size.
> I want to increase the number of partitions for the RDD to make use of
> the cluster. Is calling repartition() after this the only option, or can
> I pass something to the above method to get more partitions in the RDD?
> Please let me know.
> Thanks.

*Richard Marscher*
Software Engineer
Localytics
