spark-user mailing list archives

From Akshay Mendole <akshaymend...@gmail.com>
Subject mapreduce.input.fileinputformat.split.maxsize not working for spark 2.4.0
Date Sun, 24 Feb 2019 18:57:48 GMT
Hi,
   We have dfs.blocksize configured to 512 MB, and we have some large
files in HDFS that we want to process with a Spark application. We want to
split these files into more input splits to optimise memory usage, but the
parameters mentioned in the subject are not taking effect.
The max and min split size parameters below are both set to 50 MB, yet a
file as large as 500 MB is still read as a single split, whereas we expect
it to be split into at least 10 input splits.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName(jobName);
SparkContext sparkContext = new SparkContext(conf);

// Cap and floor the input split size at 50 MB (50,000,000 bytes)
sparkContext.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.split.maxsize", "50000000");
sparkContext.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.split.minsize", "50000000");

JavaSparkContext sc = new JavaSparkContext(sparkContext);

// Register the LZO codec for the compressed input files
sc.hadoopConfiguration().set(
        "io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");


Could you please suggest what could be wrong with my configuration?

Thanks,
Akshay
