Maybe you need to set the parameters for the mapreduce API and not the mapred API. I don't remember offhand how they differ, but the Hadoop web page should tell you ;-)
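
Untested sketch of what I mean, assuming Hadoop 2.x and the new-API property names (the 1 GB split size below is just an example value):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
long targetSplitSize = 1024L * 1024 * 1024; // 1 GB per split, pick what you need

// Set the mapreduce.* (new API) split sizes on the Hadoop configuration Spark hands to the input format
jsc.hadoopConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", targetSplitSize);
jsc.hadoopConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", targetSplitSize);

JavaRDD<String> lines = jsc.textFile(hdfsPath);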

On 10. Oct 2017, at 17:53, Kanagha Kumar <kprasad@salesforce.com> wrote:

Thanks for the inputs!!

I passed in spark.mapred.max.split.size and spark.mapred.min.split.size set to the split size I wanted, but it had no effect.
I also tried passing in spark.dfs.block.size, with all the params set to the same value.

JavaSparkContext.fromSparkContext(spark.sparkContext()).textFile(hdfsPath, 13);

Is there any other param that needs to be set as well?

Thanks

On Tue, Oct 10, 2017 at 4:32 AM, ayan guha <guha.ayan@gmail.com> wrote:
I have not tested this, but you should be able to pass any MapReduce-like conf through to the underlying Hadoop config. Essentially, you should be able to control split behaviour the same way you would in a MapReduce program, since Spark uses the same input format.
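
For example (untested), any property with the spark.hadoop. prefix gets copied, minus the prefix, into the Hadoop Configuration, so something like this should reach the input format (1 GB is just an example value):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    // forwarded to the Hadoop Configuration as mapreduce.input.fileinputformat.split.minsize / maxsize
    .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", "1073741824")
    .set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", "1073741824");

The same keys can also be passed with --conf on spark-submit.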

On Tue, Oct 10, 2017 at 10:21 PM, Jörn Franke <jornfranke@gmail.com> wrote:
Write your own input format/data source, or split the file yourself beforehand (the latter is not recommended).
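
If you do go the custom input format route, a minimal untested sketch (the class name is made up and 4 GB is just an example value) would be to subclass the new-API TextInputFormat, force the split size, and read it with newAPIHadoopFile:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LargeSplitTextInputFormat extends TextInputFormat {
    // Replace the default max(minSize, min(maxSize, blockSize)) with a fixed split size
    @Override
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return 4L * 1024 * 1024 * 1024; // ~4 GB per split
    }
}

// Usage:
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaPairRDD<LongWritable, Text> lines = jsc.newAPIHadoopFile(
    hdfsPath, LargeSplitTextInputFormat.class, LongWritable.class, Text.class,
    jsc.hadoopConfiguration());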

> On 10. Oct 2017, at 09:14, Kanagha Kumar <kprasad@salesforce.com> wrote:
>
> Hi,
>
> I'm trying to read a 60 GB HDFS file using Spark's textFile("hdfs_file_path", minPartitions).
>
> How can I control the number of tasks by increasing the split size? With the default split size of 250 MB, several tasks are created. I would like to have a specific number of tasks created while reading from HDFS itself, instead of using repartition() etc.
>
> Any suggestions are helpful!
>
> Thanks
>





--
Best Regards,
Ayan Guha


