spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Reading from HDFS by increasing split size
Date Tue, 10 Oct 2017 20:16:24 GMT
Maybe you need to set the parameters for the mapreduce API and not the mapred API. I don't remember offhand how they differ, but the Hadoop documentation should tell you ;-)
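
Untested sketch of what I mean, using the Java API from your snippet below; the property names are from memory and the 5 GB target is only illustrative (60 GB / 13 tasks is roughly 4.6 GB per split), so please double-check against the Hadoop docs:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LargeSplitTextFile {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("large-split-textfile"));

    // textFile() builds its splits with the Hadoop FileInputFormat; raising the minimum
    // split size merges several HDFS blocks into one split, i.e. one task.
    long targetSplitSize = 5L * 1024 * 1024 * 1024; // ~5 GB, illustrative only
    jsc.hadoopConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", targetSplitSize);
    jsc.hadoopConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", targetSplitSize);

    // "hdfs_file_path" is the placeholder path from your mail
    JavaRDD<String> lines = jsc.textFile("hdfs_file_path", 13);
    System.out.println("partitions: " + lines.getNumPartitions());

    jsc.stop();
  }
}

As far as I know the spark.mapred.* / spark.dfs.* names you set are not forwarded to the Hadoop Configuration at all (only spark.hadoop.* properties are), which would explain why they had no effect.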

> On 10. Oct 2017, at 17:53, Kanagha Kumar <kprasad@salesforce.com> wrote:
> 
> Thanks for the inputs!!
> 
> I passed in spark.mapred.max.split.size and spark.mapred.min.split.size set to the size I wanted to read, but it didn't take effect.
> I also tried passing in spark.dfs.block.size, with all the params set to the same value.
> 
> JavaSparkContext.fromSparkContext(spark.sparkContext()).textFile(hdfsPath, 13);
> 
> Is there any other param that needs to be set as well?
> 
> Thanks
> 
>> On Tue, Oct 10, 2017 at 4:32 AM, ayan guha <guha.ayan@gmail.com> wrote:
>> I have not tested this, but you should be able to pass any map-reduce-like conf to the underlying Hadoop config... essentially you should be able to control split behaviour just as you would in a map-reduce program (as Spark uses the same input format).
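>> For instance (untested), anything prefixed with spark.hadoop. in the SparkConf should be copied into the Hadoop Configuration that the input format sees, e.g.:
>> 
>> SparkConf conf = new SparkConf()
>>     // forwarded to Hadoop as mapreduce.input.fileinputformat.split.minsize
>>     .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", "5368709120"); // ~5 GB, illustrative
>> JavaSparkContext jsc = new JavaSparkContext(conf);
>> JavaRDD<String> lines = jsc.textFile("hdfs_file_path");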
>> 
>>> On Tue, Oct 10, 2017 at 10:21 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>> Write your own input format/datasource or split the file yourself beforehand (not recommended).
>>> 
>>> > On 10. Oct 2017, at 09:14, Kanagha Kumar <kprasad@salesforce.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I'm trying to read a 60GB HDFS file using Spark's textFile("hdfs_file_path", minPartitions).
>>> >
>>> > How can I control the number of tasks by increasing the split size? With the default split size of 250 MB, many tasks are created. But I would like a specific number of tasks created while reading from HDFS itself, instead of using repartition() etc.
>>> >
>>> > Any suggestions are helpful!
>>> >
>>> > Thanks
>>> >
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> 
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> Ayan Guha
> 
> 
> 
> -- 
> 
> 
