spark-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: 2 input paths generate 3 partitions
Date Fri, 27 Mar 2015 23:48:55 GMT
The files sound too small to be 2 blocks in HDFS.
Did you set defaultParallelism to 3 in your Spark configuration?
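(For reference: if I remember the source correctly, when you don't pass minPartitions to sc.textFile, Spark uses defaultMinPartitions = math.min(defaultParallelism, 2). A plain-Scala sketch of that rule, no Spark needed:)

```scala
// Sketch of SparkContext.defaultMinPartitions, which sc.textFile
// falls back to when no explicit minPartitions argument is given.
object DefaultMinPartitions {
  def defaultMinPartitions(defaultParallelism: Int): Int =
    math.min(defaultParallelism, 2)

  def main(args: Array[String]): Unit = {
    // With a typical local[*] parallelism the hint is capped at 2,
    // so 3 partitions cannot come from this default alone.
    println(defaultMinPartitions(8)) // 2
    println(defaultMinPartitions(1)) // 1
  }
}
```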
Yong

Subject: Re: 2 input paths generate 3 partitions
From: zzhang@hortonworks.com
To: rvernica@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +0000

Hi Rares,

The number of partitions is controlled by the HDFS input format, and one file may yield multiple
partitions if it consists of multiple blocks. In your case, I think one of the files was read as 2
splits.
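(As a rough illustration in plain Scala, with hypothetical byte sizes, and ignoring the minSize and split-slop details of the real Hadoop FileInputFormat: splits never cross file boundaries, and each file is carved into pieces of roughly goalSize = totalSize / numSplits bytes, so a file larger than goalSize gets more than one split:)

```scala
// Simplified sketch of how FileInputFormat arrives at a split count.
object SplitSketch {
  // Splits never cross file boundaries, so each file is split independently.
  def splitsPerFile(fileSize: Long, goalSize: Long): Int =
    math.max(1, math.ceil(fileSize.toDouble / goalSize).toInt)

  def totalSplits(fileSizes: Seq[Long], numSplitsHint: Int): Int = {
    val goalSize = fileSizes.sum / numSplitsHint // target bytes per split
    fileSizes.map(splitsPerFile(_, goalSize)).sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sizes: part-00000 = 40 bytes, part-00001 = 20 bytes.
    // goalSize = 60 / 2 = 30, so part-00000 becomes 2 splits and
    // part-00001 becomes 1: 3 partitions from 2 files.
    println(totalSplits(Seq(40L, 20L), numSplitsHint = 2)) // 3
  }
}
```

With these made-up sizes and a hint of 2, goalSize is 30 bytes, the larger file is cut in two, and the total comes to 3, which would match what Rares sees below.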

Thanks.

Zhan Zhang

On Mar 27, 2015, at 3:12 PM, Rares Vernica <rvernica@gmail.com> wrote:

Hello,

I am using the Spark shell in Scala on localhost. I am using
sc.textFile to read a directory. The directory looks like this (generated by another Spark
script):

part-00000
part-00001
_SUCCESS

part-00000 has four short lines of text, while
part-00001 has two. The
_SUCCESS file is empty. When I check the number of partitions on the RDD I get:

scala> foo.partitions.length
15/03/27 14:57:31 INFO FileInputFormat: Total input paths to process : 2
res68: Int = 3

I wonder why the two input files generate three partitions. Does Spark check the number
of lines in each file and try to generate three balanced partitions?

Thanks!
Rares
