spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alessandro Lulli <lu...@di.unipi.it>
Subject Re: RDD Partition number
Date Fri, 20 Feb 2015 11:29:25 GMT
Hi All,

Thanks for your answers.
I have one more details to point out.

It is clear now how partition number is defined for HDFS file,

However, if i have my dataset replicated on all the machines in the same
absolute path.
In this case each machine has for instance ext3 filesystem.

If i load the file in a RDD how many partitions are defined in this case
and why?

I found that Spark define a number, say K, of partitions. If i force the
partition to be <=K my parameter is ignored.
If a set a value K*>=K then Spark set K* partitions.

Thanks for your help
Alessandro


On Thu, Feb 19, 2015 at 6:27 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> bq. *blocks being 64MB by default in HDFS*
>
>
> *In hadoop 2.1+, default block size has been increased.*
> See https://issues.apache.org/jira/browse/HDFS-4053
>
> Cheers
>
> On Thu, Feb 19, 2015 at 8:32 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> What file system are you using ?
>>
>> If you use hdfs, the documentation you cited is pretty clear on how
>> partitions are determined.
>>
>> bq. file X replicated on 4 machines
>>
>> I don't think replication factor plays a role w.r.t. partitions.
>>
>> On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli <lulli@di.unipi.it>
>> wrote:
>>
>>> Hi All,
>>>
>>> Could you please help me understanding how Spark defines the number of
>>> partitions of the RDDs if not specified?
>>>
>>> I found the following in the documentation for file loaded from HDFS:
>>> *The textFile method also takes an optional second argument for
>>> controlling the number of partitions of the file. By default, Spark creates
>>> one partition for each block of the file (blocks being 64MB by default in
>>> HDFS), but you can also ask for a higher number of partitions by passing a
>>> larger value. Note that you cannot have fewer partitions than blocks*
>>>
>>> What is the rule for file loaded from the file systems?
>>> For instance, i have a file X replicated on 4 machines. If i load the
>>> file X in a RDD how many partitions are defined and why?
>>>
>>> Thanks for your help on this
>>> Alessandro
>>>
>>
>>
>

Mime
View raw message