spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Davidson <ilike...@gmail.com>
Subject Re: input split size
Date Sun, 19 Oct 2014 03:39:33 GMT
The "minPartitions" argument of textFile/hadoopFile cannot decrease the
number of splits past the physical number of blocks/files. So if you have 3
HDFS blocks, asking for 2 minPartitions will still give you 3 partitions
(hence the "min"). It can, however, convert a file with fewer HDFS blocks
into more (so you could ask for and get 4 partitions), assuming the blocks
are "splittable". HDFS blocks are usually splittable, but if it's
compressed with something like bzip2, it would not be.

If you wish to combine splits from a larger file, you can use RDD#coalesce.
With shuffle=false, this will simply concatenate partitions, but it does
not provide any ordering guarantees (it uses an algorithm which attempts to
coalesce co-located partitions, to maintain locality information).

coalesce() with shuffle=true causes all of the elements will be shuffled
around randomly into new partitions, which is an expensive operation but
guarantees uniformity of data distribution.

On Sat, Oct 18, 2014 at 10:47 AM, Mayur Rustagi <mayur.rustagi@gmail.com>
wrote:

> Does it retain the order if its pulling from the hdfs blocks, meaning
> if  file1 => a, b, c partition in order
> if I convert to 2 partition read will it map to ab, c or a, bc or it can
> also be a, cb ?
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Sat, Oct 18, 2014 at 9:09 AM, Ilya Ganelin <ilganeli@gmail.com> wrote:
>
>> Also - if you're doing a text file read you can pass the number of
>> resulting partitions as the second argument.
>> On Oct 17, 2014 9:05 PM, "Larry Liu" <larryliu05@gmail.com> wrote:
>>
>>> Thanks, Andrew. What about reading out of local?
>>>
>>> On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash <andrew@andrewash.com>
>>> wrote:
>>>
>>>> When reading out of HDFS it's the HDFS block size.
>>>>
>>>> On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu <larryliu05@gmail.com>
>>>> wrote:
>>>>
>>>>> What is the default input split size? How to change it?
>>>>>
>>>>
>>>>
>>>
>

Mime
View raw message