spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: input split size
Date Sun, 19 Oct 2014 04:56:45 GMT
Side note: I thought bzip2 was splittable. Perhaps you meant gzip?

2014년 10월 18일 토요일, Aaron Davidson<ilikerps@gmail.com>님이 작성한 메시지:

> The "minPartitions" argument of textFile/hadoopFile cannot decrease the
> number of splits past the physical number of blocks/files. So if you have 3
> HDFS blocks, asking for 2 minPartitions will still give you 3 partitions
> (hence the "min"). It can, however, convert a file with fewer HDFS blocks
> into more (so you could ask for and get 4 partitions), assuming the blocks
> are "splittable". HDFS blocks are usually splittable, but if it's
> compressed with something like bzip2, it would not be.
>
> If you wish to combine splits from a larger file, you can use
> RDD#coalesce. With shuffle=false, this will simply concatenate partitions,
> but it does not provide any ordering guarantees (it uses an algorithm which
> attempts to coalesce co-located partitions, to maintain locality
> information).
>
> coalesce() with shuffle=true causes all of the elements will be shuffled
> around randomly into new partitions, which is an expensive operation but
> guarantees uniformity of data distribution.
>
> On Sat, Oct 18, 2014 at 10:47 AM, Mayur Rustagi <mayur.rustagi@gmail.com
> <javascript:_e(%7B%7D,'cvml','mayur.rustagi@gmail.com');>> wrote:
>
>> Does it retain the order if its pulling from the hdfs blocks, meaning
>> if  file1 => a, b, c partition in order
>> if I convert to 2 partition read will it map to ab, c or a, bc or it can
>> also be a, cb ?
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> On Sat, Oct 18, 2014 at 9:09 AM, Ilya Ganelin <ilganeli@gmail.com
>> <javascript:_e(%7B%7D,'cvml','ilganeli@gmail.com');>> wrote:
>>
>>> Also - if you're doing a text file read you can pass the number of
>>> resulting partitions as the second argument.
>>> On Oct 17, 2014 9:05 PM, "Larry Liu" <larryliu05@gmail.com
>>> <javascript:_e(%7B%7D,'cvml','larryliu05@gmail.com');>> wrote:
>>>
>>>> Thanks, Andrew. What about reading out of local?
>>>>
>>>> On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash <andrew@andrewash.com
>>>> <javascript:_e(%7B%7D,'cvml','andrew@andrewash.com');>> wrote:
>>>>
>>>>> When reading out of HDFS it's the HDFS block size.
>>>>>
>>>>> On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu <larryliu05@gmail.com
>>>>> <javascript:_e(%7B%7D,'cvml','larryliu05@gmail.com');>> wrote:
>>>>>
>>>>>> What is the default input split size? How to change it?
>>>>>>
>>>>>
>>>>>
>>>>
>>
>

Mime
View raw message