Side note: I thought bzip2 was splittable. Perhaps you meant gzip?

2014년 10월 18일 토요일, Aaron Davidson<ilikerps@gmail.com>님이 작성한 메시지:
The "minPartitions" argument of textFile/hadoopFile cannot decrease the number of splits past the physical number of blocks/files. So if you have 3 HDFS blocks, asking for 2 minPartitions will still give you 3 partitions (hence the "min"). It can, however, convert a file with fewer HDFS blocks into more (so you could ask for and get 4 partitions), assuming the blocks are "splittable". HDFS blocks are usually splittable, but if it's compressed with something like bzip2, it would not be.

If you wish to combine splits from a larger file, you can use RDD#coalesce. With shuffle=false, this will simply concatenate partitions, but it does not provide any ordering guarantees (it uses an algorithm which attempts to coalesce co-located partitions, to maintain locality information). 

coalesce() with shuffle=true causes all of the elements will be shuffled around randomly into new partitions, which is an expensive operation but guarantees uniformity of data distribution.

On Sat, Oct 18, 2014 at 10:47 AM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
Does it retain the order if its pulling from the hdfs blocks, meaning 
if  file1 => a, b, c partition in order
if I convert to 2 partition read will it map to ab, c or a, bc or it can also be a, cb ?



On Sat, Oct 18, 2014 at 9:09 AM, Ilya Ganelin <ilganeli@gmail.com> wrote:

Also - if you're doing a text file read you can pass the number of resulting partitions as the second argument.

On Oct 17, 2014 9:05 PM, "Larry Liu" <larryliu05@gmail.com> wrote:
Thanks, Andrew. What about reading out of local?

On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash <andrew@andrewash.com> wrote:
When reading out of HDFS it's the HDFS block size.

On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu <larryliu05@gmail.com> wrote:
What is the default input split size? How to change it?