spark-user mailing list archives

From Alex Boisvert <alex.boisv...@gmail.com>
Subject Re: partitions, coalesce() and parallelism
Date Wed, 25 Jun 2014 00:39:25 GMT
Yes.

scala> rawLogs.partitions.size
res1: Int = 2171



On Tue, Jun 24, 2014 at 4:00 PM, Mayur Rustagi <mayur.rustagi@gmail.com>
wrote:

> To be clear, the number of map tasks is determined by the number of
> partitions inside the RDD, hence Nicholas's suggestion.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> So do you get 2171 as the output for that command? That command tells
>> you how many partitions your RDD has, so it’s good to first confirm that
>> rdd1 has as many partitions as you think it has.
>>
>>
>> On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert <alex.boisvert@gmail.com>
>> wrote:
>>
>>> It's actually a set of 2171 S3 files, with an average size of about 18MB.
>>>
>>>
>>> On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas <
>>> nicholas.chammas@gmail.com> wrote:
>>>
>>>> What do you get for rdd1._jrdd.splits().size()? You might think you’re
>>>> getting > 100 partitions, but it may not be happening.
>>>>
>>>>
>>>> On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert <alex.boisvert@gmail.com
>>>> > wrote:
>>>>
>>>>> With the following pseudo-code,
>>>>>
>>>>> val rdd1 = sc.sequenceFile(...) // has > 100 partitions
>>>>> val rdd2 = rdd1.coalesce(100)
>>>>> val rdd3 = rdd2 map { ... }
>>>>> val rdd4 = rdd3.coalesce(2)
>>>>> rdd4.saveAsTextFile(...) // want only two output files
>>>>>
>>>>> I would expect the parallelism of the map() operation to be 100
>>>>> concurrent tasks, and the parallelism of the save() operation to be 2.
>>>>>
>>>>> However, it appears the parallelism of the entire chain is 2 -- I only
>>>>> see two tasks created for the save() operation, and those tasks appear
>>>>> to execute the map() operation as well.
>>>>>
>>>>> Assuming what I'm seeing is as-specified (meaning, how things are
>>>>> meant to be), what's the recommended way to force a parallelism of
>>>>> 100 on the map() operation?
>>>>>
>>>>> thanks!
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
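For context on the behavior described above: a plain `coalesce(n)` creates a narrow dependency, so the final partition count (2 here) sets the task count for the whole stage, including the upstream `map()`. One commonly recommended workaround is to pass `shuffle = true` to the final `coalesce` (or use `repartition`), which inserts a stage boundary. A minimal sketch, assuming Spark's Scala RDD API and a live `SparkContext` `sc`; the paths and value types are placeholders, not from the thread:

```scala
// Illustrative sketch only (not from the original thread).
// Requires a running SparkContext `sc`; paths are hypothetical.
val rdd1 = sc.sequenceFile[String, String]("s3n://bucket/in")
val rdd2 = rdd1.coalesce(100)            // narrow coalesce: no shuffle
val rdd3 = rdd2.map { case (_, v) => v } // should now run as 100 tasks

// shuffle = true puts a shuffle (stage) boundary between the map
// and the write, so only the final write runs with 2 tasks.
rdd3.coalesce(2, shuffle = true).saveAsTextFile("s3n://bucket/out")
```

The trade-off is the cost of shuffling the mapped data down to 2 partitions, which is usually acceptable when the goal is a small number of output files.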
