From Nathan Kronenfeld <nkronenf...@oculusinfo.com>
Subject Re: data locality, task distribution
Date Thu, 13 Nov 2014 05:01:28 GMT
Sorry, I think I wasn't clear about what I meant.
I didn't mean that the cached percentage went down within a run, with the same
instance.

I meant that I'd run the whole app, and one time it would cache 100%, while the
next run it might cache only 83%.

Within a run, it doesn't change.
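
(For what it's worth, the same percentages can also be pulled programmatically
from the driver - a rough sketch, assuming sc is the active SparkContext and
the RDD has already been persisted and materialized:)

    // Report how much of each persisted RDD actually got cached.
    sc.getRDDStorageInfo.foreach { info =>
      val pct = 100.0 * info.numCachedPartitions / info.numPartitions
      println(f"RDD ${info.id}: ${info.numCachedPartitions}/${info.numPartitions} " +
        f"partitions cached ($pct%.1f%%), ${info.memSize} bytes in memory")
    }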

On Wed, Nov 12, 2014 at 11:31 PM, Aaron Davidson <ilikerps@gmail.com> wrote:

> The fact that the caching percentage went down is highly suspicious. It
> should generally not decrease unless other cached data took its place, or
> unless executors were dying. Do you know if either of these was the
> case?
>
> On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <
> nkronenfeld@oculusinfo.com> wrote:
>
>> Can anyone point me to a good primer on how Spark decides where to send
>> what task, how it distributes them, and how it determines data locality?
>>
>> I'm trying a pretty simple task - it's doing a foreach over cached data,
>> accumulating some (relatively complex) values.
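>>
>> (Schematically it's something like the sketch below, with a plain Long
>> accumulator standing in for the real, more complex value; assume sc is the
>> SparkContext.)
>>
>>     val data = sc.parallelize(1 to 1000000, 400).cache()  // 400 partitions, as in the real job
>>     data.count()                    // materialize the cache
>>     val acc = sc.accumulator(0L)    // stand-in for the (relatively complex) accumulated value
>>     data.foreach(x => acc += 1L)    // the real foreach does the heavy lifting here
>>     println(acc.value)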
>>
>> So I see several inconsistencies I don't understand:
>>
>> (1) If I run it a couple of times, as separate applications (i.e.,
>> reloading, recaching, etc.), I get a different percentage cached each time.
>> I've got about 5x as much memory as I need overall, so it isn't running
>> out.  But one time, 100% of the data will be cached; the next, 83%, the
>> next, 92%, etc.
>>
>> (2) Also, the data is very unevenly distributed. I've got 400 partitions,
>> and 4 workers (with, I believe, 3x replication), and on my last run, my
>> distribution was 165/139/25/71.  Is there any way to get Spark to
>> distribute the tasks more evenly?
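>>
>> (For reference, a sketch of how the spread can be checked from the driver,
>> assuming data is the cached RDD: it reports which host each partition's task
>> ran on, which should match where the blocks live as long as everything stays
>> PROCESS_LOCAL/NODE_LOCAL.)
>>
>>     import java.net.InetAddress
>>     import org.apache.spark.SparkContext._  // for reduceByKey on pair RDDs
>>     val partitionsPerHost = data.mapPartitions { _ =>
>>       Iterator((InetAddress.getLocalHost.getHostName, 1))
>>     }.reduceByKey(_ + _).collect()
>>     partitionsPerHost.foreach { case (host, n) => println(s"$host: $n partitions") }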
>>
>> (3) If I run the problem several times in the same execution (to take
>> advantage of caching, etc.), I get very inconsistent results.  On my latest
>> try, I got:
>>
>>    - 1st run: 3.1 minutes
>>    - 2nd run: 2 seconds
>>    - 3rd run: 8 minutes
>>    - 4th run: 2 seconds
>>    - 5th run: 2 seconds
>>    - 6th run: 6.9 minutes
>>    - 7th run: 2 seconds
>>    - 8th run: 2 seconds
>>    - 9th run: 3.9 minutes
>>    - 10th run: 8 seconds
>>
>> I understand the difference for the first run; it was caching that time.
>> Later times, when it finishes in 2 seconds, it's because all the
>> tasks were PROCESS_LOCAL; when it takes longer, the last 10-20% of the
>> tasks end up with locality level ANY.  Why would that change when running
>> the exact same task twice in a row on cached data?
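>>
>> (I could presumably paper over the slow runs by raising spark.locality.wait,
>> so the scheduler waits longer for a process-local slot before falling back to
>> ANY - untested sketch below, with the value in milliseconds on this version -
>> but I'd still like to understand why the locality changes at all.)
>>
>>     import org.apache.spark.{SparkConf, SparkContext}
>>     // Wait up to 10s (default 3s) at each locality level before relaxing it;
>>     // this has to be set before the SparkContext is created.
>>     val conf = new SparkConf()
>>       .setAppName("locality-test")
>>       .set("spark.locality.wait", "10000")
>>     val sc = new SparkContext(conf)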
>>
>> Any help or pointers that I could get would be much appreciated.
>>
>>
>> Thanks,
>>
>>                  -Nathan
>>
>>
>>
>> --
>> Nathan Kronenfeld
>> Senior Visualization Developer
>> Oculus Info Inc
>> 2 Berkeley Street, Suite 600,
>> Toronto, Ontario M5A 4J5
>> Phone:  +1-416-203-3003 x 238
>> Email:  nkronenfeld@oculusinfo.com
>>
>
>


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com
