spark-user mailing list archives

From Nitay Joffe <ni...@actioniq.co>
Subject Re: Spark S3 Performance
Date Sat, 22 Nov 2014 15:20:57 GMT
Anyone have any thoughts on this? I'm especially trying to understand whether
#2 is a legit bug or something I'm doing wrong.

- Nitay
Founder & CTO


On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe <nitay@actioniq.co> wrote:

> I have a simple S3 job to read a text file and do a line count.
> Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
> file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers,
> each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
> Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).
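>
> For reference, the whole job is roughly the sketch below (the bucket/path
> are the placeholders from above, and setting the S3 credentials through
> sc.hadoopConfiguration is just one way to wire it up; keys redacted):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   // Minimal reproduction of the job: count lines of a ~1.2GB text file on S3.
>   val sc = new SparkContext(new SparkConf().setAppName("S3LineCount"))
>   // S3 credentials for the s3n:// filesystem (placeholder values).
>   sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
>   sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")
>   val count = sc.textFile("s3n://mybucket/myfile").count()
>   println(s"Line count: $count")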
>
> The whole count is taking on the order of a couple of minutes, which seems
> extremely slow.
> I've been looking into it and so far have noticed two things; I'm hoping the
> community has seen this before and knows what to do...
>
> 1) Every executor seems to make an S3 call to read the *entire file* before
> making another call to read just its split. Here's a paste I've cleaned up
> to show just one task: http://goo.gl/XCfyZA. I've verified this happens in
> every task. It is taking a long time (40-50 seconds), and I don't see why it
> is doing this at all.
> 2) I've tried a few numPartitions parameters. When I set the parameter to
> anything below 21 it seems to get ignored. Under the hood FileInputFormat is
> doing something that always ends up with at least 21 partitions of ~64MB or
> so (my rough understanding of why is sketched after this list). I've also
> tried 40, 60, and 100 partitions and have seen that performance only gets
> worse as I increase it beyond 21. I would like to try 8 just to see, but
> again I don't see how to force it to go below 21.
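>
> My best guess at the mechanism is sketched below; it's from reading the
> old-API FileInputFormat split logic from memory, so treat it as an
> assumption rather than something verified against the Hadoop 2.4 source.
> The split size works out to roughly max(minSize, min(goalSize, blockSize)),
> and the s3n filesystem reports a 64MB block size (fs.s3n.block.size), so
> asking for fewer partitions only raises goalSize past the 64MB cap:
>
>   // Rough model of FileInputFormat split sizing for a ~1.3GB file on s3n.
>   val totalSize = 1300L * 1024 * 1024            // approximate file size
>   val blockSize = 64L * 1024 * 1024              // fs.s3n.block.size default
>   def numSplitsFor(requested: Int, minSize: Long = 1L): Long = {
>     val goalSize  = totalSize / math.max(requested, 1)
>     val splitSize = math.max(minSize, math.min(goalSize, blockSize))
>     (totalSize + splitSize - 1) / splitSize      // ceil(totalSize / splitSize)
>   }
>   numSplitsFor(8)                                // still ~21: the 64MB cap wins
>   numSplitsFor(8, minSize = 256L * 1024 * 1024)  // ~6 if the min split size is raised
>
> If that's right, then raising the minimum split size, e.g. something like
> sc.hadoopConfiguration.setLong("mapred.min.split.size", 256L * 1024 * 1024)
> before calling textFile (or raising fs.s3n.block.size), should force fewer,
> larger splits, though I haven't confirmed that's the right knob here.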
>
> Thanks for the help,
> - Nitay
> Founder & CTO
>
>
