Hi Davies:

As agreed, here is the profiler output. Do you see anything suspicious?


This is the code run (in pyspark with the conf above):

input = sc.textFile(inputFile)
input.count()


[Inline image 1: profiler output]


Best, 

Guillaume Guy
 +1 919 - 972 - 8750

On Sat, Feb 28, 2015 at 8:13 AM, Davies Liu <davies@databricks.com> wrote:
No, it should not be that slow. On my Mac, it took 1.4 minutes to run
`rdd.count()` on a 4.3 GB text file (about 25 MB/s per CPU).
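
To put the two measurements on the same scale, here is a rough per-core throughput calculation. The sizes and timings are the ones quoted in this thread; the core count on the Mac is not stated, so 2 cores is an assumption (it is the value that reproduces the roughly 25 MB/s per CPU figure), and `mb_per_sec_per_core` is just an illustrative helper name.

```python
# Back-of-the-envelope throughput comparison. All inputs come from this
# thread except the Mac's core count, which is an assumption.

def mb_per_sec_per_core(size_gb, minutes, cores):
    """Return read throughput in MB/s per core (1 GB taken as 1024 MB)."""
    return size_gb * 1024 / (minutes * 60) / cores

# Mac: 4.3 GB in 1.4 minutes; 2 cores is assumed, not stated in the thread.
mac = mb_per_sec_per_core(4.3, 1.4, cores=2)

# Cluster: 10 GB in 17 minutes on 3 executors x 2 cores = 6 cores.
cluster = mb_per_sec_per_core(10, 17, cores=6)

print(f"Mac:     {mac:.1f} MB/s per core")
print(f"Cluster: {cluster:.1f} MB/s per core")
print(f"Ratio:   {mac / cluster:.0f}x")
```

Under these assumptions the cluster comes out at under 2 MB/s per core, more than ten times slower per core than the Mac, which is why the 17 minutes looks wrong rather than merely slow.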

Could you turn on profiling in PySpark to see what is happening in the
Python process?

spark.python.profile = true
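
With that setting on, the collected profiles can be printed from the driver with `sc.show_profiles()` after the job has run. PySpark's profiler is built on Python's cProfile, so the dump has the usual cProfile columns (ncalls, tottime, cumtime). As a rough local sketch of what to look for, without Spark at all: the `count_lines` function and the `sample` data below are hypothetical stand-ins, not the real job.

```python
import cProfile
import io
import pstats

def count_lines(lines):
    """Count items one at a time, roughly what a Python worker does for count()."""
    total = 0
    for _ in lines:
        total += 1
    return total

# Hypothetical stand-in data; the real job reads inputFile from HDFS.
sample = ["row %d" % i for i in range(100000)]

profiler = cProfile.Profile()
profiler.enable()
n = count_lines(sample)
profiler.disable()

# Render the top entries sorted by cumulative time, like a profile dump.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()

print(n)
print(report)
```

In the real dump, the interesting part is which functions dominate cumtime: time spent in deserialization or I/O reads points at a different problem than time spent in the counting loop itself.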

On Fri, Feb 27, 2015 at 4:14 PM, Guillaume Guy
<guillaume.c.guy@gmail.com> wrote:
> It is a simple text file.
>
> I'm not using SQL, just doing an rdd.count() on it. Does the bug affect it?
>
>
> On Friday, February 27, 2015, Davies Liu <davies@databricks.com> wrote:
>>
>> What is this dataset? A text file or a Parquet file?
>>
>> There is a serialization issue in Spark SQL that makes it very slow; see
>> https://issues.apache.org/jira/browse/SPARK-6055. It will be fixed
>> very soon.
>>
>> Davies
>>
>> On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
>> <guillaume.c.guy@gmail.com> wrote:
>> > Hi Sean:
>> >
>> > Thanks for your feedback. Scala is much faster: the count completes in
>> > ~1 minute (vs. 17 minutes). I would expect Scala to be 2-5x faster, but
>> > this gap seems larger than that. Is that also your conclusion?
>> >
>> > Thanks.
>> >
>> >
>> > Best,
>> >
>> > Guillaume Guy
>> >  +1 919 - 972 - 8750
>> >
>> > On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <sowen@cloudera.com> wrote:
>> >>
>> >> That's very slow, and there are a lot of possible explanations. The
>> >> first one that comes to mind is: I assume your YARN and HDFS are on
>> >> the same machines, but are you running executors on all HDFS nodes
>> >> when you run this? If not, a lot of these reads could be remote.
>> >>
>> >> You have 6 executor slots, but your data exists in 96 blocks on HDFS.
>> >> You could read with up to 96-way parallelism. You say you're CPU-bound,
>> >> though; normally I'd wonder if this was simply a case of
>> >> under-using parallelism.
>> >>
>> >> I also wonder if the bottleneck is something to do with pyspark in
>> >> this case; might be good to just try it in the spark-shell to check.
>> >>
>> >> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
>> >> <guillaume.c.guy@gmail.com> wrote:
>> >> > Dear Spark users:
>> >> >
>> >> > I want to see if anyone has an idea of the performance to expect from
>> >> > a small cluster.
>> >> >
>> >> > Reading from HDFS, what should the performance of a count() operation
>> >> > be on a 10 GB RDD with 100M rows using pyspark? I looked at the CPU
>> >> > usage: all 6 cores are at 100%.
>> >> >
>> >> > Details:
>> >> >
>> >> > master yarn-client
>> >> > num-executors 3
>> >> > executor-cores 2
>> >> > driver-memory 5g
>> >> > executor-memory 2g
>> >> > Distribution: Cloudera
>> >> >
>> >> > I also attached the screenshot.
>> >> >
>> >> > Right now, I'm at 17 minutes, which seems quite slow. Any idea what
>> >> > decent performance looks like with a similar configuration?
>> >> >
>> >> > If it's way off, I would appreciate any pointers on how to improve
>> >> > performance.
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Best,
>> >> >
>> >> > Guillaume
>> >> >
>> >> >
>> >
>> >
>
>
>
> --
>
> Best,
>
> Guillaume Guy
>  +1 919 - 972 - 8750
>