Looks like the problem is that df.rdd does not work very well with limit. In Scala, df.limit(1).rdd will also trigger the issue you observed. I will add this to the JIRA.

On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam <chilinglam@gmail.com> wrote:
I just noticed you found that 1.4 has the same issue. I added that to the ticket as well.

On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chilinglam@gmail.com> wrote:
Hi Yin,

You are right! I just tried the Scala version of the above lines, and it works as expected.
I'm not sure whether it also happens in 1.4 for PySpark, but I thought the PySpark code just calls the Scala code via Py4J, so I didn't expect this bug to be PySpark-specific. That actually surprises me a bit. I created a ticket for this (SPARK-10731).

Best Regards,


On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yhuai@databricks.com> wrote:
By the way, does 1.4 have the same problem?

On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yhuai@databricks.com> wrote:
Hi Jerry,

Looks like it is a Python-specific issue. Can you create a JIRA?



On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chilinglam@gmail.com> wrote:
Hi Spark Developers,

I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head(), and first().

For your reference, this is what I did in PySpark 1.5:

The above lines take over 15 minutes. I was frustrated because I could do better without using Spark :) Since I like Spark, I tried to figure out why. It seems the DataFrame requires 3 stages to give me the first row: it reads all the data (about 1 billion rows) and runs Limit twice.

show(1), on the other hand, runs much faster than head(). Not to mention that if I do:

df.rdd.take(1)  # runs much faster
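The speed difference is consistent with short-circuiting: take(1) on the RDD only pulls as many rows as it needs, while a plan that scans everything before applying Limit pays for the full pass. A plain-Python analogy (no Spark involved; the generator and counter here are made up purely for illustration) of the two evaluation styles:

```python
# Plain-Python analogy: full scan followed by a limit vs. lazy short-circuit.
# "make_rows" stands in for a large dataset; "reads" counts rows actually touched.

def make_rows(n, reads):
    for i in range(n):
        reads[0] += 1  # record each row actually read
        yield i

# Style 1: materialize everything, then take the first row
# (analogous to a plan that scans all the data before applying Limit).
reads_eager = [0]
first_eager = list(make_rows(1000000, reads_eager))[:1]

# Style 2: pull only what is needed (analogous to rdd.take(1)).
reads_lazy = [0]
first_lazy = [next(make_rows(1000000, reads_lazy))]

print(first_eager, reads_eager[0])  # same first row, but every row was read
print(first_lazy, reads_lazy[0])    # same first row, only one row was read
```

This only illustrates why short-circuiting matters; whatever the actual planner bug is in PySpark 1.5 is what SPARK-10731 tracks.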

Is this expected? Why are head/first/take so slow for DataFrames? Is it a bug in the optimizer, or did I do something wrong?

Best Regards,