spark-dev mailing list archives

From Yin Huai <yh...@databricks.com>
Subject Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()
Date Mon, 21 Sep 2015 17:59:57 GMT
Looks like the problem is that df.rdd does not work well with limit. In
Scala, df.limit(1).rdd also triggers the issue you observed. I will add
this to the JIRA.

On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam <chilinglam@gmail.com> wrote:

> I just noticed you found 1.4 has the same issue. I added that as well in
> the ticket.
>
> On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chilinglam@gmail.com> wrote:
>
>> Hi Yin,
>>
>> You are right! I just tried the Scala version with the above lines; it
>> works as expected.
>> I'm not sure if it also happens in 1.4 for PySpark, but I thought the
>> PySpark code just calls the Scala code via Py4J. I didn't expect this
>> bug to be PySpark-specific; that actually surprises me a bit. I created a
>> ticket for this (SPARK-10731
>> <https://issues.apache.org/jira/browse/SPARK-10731>).
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yhuai@databricks.com> wrote:
>>
>>> btw, does 1.4 have the same problem?
>>>
>>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yhuai@databricks.com> wrote:
>>>
>>>> Hi Jerry,
>>>>
>>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>>
>>>> Thanks,
>>>>
>>>> Yin
>>>>
>>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chilinglam@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Spark Developers,
>>>>>
>>>>> I just ran some very simple operations on a dataset. I was surprised by
>>>>> the execution plan of take(1), head() or first().
>>>>>
>>>>> For your reference, this is what I did in pyspark 1.5:
>>>>> df=sqlContext.read.parquet("someparquetfiles")
>>>>> df.head()
>>>>>
>>>>> The above lines take over 15 minutes. I was frustrated because I can
>>>>> do better without using Spark :) Since I like Spark, I tried to figure
>>>>> out why. It seems the DataFrame requires 3 stages to give me the first
>>>>> row. It reads all the data (about 1 billion rows) and runs Limit twice.
>>>>>
>>>>> Instead of head(), show(1) runs much faster. Not to mention that if I
>>>>> do:
>>>>>
>>>>> df.rdd.take(1)  # runs much faster
>>>>>
>>>>> Is this expected? Why are head/first/take so slow for DataFrames? Is it
>>>>> a bug in the optimizer, or did I do something wrong?
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Jerry
>>>>>
>>>>
>>>>
>>>
>>
>
