spark-user mailing list archives

From immerrr again <imme...@gmail.com>
Subject pyspark: dataframe.take is slow
Date Tue, 05 Jul 2016 09:27:58 GMT
Hi all!

I'm having a strange issue with pyspark 1.6.1. I have a dataframe,

    df = sqlContext.read.parquet('/path/to/data')

whose "df.take(10)" is really slow, apparently scanning the whole
dataset to take the first ten rows. "df.first()" works fast, as does
"df.rdd.take(10)".

I found https://issues.apache.org/jira/browse/SPARK-10731, which should
have fixed this in 1.6.0, but apparently it hasn't. What am I doing wrong
here, and how can I fix this?

Cheers,
immerrr

