spark-user mailing list archives

From Zhan Zhang <zzh...@hortonworks.com>
Subject Re: Problem using limit clause in spark sql
Date Thu, 24 Dec 2015 03:44:08 GMT
There has to be a central point that collects exactly 10000 records; currently the approach is to use one single partition, which is easy to implement.
Otherwise, the driver would have to count the number of records in each partition and then decide
how many records to materialize from each partition, because some partitions may not have
enough records, and some may even be empty.
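
For illustration only, the two-pass bookkeeping that would otherwise be needed looks roughly like this on the plain RDD API (a sketch, not what Spark actually does; the quota logic here is the simplest possible one):

// Pass 1: the driver learns the size of every partition.
val rdd = sqlContext.sql("select * from table").rdd
val sizes = rdd.mapPartitionsWithIndex { (i, it) =>
  Iterator((i, it.size))
}.collect().toMap

// The driver assigns each partition a quota, walking partitions
// in order until the running total reaches the target.
val target = 10000
var remaining = target
val quotas = sizes.toSeq.sortBy(_._1).map { case (i, n) =>
  val q = math.min(n, remaining); remaining -= q; (i, q)
}.toMap

// Pass 2: each partition materializes only its quota.
val limited = rdd.mapPartitionsWithIndex { (i, it) => it.take(quotas(i)) }

Note it scans the data twice just to plan the take, which is exactly why the single-partition implementation was chosen.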

I don’t see any straightforward workaround for this.

Thanks.

Zhan Zhang



On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwoxin@icloud.com> wrote:

It is an application running as an HTTP server, so I collect the data as the response.

On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupload@hotmail.com> wrote:

When you call collect(), it will bring all the data to the driver. Do you mean to call persist()
instead?
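
That is, roughly the following (df here just stands for whatever DataFrame you are building the response from):

val df = sqlContext.sql("select * from table limit 10000")
df.persist()            // keeps the rows cached on the executors, still distributed
val rows = df.collect() // by contrast, this ships every row back to the driver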

________________________________
From: tiandiwoxin@icloud.com
Subject: Problem using limit clause in spark sql
Date: Wed, 23 Dec 2015 21:26:51 +0800
To: user@spark.apache.org

Hi,
I am using spark sql in a way like this:

sqlContext.sql("select * from table limit 10000").map(...).collect()

The problem is that the limit clause collects all 10,000 records into a single partition,
so the map afterwards runs in only one partition and is really slow. I tried to
use repartition, but it is kind of a waste to collect all those records into one partition,
shuffle them around, and then collect them again.

Is there a way to work around this?
BTW, there is no order by clause, and I do not care which 10000 records I get as long as the
total number is less than or equal to 10000.
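
For reference, the repartition version I tried looks something like this (the partition count and the map body are placeholders for my real values):

val rows = sqlContext.sql("select * from table limit 10000")
  .repartition(16)          // spread the single post-limit partition back out
  .map(row => process(row)) // process stands in for my real transformation
  .collect()

It does run in parallel again, but only after paying for a full shuffle of all 10000 rows, which feels wasteful.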

