spark-user mailing list archives

From 汪洋 <tiandiwo...@icloud.com>
Subject Re: Problem using limit clause in spark sql
Date Thu, 24 Dec 2015 01:32:57 GMT
It is an application running as an HTTP server, so I collect the data as the response.

> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupload@hotmail.com> wrote:
> 
> When you call collect() it will bring all the data to the driver. Do you mean to call
> persist() instead?
> 
> From: tiandiwoxin@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
> 
> Hi,
> I am using Spark SQL like this:
> 
> sqlContext.sql("select * from table limit 10000").map(...).collect()
> 
> The problem is that the limit clause collects all 10,000 records into a single
> partition, so the map afterwards runs in only one partition and is really slow.
> I tried to use repartition, but it seems wasteful to collect all those records
> into one partition, shuffle them around, and then collect them again.
> 
> Is there a way to work around this? 
> BTW, there is no order by clause and I do not care which 10,000 records I get,
> as long as the total number is less than or equal to 10,000.
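
For reference, a minimal sketch of the repartition workaround described in the
question, against the Spark 1.x Scala API. The table name "table" and the 10,000
cap come from the question; the partition count 8 is an arbitrary example value,
not something specified in the thread:

    // Assumes an existing SQLContext named sqlContext and a registered
    // table called "table", both taken from the original question.
    val limited = sqlContext.sql("select * from table limit 10000")
      .repartition(8) // shuffle the ~10,000 rows back across 8 partitions

    // The map now runs on 8 partitions instead of 1.
    val result = limited.map(row => row /* per-row work here */).collect()

    // One possible alternative that avoids the single-partition limit
    // entirely: zipWithIndex() preserves the existing partitioning, so
    // filtering on the index caps the row count in place. Note that it
    // still scans the whole table and runs an extra job to count the
    // elements in each partition.
    val capped = sqlContext.sql("select * from table")
      .rdd
      .zipWithIndex()
      .filter { case (_, idx) => idx < 10000L }
      .map { case (row, _) => row }

The repartition version pays one extra shuffle of at most 10,000 rows, which is
usually cheap; the zipWithIndex version trades that shuffle for a full scan, so
which is better depends on the table size.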

