spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From 汪洋 <>
Subject Re: Problem using limit clause in spark sql
Date Thu, 24 Dec 2015 03:56:04 GMT
I see.  


> 在 2015年12月24日,上午11:44,Zhan Zhang <> 写道:
> There has to have a central point to collaboratively collecting exactly 10000 records,
currently the approach is using one single partitions, which is easy to implement. 
> Otherwise, the driver has to count the number of records in each partition and then decide
how many records  to be materialized in each partition, because some partition may not have
enough number of records, sometimes it is even empty.
> I didn’t see any straightforward walk around for this.
> Thanks.
> Zhan Zhang
> On Dec 23, 2015, at 5:32 PM, 汪洋 < <>>
>> It is an application running as an http server. So I collect the data as the response.
>>> 在 2015年12月24日,上午8:22,Hudong Wang < <>>
>>> When you call collect() it will bring all the data to the driver. Do you mean
to call persist() instead?
>>> From: <>
>>> Subject: Problem using limit clause in spark sql
>>> Date: Wed, 23 Dec 2015 21:26:51 +0800
>>> To: <>
>>> Hi,
>>> I am using spark sql in a way like this:
>>> sqlContext.sql(“select * from table limit 10000”).map(...).collect()
>>> The problem is that the limit clause will collect all the 10,000 records into
a single partition, resulting the map afterwards running only in one partition and being really
slow.I tried to use repartition, but it is kind of a waste to collect all those records into
one partition and then shuffle them around and then collect them again.
>>> Is there a way to work around this? 
>>> BTW, there is no order by clause and I do not care which 10000 records I get
as long as the total number is less or equal then 10000.

View raw message