spark-user mailing list archives

From Hudong Wang <>
Subject RE: Problem using limit clause in spark sql
Date Thu, 24 Dec 2015 00:22:37 GMT
When you call collect(), it brings all the data to the driver. Did you mean to call persist() instead?

Subject: Problem using limit clause in spark sql
Date: Wed, 23 Dec 2015 21:26:51 +0800

Hi, I am using Spark SQL like this:
sqlContext.sql("select * from table limit 10000").map(...).collect()
The problem is that the limit clause collects all 10,000 records into a single partition,
so the subsequent map runs in only one partition and is really slow. I tried
repartition, but it seems wasteful to collect all those records into one partition,
shuffle them around, and then collect them again.
Is there a way to work around this? BTW, there is no order by clause, and I do not care which
10,000 records I get as long as the total is less than or equal to 10,000.
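
One possible workaround, since any subset of at most 10,000 records is acceptable, is to cap each partition separately instead of using a global limit. The sketch below is not from this thread; it illustrates the idea in plain Python with lists standing in for partitions. The names per_partition_quota and take_per_partition are hypothetical helpers, and the PySpark calls mentioned in the comments are an assumption about how it would map onto a real DataFrame.

```python
from itertools import islice


def per_partition_quota(limit, num_partitions):
    """Largest per-partition count whose total can never exceed `limit`."""
    return limit // num_partitions


def take_per_partition(partitions, limit):
    """Keep at most `quota` records from each partition, independently."""
    quota = per_partition_quota(limit, len(partitions))
    return [list(islice(iter(p), quota)) for p in partitions]


# Stand-in for an RDD: four partitions of uneven size (illustrative data only).
partitions = [
    list(range(0, 5000)),
    list(range(5000, 9000)),
    list(range(9000, 9100)),
    list(range(9100, 20000)),
]

limited = take_per_partition(partitions, 10000)
total = sum(len(p) for p in limited)

# Assumed PySpark equivalent (untested here, `df` is a hypothetical DataFrame):
#   quota = 10000 // df.rdd.getNumPartitions()
#   df.rdd.mapPartitions(lambda it: islice(it, quota)).map(...).collect()
```

The trade-off is that small partitions contribute fewer records than their quota, so the result can fall short of 10,000 rows, but no record is moved into a single partition and the later map keeps the original parallelism.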