spark-user mailing list archives

From James Barney <jamesbarne...@gmail.com>
Subject Re: Sample sql query using pyspark
Date Tue, 01 Mar 2016 15:01:38 GMT
Maurin,

I don't know the full technical reason why, but try removing the 'limit 100'
part of your query. I was trying to do something similar the other week, and
what I found is that each executor doesn't necessarily get the same 100 rows:
a LIMIT without an ORDER BY isn't deterministic, so each evaluation of the
query can return a different slice of rows. Joins would fail or result in a
bunch of nulls when keys weren't found between the slices of 100 rows.

Once I removed the 'limit xxxx' part of my query, all the results were the
same across the board and taking samples worked again.
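For what it's worth, here's roughly what that looked like on my end (untested
sketch; the table and column names are placeholders). Passing an explicit seed
to sample() also makes the sample reproducible across runs:

    # `sqlContext` is the SQLContext the pyspark shell creates for you.
    # Build the DataFrame with no LIMIT clause.
    df = sqlContext.sql(
        "SELECT Category, SUM(bookings) AS bookings "
        "FROM some_table GROUP BY Category"
    )

    # withReplacement=False, 50% fraction; the seed makes it repeatable.
    sampled = df.sample(False, 0.5, seed=42)
    results = sampled.collect()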

If the amount of data is too large, or you're just trying to test on a
smaller data set, just define another table and insert only 100 rows into
that table.
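Something along these lines should do it (again just a sketch; it assumes a
Hive-backed sqlContext, since CREATE TABLE ... AS SELECT needs Hive support,
and `groupon_dropbox_sample` is a made-up name):

    # Materialize a fixed set of 100 rows once.
    sqlContext.sql(
        "CREATE TABLE groupon_dropbox_sample AS "
        "SELECT * FROM groupon_dropbox LIMIT 100"
    )

    # Every executor now reads the same 100 rows, so joins and samples
    # against this table are consistent.
    small_df = sqlContext.table("groupon_dropbox_sample")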

I hope that helps!

On Tue, Mar 1, 2016 at 3:10 AM, Maurin Lenglart <maurin@cuberonlabs.com>
wrote:

> Hi,
> I am trying to get a sample of a sql query to make the query run
> faster.
> My query looks like this:
> SELECT `Category` as `Category`,sum(`bookings`) as
> `bookings`,sum(`dealviews`) as `dealviews` FROM groupon_dropbox WHERE
>  `event_date` >= '2015-11-14' AND `event_date` <= '2016-02-19' GROUP BY
> `Category` LIMIT 100
>
> The table is partitioned by event_date. And the code I am using is:
>  df = self.df_from_sql(sql, srcs)
>
> results = df.sample(False, 0.5).collect()
>
> The results are a little bit different, but the execution time is almost
> the same. Am I missing something?
>
>
> thanks
>
>
