spark-user mailing list archives

From Marco Colombo <ing.marco.colo...@gmail.com>
Subject Spark SQL and number of task
Date Thu, 04 Aug 2016 07:58:44 GMT
Hi all, I have a question about how Hive and Spark handle data.

I've started a new HiveContext and I'm extracting data from Cassandra.
I've configured spark.sql.shuffle.partitions=10.
Now, I run the following query:

select d.id, avg(d.avg) from v_points d where id=90 group by id;

I see that 10 tasks are submitted and execution is fast. Every id in that
table has 2000 samples.

But if I just add another id, as in:

select d.id, avg(d.avg) from v_points d where id=90 or id=2 group by id;

663 tasks are submitted and the query does not finish.

But if I write the query with IN (...), like:

select d.id, avg(d.avg) from v_points d where id in (90,2) group by id;

the query is fast again.

How can I get the execution plan of the query?
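(For reference, Spark SQL can print the plan either through the DataFrame API's explain(true) or an EXPLAIN EXTENDED statement. A minimal sketch, assuming the HiveContext created above is bound to sqlContext — variable names are illustrative:)

    // Print the logical and physical plans for the slow query.
    val df = sqlContext.sql(
      "select d.id, avg(d.avg) from v_points d where id = 90 or id = 2 group by id")
    df.explain(true)

    // Alternatively, ask for the plan directly in SQL:
    sqlContext.sql(
      "EXPLAIN EXTENDED select d.id, avg(d.avg) from v_points d " +
      "where id = 90 or id = 2 group by id"
    ).collect().foreach(println)

Comparing the plans of the OR and IN variants should show whether predicate pushdown to the Cassandra source differs between the two.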

And also, how can I kill the long-running submitted tasks?
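(One way to make a running query cancellable is to tag it with a job group before submitting it. A sketch, assuming sc is the active SparkContext; the group id "slow-query" is an arbitrary label chosen for this example:)

    // Tag subsequent jobs from this thread with a group id so they can be
    // cancelled later; interruptOnCancel asks Spark to interrupt the task threads.
    sc.setJobGroup("slow-query", "avg over v_points", interruptOnCancel = true)
    sqlContext.sql(
      "select d.id, avg(d.avg) from v_points d where id = 90 or id = 2 group by id"
    ).collect()

    // From another thread (e.g. a watchdog), cancel everything in the group:
    sc.cancelJobGroup("slow-query")

The Spark web UI also shows a "kill" link on active stages when spark.ui.killEnabled is set to true.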

Thanks all!
