spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Performance problems on SQL JOIN
Date Sat, 21 Jun 2014 18:19:50 GMT
It's probably because our LEFT JOIN performance isn't great at the moment, since
we'll use a nested loop join. Sorry! We are aware of the problem and there is
a JIRA to let us do this with a HashJoin instead. If you are feeling brave
you might try pulling in the related PR.

https://issues.apache.org/jira/browse/SPARK-2212
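
To illustrate why the join strategy matters here, below is a minimal sketch
of the two approaches on plain Scala collections. It is not Spark SQL's
implementation; the `Booking` case class, its fields, and both function names
are hypothetical, chosen only to mirror the join keys in the query. A nested
loop join compares every left row with every right row, while a hash join
builds a lookup table on the right side once and probes it per left row.

```scala
// Illustrative only: plain Scala collections, not Spark SQL internals.
// Booking and its fields are hypothetical stand-ins for the rows in the question.
object LeftJoinSketch {
  case class Booking(hotelId: String, toDate: String, numRooms: String)

  type Joined = (Booking, Option[Booking])

  // Nested loop left outer join: each left row scans the whole right side.
  // Cost grows with left.size * right.size.
  def nestedLoopLeftJoin(left: Seq[Booking], right: Seq[Booking]): Seq[Joined] =
    left.flatMap { l =>
      val matches = right.filter(r => r.hotelId == l.hotelId && r.toDate == l.toDate)
      if (matches.isEmpty) Seq((l, Option.empty[Booking]))
      else matches.map(r => (l, Option(r)))
    }

  // Hash left outer join: index the right side by the join key once,
  // then do one lookup per left row. Cost grows with left.size + right.size.
  def hashLeftJoin(left: Seq[Booking], right: Seq[Booking]): Seq[Joined] = {
    val index = right.groupBy(r => (r.hotelId, r.toDate))
    left.flatMap { l =>
      index.get((l.hotelId, l.toDate)) match {
        case Some(rs) => rs.map(r => (l, Option(r)))
        case None     => Seq((l, Option.empty[Booking]))
      }
    }
  }
}
```

Both functions return the same rows, including left rows with no match
(paired with `None`). Only the amount of work differs, which is roughly what
a switch from a nested loop join to a hash join would change.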


On Fri, Jun 20, 2014 at 8:16 AM, mathias <mathias@socialsignificance.co.uk>
wrote:

> Hi there,
>
> We're trying out Spark and are experiencing some performance issues using
> Spark SQL.
> Can anyone tell us whether our results are normal?
>
> We are using the Amazon EC2 scripts to create a cluster with 3
> workers/executors (m1.large).
> We tried both Spark 1.0.0 and the git master, in both the Scala and the
> Python shells.
>
> Running the following code takes about 5 minutes, which seems a long time
> for this query.
>
> val file = sc.textFile("s3n:// ...  .csv");
> val data = file.map(x => x.split('|')); // 300k rows
>
> case class BookingInfo(num_rooms: String, hotelId: String, toDate: String,
> ...);
> val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1),
> ... , x(9))); // 50k rows
> val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1),
> ... , x(9))); // 30k rows
>
> rooms2.registerAsTable("rooms2");
> cacheTable("rooms2");
> rooms3.registerAsTable("rooms3");
> cacheTable("rooms3");
>
> sql("""SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId =
> rooms3.hotelId AND rooms2.toDate = rooms3.toDate""").count();
>
>
> Are we doing something wrong here?
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problems-on-SQL-JOIN-tp8001.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
