spark-user mailing list archives

From mathias <>
Subject Performance problems on SQL JOIN
Date Fri, 20 Jun 2014 15:16:23 GMT
Hi there,

We're trying out Spark and are experiencing some performance issues using
Spark SQL.
Can anyone tell us whether our results are normal?

We are using the Amazon EC2 scripts to create a cluster with 3
workers/executors (m1.large).
We tried both Spark 1.0.0 and the git master, in both the Scala and the
Python shells.

Running the following code takes about 5 minutes, which seems a long time
for this query.

val file = sc.textFile("s3n:// ...  .csv");
val data = file.map(x => x.split('|')); // 300k rows

case class BookingInfo(num_rooms: String, hotelId: String, toDate: String,
... );
val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1),
... , x(9))); // 50k rows
val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1),
... , x(9))); // 30k rows


sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId =
rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count();
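
For reference, here is a minimal plain-Scala sketch (no Spark involved) of what
the LEFT JOIN above computes. It only illustrates the join semantics and the
resulting row count; it says nothing about how Spark SQL actually executes the
query. The three-field Booking class, the leftJoin helper and the sample rows
are hypothetical and made up for illustration. It can be pasted into the Scala
shell.

```scala
// Hypothetical, trimmed-down record; the real BookingInfo has more fields.
case class Booking(numRooms: String, hotelId: String, toDate: String)

// Left outer join on (hotelId, toDate), using a hash lookup on the right side.
// Every left row is kept; unmatched left rows are paired with None.
def leftJoin(left: Seq[Booking], right: Seq[Booking]): Seq[(Booking, Option[Booking])] = {
  val rightByKey = right.groupBy(b => (b.hotelId, b.toDate))
  left.flatMap { l =>
    rightByKey.get((l.hotelId, l.toDate)) match {
      case Some(matches) => matches.map(r => (l, Option(r)))
      case None          => Seq((l, Option.empty[Booking]))
    }
  }
}

// Made-up sample rows, for illustration only.
val rooms2 = Seq(
  Booking("2", "h1", "2014-06-01"),
  Booking("2", "h1", "2014-06-02"),
  Booking("2", "h2", "2014-06-01"))
val rooms3 = Seq(
  Booking("3", "h1", "2014-06-01"),
  Booking("3", "h1", "2014-06-01"))

val joined = leftJoin(rooms2, rooms3)
println(joined.size) // prints 4: two matches for the first row, one row each for the two unmatched rows
```

Note that a left row with several matches on the right appears once per match,
so the count can exceed the number of rows in rooms2.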

Are we doing something wrong here?
