spark-user mailing list archives

From Gary Malouf <malouf.g...@gmail.com>
Subject Spark performance on smallerish data sets: EC2 Mediums
Date Tue, 01 Oct 2013 17:54:39 GMT
Hi everyone,

We have an HDFS setup with a namenode and three datanodes, all on EC2
mediums.  One of our data partitions holds files fed from a few Flume
instances rolling hourly.  This equates to around three 16 MB files right
now, although our traffic is projected to double in the next few weeks.

Our Mesos cluster consists of a master and three slave nodes, also on EC2
mediums.  Spark jobs are launched from the master and scheduled across the
cluster.

My question is: for grabbing on the order of 3 hours of data at this size,
what should we expect Spark's performance to be?  For a simple count over
the thousands of data entries serialized in these sequence files, we are
seeing query times of around 180-200 seconds.  While this is surely faster
than Hadoop, we were under the impression that response times would be
significantly faster.
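
For concreteness, here is a minimal Scala sketch of the kind of count we
run (the Mesos master URL, HDFS path, and Writable key/value types below
are placeholders; our real records are custom serialized types):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object HourlyCount {
      def main(args: Array[String]) {
        // Placeholder master URL and app name for our Mesos setup.
        val sc = new SparkContext("mesos://master:5050", "HourlyCount")

        // Read ~3 hours of hourly-rolled Flume output from HDFS;
        // LongWritable/Text stand in for our actual record types.
        val records = sc.sequenceFile[LongWritable, Text](
          "hdfs://namenode:9000/flume/2013/10/01/{15,16,17}/*")

        println("count = " + records.count())
        sc.stop()
      }
    }

Nothing fancier than that - just sequenceFile followed by count.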

Has anyone tested Spark+HDFS on instances smaller than the XLs?
