spark-user mailing list archives

From Russell Cardullo <>
Subject Re: Spark performance on smallerish data sets: EC2 Mediums
Date Wed, 02 Oct 2013 03:23:28 GMT
We have a similar setup using three Large EC2 nodes. We get 64 MB of logs from Flume pushed
to S3 roughly every 2 minutes, and Spark is able to read a single 64 MB file from S3 and process
it in about 30 seconds (doing multiple maps and a reduce by key).

When we first started out, though, we saw very long processing times, on the order of 6 minutes
for a 64 MB file. It turned out to be caused by one of our map closures referencing
a singleton object that was created outside of the closure.

I don't know if that's the case here, but the first thing I would try is running the job locally
and using something like VisualVM to see how many threads it's using.
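
If attaching a profiler is inconvenient, a rough programmatic version of the same check is to record which threads actually execute the map function. The sketch below is plain Python with a thread pool, not Spark code; `count_worker_threads` and `parse_line` are hypothetical names made up for illustration, and the sleep just stands in for real work. It only shows the idea of counting distinct worker threads.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def count_worker_threads(fn, items, max_workers):
    """Run fn over items on a thread pool; return (results, thread names used)."""
    seen = set()
    seen_lock = threading.Lock()

    def traced(item):
        # Record which thread is executing this call before doing the work.
        with seen_lock:
            seen.add(threading.current_thread().name)
        return fn(item)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(traced, items))
    return results, seen


def parse_line(line):
    # Hypothetical stand-in for a map step; the sleep simulates real work.
    time.sleep(0.01)
    return len(line)


if __name__ == "__main__":
    lines = ["a b c", "d e", "f"] * 10
    _, threads = count_worker_threads(parse_line, lines, max_workers=4)
    print(f"{len(threads)} distinct worker thread(s): {sorted(threads)}")
```

If the count stays at one when several workers are configured, the work is effectively running on a single thread, which is the kind of thing worth investigating before blaming instance size.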


On Oct 1, 2013, at 10:54 AM, Gary Malouf <> wrote:

> Hi everyone,
> We have an HDFS setup of a namenode and three datanodes, all on EC2 mediums. One of
> our data partitions basically has files that are fed from a few Flume instances rolling hourly.
> This equates to around three 16 MB files right now, although our traffic is already projected
> to double in the next few weeks.
> Our Mesos cluster consists of a master and three slave nodes on EC2 mediums as well.
> Spark scheduled jobs are launched from the master across the cluster.
> My question is, for grabbing on the order of 3 hours of data of this size, what would the
> expected Spark performance be? For a simple count query of our thousands of data entries
> serialized in these sequence files, we are seeing query times of around 180-200 seconds.
> While this is surely faster than Hadoop, we were under the impression that the response times
> would be significantly faster than this.
> Has anyone tested Spark+HDFS on instances smaller than the XL's?
