spark-user mailing list archives

From Gary Malouf <malouf.g...@gmail.com>
Subject Re: Spark performance on smallerish data sets: EC2 Mediums
Date Wed, 02 Oct 2013 03:42:46 GMT
Hi Russell,

I don't think I made it clear that my setup has HDFS on separate nodes from
Spark.  It sounds like your setup has them together, right?
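
As a rough sanity check on the figures quoted below, here is a
back-of-the-envelope throughput calculation.  It is only a sketch: it assumes
the "3 16mb files" are produced per hourly roll (so 3 hours is about 9 files),
and the variable and function names are made up for illustration.

```python
# Back-of-the-envelope throughput from the numbers quoted in this thread.
# Assumption: roughly three 16 MB files land per hour, so 3 hours ~ 9 files.

def throughput_mb_per_s(total_mb, seconds):
    """Effective end-to-end throughput in MB/s."""
    return total_mb / seconds

files_per_hour = 3
file_size_mb = 16
hours = 3
total_mb = files_per_hour * file_size_mb * hours  # total input size in MB

# Count query over ~3 hours of data: 180-200 seconds observed.
hdfs_slow = throughput_mb_per_s(total_mb, 200)
hdfs_fast = throughput_mb_per_s(total_mb, 180)

# Russell's job: one 64 MB file from S3 in about 30 seconds.
s3_rate = throughput_mb_per_s(64, 30)

print(f"input size: {total_mb} MB")
print(f"our count query: {hdfs_slow:.2f}-{hdfs_fast:.2f} MB/s")
print(f"Russell's job: {s3_rate:.2f} MB/s")
```

Under those assumptions the count query is moving well under 1 MB/s end to
end, which suggests fixed overhead rather than data volume dominates.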


On Tue, Oct 1, 2013 at 11:23 PM, Russell Cardullo <russellcardullo@gmail.com
> wrote:

> We have a similar setup using 3 Large EC2 nodes.  We get 64MB of logs from
> flume roughly every 2 minutes pushed to S3, and are able to have Spark read
> a single 64MB file from S3 and process it in about 30 seconds (doing
> multiple maps and a reduce by key).
>
> When we first started out, though, we saw very long processing times, on
> the order of 6 minutes for a 64 MB file.  It turned out to be caused by one
> of our map closures that was referencing a singleton object that was
> created outside of the filter closure.
>
> Don't know if that's the case here, but the first thing I would try is
> running the job locally and using something like VisualVM to see how many
> threads it's using.
>
> --Russell
>
> On Oct 1, 2013, at 10:54 AM, Gary Malouf <malouf.gary@gmail.com> wrote:
>
> > Hi everyone,
> >
> > We have an HDFS setup of a namenode and three datanodes, all on EC2
> mediums.  One of our data partitions basically has files that are fed from
> a few Flume instances rolling hourly.  This equates to around three 16 MB
> files right now, although our traffic even now is projected to double in the
> next few weeks.
> >
> > Our Mesos cluster consists of a Master and three slave nodes on EC2
> mediums as well.  Spark scheduled jobs are launched from the master across
> the cluster.
> >
> > My question is, for grabbing on the order of 3 hours of data this size,
> what would the expected Spark performance be?  For a simple count query of
> our thousands of data entries serialized in these sequence files, we are
> seeing query times of around 180-200 seconds.  While this is surely faster
> than Hadoop, we were under the impression that the response times would be
> significantly faster than this.
> >
> > Has anyone tested Spark+HDFS on instances smaller than the XL's?
> >
> >
>
>
