spark-user mailing list archives

From Patrick Wendell <pwend...@gmail.com>
Subject Re: Job duration
Date Mon, 28 Oct 2013 16:24:47 GMT
Hey Lucas,

Could you provide some rough pseudo-code for your job? One question is: are
you loading the data from Cassandra every time you perform an action, or do
you cache() the dataset first? If you have a dataset that's already in an
RDD, it's very hard for me to imagine that filters and aggregations could
possibly take 4 minutes... it should be more like seconds.
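For illustration, here is a rough Scala sketch of the difference I mean. The `loadFromCassandra` helper and the filter predicate are placeholders, not your actual job -- the point is only where cache() sits relative to the actions:

```scala
// Sketch only: loadFromCassandra stands in for however you build the RDD
// from Cassandra (e.g. via a Hadoop InputFormat or a connector library).
val events = loadFromCassandra(sc)

// Without cache(), each action below re-reads the full dataset from Cassandra:
val errors = events.filter(_.contains("ERROR")).count()
val total  = events.count()

// With cache(), Cassandra is read once; subsequent actions are served from
// Spark's in-memory block store on the workers:
val cached = events.cache()
cached.count()                               // first action materializes the cache
val errors2 = cached.filter(_.contains("ERROR")).count()  // hits cached blocks
```

If the second pattern is what you're already doing and it's still slow, the bottleneck is more likely the initial Cassandra scan or insufficient cache capacity than Spark itself.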

- Patrick


On Mon, Oct 28, 2013 at 9:11 AM, Lucas Fernandes Brunialti <
lbrunialti@igcorp.com.br> wrote:

> Hello,
>
> We're using Spark to run analytics and ML jobs against Cassandra. Our
> analytics jobs are simple (filters and counts) and we're trying to improve
> the performance: these jobs take around 4 minutes querying 160 GB (the size
> of our dataset). We use 5 workers and 1 master, EC2 m1.xlarge instances
> with 8 GB of JVM heap.
>
> We tried increasing the JVM heap to 12 GB, but saw no gain in performance.
> We're using CACHE_ONLY (after some tests we found it works better), but it's
> not caching everything, just around 1000 of 2500 blocks. Maybe the cache is
> not what's limiting performance, just the Cassandra IO (?)
>
> I saw that people from Ooyala can do analytics jobs in milliseconds (
> http://www.youtube.com/watch?v=6kHlArorzvs). Any advice?
>
> Appreciate the help!
>
> Lucas.
>
> --
>
> Lucas Fernandes Brunialti
>
> *Dev/Ops Software Engineer*
>
> *+55 9 6512 4514*
>
> *lbrunialti@igcorp.com.br* <lbrunialti@igcorp.com.br>
>
