spark-user mailing list archives

From Lucas Fernandes Brunialti <lbrunia...@igcorp.com.br>
Subject Job duration
Date Mon, 28 Oct 2013 16:11:09 GMT
Hello,

We're using Spark to run analytics and ML jobs against Cassandra. Our
analytics jobs are simple (filters and counts), and we're trying to improve
their performance: these jobs take around 4 minutes to query our 160 GB
dataset. We run 5 workers and 1 master on EC2 m1.xlarge instances, with an
8 GB JVM heap.
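
To give an idea, our jobs boil down to something like the sketch below (the
Cassandra loading step is a placeholder, and the Event fields are
illustrative, not our real schema):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class Event(userId: String, eventType: String, timestamp: Long)

    // Placeholder: our real code builds this RDD from Cassandra; the
    // input details are omitted here.
    def loadEventsFromCassandra(sc: SparkContext): RDD[Event] =
      sys.error("placeholder for the Cassandra input code")

    val sc = new SparkContext("spark://<master>:7077", "analytics")
    val events = loadEventsFromCassandra(sc)

    // A typical job: a simple predicate followed by a count over ~160 GB.
    val clicks = events.filter(_.eventType == "click").count()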

We tried increasing the JVM heap to 12 GB, but saw no gain in performance.
We're using CACHE_ONLY (after some tests we found it works best for us), but
it's not caching everything, only around 1,000 of 2,500 blocks. Maybe the
cache isn't what limits performance and the bottleneck is Cassandra I/O (?)
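
For reference, the caching step looks roughly like this (using the same
placeholder loader as above; the MEMORY_ONLY_SER shown here is just one of
the standard StorageLevel options, not necessarily what we set):

    import org.apache.spark.storage.StorageLevel

    val events = loadEventsFromCassandra(sc)

    // Persist the RDD; a serialized level fits more blocks into the same
    // heap at the cost of some CPU for deserialization.
    events.persist(StorageLevel.MEMORY_ONLY_SER)

    // The first action materializes the cache; the Storage tab of the web
    // UI then shows how many partitions actually fit in memory.
    events.count()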

I saw that the people from Ooyala can run analytics jobs in milliseconds
(http://www.youtube.com/watch?v=6kHlArorzvs). Any advice?

Appreciate the help!

Lucas.

-- 

Lucas Fernandes Brunialti

Dev/Ops Software Engineer

+55 9 6512 4514

lbrunialti@igcorp.com.br
