Hey Lucas,

Could you provide some rough pseudo-code for your job? One question: are you loading the data from Cassandra every time you perform an action, or do you cache() the dataset first? If the dataset is already cached in an RDD, it's very hard for me to imagine that filters and aggregations could take 4 minutes; they should take more like seconds.
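
To make the distinction concrete, here is a toy model in plain Python. It is not the Spark API: LazyDataset, load_from_store and the sample rows are hypothetical stand-ins, and the only point is to count how often the expensive read runs with and without cache().

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, loader):
        self._loader = loader    # function that performs the expensive read
        self._use_cache = False  # set by cache()
        self._cached = None      # materialized rows, once cached
        self.loads = 0           # how many times the source was read

    def cache(self):
        # Only marks the dataset; rows are kept on the first action.
        self._use_cache = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached
        self.loads += 1
        rows = list(self._loader())
        if self._use_cache:
            self._cached = rows
        return rows

    def count(self, predicate=lambda row: True):
        # An "action": forces the data to be read (or taken from the cache).
        return sum(1 for row in self._rows() if predicate(row))


def load_from_store():
    # Hypothetical stand-in for a full scan of the Cassandra table.
    return ({"user": i % 3, "clicks": i} for i in range(10))


# Without cache(): every action goes back to the source.
uncached = LazyDataset(load_from_store)
uncached.count()
uncached.count(lambda r: r["clicks"] > 5)

# With cache(): the first action reads the source, later ones reuse the rows.
cached = LazyDataset(load_from_store).cache()
cached.count()
cached.count(lambda r: r["clicks"] > 5)

print("reads without cache:", uncached.loads)
print("reads with cache:", cached.loads)
```

If your job looks like the first case, each filter or count pays for the full read again, which would be consistent with run times in minutes rather than seconds.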

- Patrick

On Mon, Oct 28, 2013 at 9:11 AM, Lucas Fernandes Brunialti <lbrunialti@igcorp.com.br> wrote:

We're using Spark to run analytics and ML jobs against Cassandra. Our analytics jobs are simple (filters and counts), and we're trying to improve their performance: these jobs take around 4 minutes to query 160 GB (the size of our dataset). We use 5 workers and 1 master on EC2 m1.xlarge instances, with 8 GB of JVM heap.

We tried increasing the JVM heap to 12 GB, but saw no gain in performance. We're using CACHE_ONLY (after some tests we found it works better), but it's not caching everything, just around 1000 of 2500 blocks. Maybe the cache isn't what's affecting performance, just the Cassandra I/O (?)
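
A back-of-the-envelope check of the figures in this message, as a small Python sketch. It assumes the 2500 blocks are roughly equal in size and that the 8 GB heap applies to each of the 5 workers; neither is stated explicitly here. Only part of each heap is available for cached blocks, so the heap total is an upper bound.

```python
# Figures quoted in this message.
workers = 5
heap_gb_per_worker = 8   # assumed to be per worker
dataset_gb = 160
cached_blocks = 1000
total_blocks = 2500

# Aggregate heap across the cluster, and how it compares to the dataset size.
total_heap_gb = workers * heap_gb_per_worker
heap_to_data = total_heap_gb / dataset_gb

# Share of blocks served from the cache versus read again on each job
# (assumes equally sized blocks).
cached_fraction = cached_blocks / total_blocks
uncached_blocks = total_blocks - cached_blocks

print(f"aggregate heap: {total_heap_gb} GB ({heap_to_data:.0%} of the dataset size)")
print(f"cached blocks: {cached_fraction:.0%}, uncached blocks: {uncached_blocks}")
```

Under those assumptions, more than half of the blocks are not cached and would have to be read again on every job, which fits the suspicion that Cassandra I/O dominates the run time.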

I saw that people from Ooyala can run analytics jobs in milliseconds (http://www.youtube.com/watch?v=6kHlArorzvs). Any advice?

Appreciate the help!



Lucas Fernandes Brunialti

Dev/Ops Software Engineer

+55 9 6512 4514