spark-user mailing list archives

From tj opensource <opensou...@tuplejump.com>
Subject Re: Memory footprint of Calliope: Spark -> Cassandra writes
Date Tue, 17 Jun 2014 14:27:40 GMT
Dear Gerard,

I just tried the code you posted in the gist (
https://gist.github.com/maasg/68de6016bffe5e71b78c) and it does give an OOM.
That is because the data is generated locally and then parallelized -

----------------------------------------------------------------------------------------------------------------------

    // All `total` entries are materialized in the driver JVM here,
    // before parallelize() distributes them.
    val entries = for (i <- 1 to total) yield {
      Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(","))
    }


    val rdd = sc.parallelize(entries, 8)

----------------------------------------------------------------------------------------------------------------------

This will generate all the data on the local system and then try to
partition it.

Instead, we should parallelize the keys (i <- 1 to total) and generate the data
in the map tasks. This is *closer* to what you will get if you distribute
a file on a DFS like HDFS/SnackFS.

I have made the change in the script here (
https://gist.github.com/milliondreams/aac52e08953949057e7d)

----------------------------------------------------------------------------------------------------------------------


    val rdd = sc.parallelize(1 to total, 8).map(i =>
      Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(",")))
----------------------------------------------------------------------------------------------------------------------
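
For reference, here is a minimal, self-contained sketch of the same approach as
a standalone driver program (the object name, app name, and the final count()
action are just placeholders for illustration; the actual Calliope save call
from the gist would go where count() is):

----------------------------------------------------------------------------------------------------------------------

    import org.apache.spark.{SparkConf, SparkContext}

    object GenerateEntries {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("calliope-load-test"))
        val total = 50000000 // 50M records, as in the run below

        // Only the lightweight Range (1 to total) lives on the driver; each
        // executor builds its own slice of records inside the map task.
        val rdd = sc.parallelize(1 to total, 8).map { i =>
          Array(s"devy$i", "aggr", "1000", "sum", (i to i + 10).mkString(","))
        }

        println(rdd.count()) // placeholder action; the Calliope save goes here
        sc.stop()
      }
    }

----------------------------------------------------------------------------------------------------------------------

The point is that parallelize only ships the small Range, not 50M pre-built
arrays, so driver memory stays flat regardless of total.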

I was able to insert 50M records using just over 350MB of RAM. Attaching the
log and screenshot.

Let me know if you still face this issue... we can do a screen share and
resolve the issue there.

And thanks for using Calliope. I hope it serves your needs.

Cheers,
Rohit


On Mon, Jun 16, 2014 at 9:57 PM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi,
>
> I've been doing some testing with Calliope as a way to do batch load from
> Spark into Cassandra.
> My initial results are promising on the performance area, but worrisome on
> the memory footprint side.
>
> I'm generating N records of about 50 bytes each and using the UPDATE
> mutator to insert them into C*.  I get an OOM if my memory is below 1GB per
> million records, or about 50MB of raw data (without counting any
> RDD/structural overhead).  (See code [1])
>
> (So, to avoid confusion: I need 4GB of RAM to save 4M 50-byte records to
> Cassandra.)  That's an order of magnitude more than the raw data.
>
> I understood that Calliope builds on top of the Hadoop support of
> Cassandra, which builds on top of SSTables and sstableloader.
>
> I would like to know what the memory usage factor of Calliope is and what
> parameters I could use to control/tune it.
>
> Any experience/advice on that?
>
>  -kr, Gerard.
>
> [1] https://gist.github.com/maasg/68de6016bffe5e71b78c
>
