spark-user mailing list archives

From tj opensource <opensou...@tuplejump.com>
Subject Re: Memory footprint of Calliope: Spark -> Cassandra writes
Date Wed, 18 Jun 2014 06:12:35 GMT
Gerard,

We haven't done a test on Calliope vs a driver.

The thing is, Calliope builds on C* Thrift (and the latest build on the
DataStax driver), so raw single-write performance will be similar to any
existing driver. But that is not Calliope's use case.

It is built to be used from Spark and to harness the distributed nature of
Spark. With a regular driver you would have to take care of multithreading,
splitting the data, etc., while with Spark and Calliope this comes for free.
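
To make that concrete, here is a rough sketch of the splitting and
multithreading a user of a plain driver would have to hand-roll themselves
(`writeRow` is a hypothetical stand-in for a real driver call, not an actual
Calliope or DataStax API); with Spark you get the equivalent distribution
from `sc.parallelize(...)` and the task scheduler for free:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ManualParallelWrites {
  def writeRow(row: String): Unit = ()  // stand-in for a real Cassandra write

  // Split `rows` into `chunks` groups and write the groups concurrently;
  // returns the number of rows written.
  def writeAll(rows: Seq[String], chunks: Int): Int = {
    val groupSize = math.max(1, (rows.size + chunks - 1) / chunks)
    val futures = rows.grouped(groupSize).toSeq.map { chunk =>
      Future { chunk.foreach(writeRow); chunk.size }
    }
    Await.result(Future.sequence(futures), 1.minute).sum
  }

  def main(args: Array[String]): Unit =
    println(writeAll((1 to 100).map(i => s"row$i"), chunks = 4))
}
```

This is exactly the boilerplate (chunking, thread pools, waiting on
completion, error handling) that disappears when the RDD partitions do the
splitting and the Spark executors do the parallel writing.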

Regards,
Rohit



On Tue, Jun 17, 2014 at 9:24 PM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi Rohit,
>
> Thanks a lot for looking at this.  The intention of calculating the data
> upfront is to benchmark only the storage rate in records/sec, eliminating
> the generation cost from the measurement (which will be different in the
> real scenario, reading from HDFS).
> I used a profiler today and indeed it's not the storage part but the
> generation that's bloating the memory.  Objects in memory take surprisingly
> more space than one would expect based on the data they hold. In my case it
> was 2.1x the size of the original data.
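
That kind of blow-up is consistent with JVM object layout: each small String
in a generated row carries an object header, fields, a separate backing
array, and alignment padding. A back-of-envelope estimate for one row of the
benchmark data (the layout constants below are typical assumptions for a
64-bit JVM with compressed oops and pre-Java-9 `char[]`-backed Strings, not
measured values):

```scala
object OverheadEstimate {
  def align(b: Int): Int = (b + 7) / 8 * 8          // 8-byte object alignment

  // Estimated heap bytes for a String of n characters: the String object
  // (12-byte header + hash field + reference to the char[]) plus the
  // char[] itself (16-byte header + 2 bytes per char).
  def stringBytes(n: Int): Int =
    align(12 + 4 + 4) + align(16 + 2 * n)

  def main(args: Array[String]): Unit = {
    val row = Array("devy1", "aggr", "1000", "sum", (1 to 11).mkString(","))
    val raw = row.map(_.length).sum                  // payload as ASCII bytes
    val heap = align(16 + 4 * row.length) +          // the Array[String] itself
      row.map(s => stringBytes(s.length)).sum
    println(f"raw=$raw%d heap=$heap%d factor=${heap.toDouble / raw}%.1f")
  }
}
```

Even this crude model puts the in-heap size at several times the raw
character count, before any RDD bookkeeping is added on top.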
>
> While we are on the topic: do you have any figures on how Calliope
> compares, performance-wise, to a classic Cassandra driver
> (DataStax / Astyanax)?  That would be awesome.
>
> Thanks again!
>
> -kr, Gerard.
>
> On Tue, Jun 17, 2014 at 4:27 PM, tj opensource <opensource@tuplejump.com>
> wrote:
>
>> Dear Gerard,
>>
>> I just tried the code you posted in the gist (
>> https://gist.github.com/maasg/68de6016bffe5e71b78c) and it does give an
>> OOM. That is because the data is generated locally and then parallelized:
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>>     val entries = for (i <- 1 to total) yield {
>>       Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(","))
>>     }
>>
>>     val rdd = sc.parallelize(entries, 8)
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>> This will generate all the data on the local system and then try to
>> partition it.
>>
>> Instead, we should parallelize the keys (i <- 1 to total) and generate
>> data in the map tasks. This is *closer* to what you will get if you
>> distribute out a file on a DFS like HDFS/SnackFS.
>>
>> I have made the change in the script here (
>> https://gist.github.com/milliondreams/aac52e08953949057e7d)
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>>     val rdd = sc.parallelize(1 to total, 8).map(i =>
>>       Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(",")))
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>>
>> I was able to insert 50M records using just over 350M RAM. Attaching the
>> log and screenshot.
>>
>> Let me know if you still face this issue... we can do a screen share and
>> resolve the issue there.
>>
>> And thanks for using Calliope. I hope it serves your needs.
>>
>> Cheers,
>> Rohit
>>
>>
>> On Mon, Jun 16, 2014 at 9:57 PM, Gerard Maas <gerard.maas@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I've been doing some testing with Calliope as a way to do batch load
>>> from Spark into Cassandra.
>>> My initial results are promising on the performance area, but worrisome
>>> on the memory footprint side.
>>>
>>> I'm generating N records of about 50 bytes each and using the UPDATE
>>> mutator to insert them into C*.   I get an OOM if my memory is below 1GB
>>> per million records, or about 50MB of raw data (without counting any
>>> RDD/structural overhead).  (See code [1])
>>>
>>> (So, to avoid confusion: I need 4GB of RAM to save 4M 50-byte records to
>>> Cassandra.)  That's an order of magnitude more than the raw data.
>>>
>>> I understood that Calliope builds on top of the Hadoop support of
>>> Cassandra, which builds on top of SSTables and sstableloader.
>>>
>>> I would like to know what the memory usage factor of Calliope is and what
>>> parameters I could use to control/tune it.
>>>
>>> Any experience/advice on that?
>>>
>>>  -kr, Gerard.
>>>
>>> [1] https://gist.github.com/maasg/68de6016bffe5e71b78c
>>>
>>
>>
>
