spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: Memory footprint of Calliope: Spark -> Cassandra writes
Date Tue, 17 Jun 2014 16:33:06 GMT
Gerard,

Strings in particular are very inefficient because the JVM stores them in a
two-byte-per-character (UTF-16) format.  If you use the Kryo serializer with
StorageLevel.MEMORY_ONLY_SER, then Kryo stores Strings as UTF-8, which for
ASCII-like strings takes half the space.
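
A quick, self-contained illustration of that size difference (plain Scala, no
Spark required; the sample string is just an arbitrary example):

```scala
// The JVM's internal String representation is UTF-16: 2 bytes per char.
// A UTF-8 serialization (as Kryo produces) needs 1 byte per ASCII char.
val s = "device-42,aggr,1000,sum"
val utf16Bytes = s.getBytes("UTF-16BE").length // 2 bytes per character
val utf8Bytes  = s.getBytes("UTF-8").length    // 1 byte per ASCII character
// For pure-ASCII text the UTF-16 form is exactly twice as large.
```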

Andrew


On Tue, Jun 17, 2014 at 8:54 AM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi Rohit,
>
> Thanks a lot for looking at this.  The intention of calculating the data
> upfront is to benchmark only the storage throughput in records/sec,
> eliminating the generation cost (which will be different in the real
> scenario, reading from HDFS).
> I used a profiler today and indeed it's not the storage part but the
> generation that's bloating the memory.  Objects in memory take
> surprisingly more space than one would expect from the data they hold: in
> my case, 2.1x the size of the original data.
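
A rough back-of-the-envelope for that overhead (plain Scala; the ~40-byte
per-String constant is an assumed figure for a 64-bit JVM with compressed
oops, not a measured one):

```scala
// Approximate heap footprint of a String: object header + fields of the
// String and its backing char[] (~40 bytes, assumed), plus UTF-16 chars.
def approxStringFootprint(s: String): Long = 40L + 2L * s.length

val raw    = "devy1,aggr,1000,sum,1,2,3,4,5" // ~29 bytes as raw ASCII
val inHeap = approxStringFootprint(raw)      // header/fields + 2 bytes/char
val factor = inHeap.toDouble / raw.length    // heap size vs. raw size
// Even this crude model already puts the in-heap size at >2x the raw data.
```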
>
> Now that we are talking about this, do you have some figures on how
> Calliope compares -performance wise- to a classic Cassandra driver
> (DataStax / Astyanax)?  That would be awesome.
>
> Thanks again!
>
> -kr, Gerard.
>
> On Tue, Jun 17, 2014 at 4:27 PM, tj opensource <opensource@tuplejump.com>
> wrote:
>
>> Dear Gerard,
>>
>> I just tried the code you posted in the gist (
>> https://gist.github.com/maasg/68de6016bffe5e71b78c) and it does give an
>> OOM. It is caused by the data being generated locally and then parallelized
>> -
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>>     val entries = for (i <- 1 to total) yield {
>>       Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(","))
>>     }
>>
>>     val rdd = sc.parallelize(entries, 8)
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>> This will generate all the data on the local system and then try to
>> partition it.
>>
>> Instead, we should parallelize the keys (i <- 1 to total) and generate
>> the data in the map tasks. This is *closer* to what you will get if you
>> distribute a file on a DFS like HDFS/SnackFS.
>>
>> I have made the change in the script here (
>> https://gist.github.com/milliondreams/aac52e08953949057e7d)
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>>     val rdd = sc.parallelize(1 to total, 8).map(i =>
>>       Array(s"devy$i", "aggr", "1000", "sum", (i to i+10).mkString(",")))
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>> I was able to insert 50M records using just over 350M of RAM. Attaching
>> the log and a screenshot.
>>
>> Let me know if you still face this issue... we can do a screen share and
>> resolve the issue there.
>>
>> And thanks for using Calliope. I hope it serves your needs.
>>
>> Cheers,
>> Rohit
>>
>>
>> On Mon, Jun 16, 2014 at 9:57 PM, Gerard Maas <gerard.maas@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I've been doing some testing with Calliope as a way to do batch load
>>> from Spark into Cassandra.
>>> My initial results are promising on the performance area, but worrisome
>>> on the memory footprint side.
>>>
>>> I'm generating N records of about 50 bytes each and using the UPDATE
>>> mutator to insert them into C*.   I get an OOM if my memory is below 1GB
>>> per million records, or about 50MB of raw data (without counting any
>>> RDD/structural overhead).  (See code [1])
>>>
>>> (So, to avoid confusion: I need 4GB of RAM to save 4M 50-byte records
>>> to Cassandra.)  That's an order of magnitude more than the raw data.
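
Spelling out the arithmetic behind that observation (plain Scala; the figures
are just the ones quoted above):

```scala
// Observed: ~1 GB of heap is needed per 1M records of ~50 raw bytes each.
val heapPerMillion = 1024L * 1024 * 1024          // 1 GB in bytes
val records        = 1000000L
val rawPerRecord   = 50L                          // bytes of raw data
val heapPerRecord  = heapPerMillion / records     // ~1073 bytes per record
val blowUp = heapPerRecord.toDouble / rawPerRecord // ~21x the raw size
```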
>>>
>>> I understood that Calliope builds on top of the Hadoop support of
>>> Cassandra, which builds on top of SSTables and sstableloader.
>>>
>>> I would like to know what's the memory usage factor of Calliope and what
>>> parameters could I use to control/tune that.
>>>
>>> Any experience/advice on that?
>>>
>>>  -kr, Gerard.
>>>
>>> [1] https://gist.github.com/maasg/68de6016bffe5e71b78c
>>>
>>
>>
>
