gora-dev mailing list archives

From Kazuomi Kashii <kazu...@kashii.net>
Subject Re: [DISCUSS] gora-cassandra serialization spec
Date Mon, 09 Jul 2012 16:50:14 GMT
Alex, thanks for your comments on the gora-cassandra serialization spec.

Actually, I thought the same way you do until I read the source code
of gora-cassandra.
There I noticed that it uses Hector's serializers rather than Avro's.
So, for the GORA-142 patches, I developed new serializers that extend
Hector's.

Here are some other reasons why I prefer Hector's serializers to Avro's.

1) Nutch 2.0, which is based on Gora 0.2, has already been released,
so it may be confusing and incompatible if Gora 0.3 is released with
Avro's serialization.

2) Cassandra supports comparator (ordering) and validation classes, and
the standard (bundled) ones are not compatible with Avro's serialization.

For instance, STRING "abc" is:
61 62 63 in Hector, and
06 61 62 63 in Avro (zig-zag varint length, then the UTF-8 bytes);
and INT 3 is:
00 00 00 03 in Hector (fixed length, 4 bytes for INT), and
06 in Avro (variable-length zig-zag).
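
As a rough illustration of the two layouts compared here, the following
Python sketch mimics a fixed-width, big-endian integer and a raw UTF-8
string on one side, and the Avro specification's zig-zag varint rules on
the other. The function names are hypothetical and this is not the Hector
or Avro code itself. The byte values in the comments are what the sketch
computes. The last two lines show why a byte-wise comparator does not
sort Avro-encoded integers numerically.

```python
import struct

def fixed_int(n: int) -> bytes:
    """4-byte big-endian signed int (fixed-width layout)."""
    return struct.pack(">i", n)

def raw_string(s: str) -> bytes:
    """UTF-8 bytes with no length prefix."""
    return s.encode("utf-8")

def avro_long(n: int) -> bytes:
    """Avro int/long: zig-zag mapping, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: 0, -1, 1, -2 -> 0, 1, 2, 3
    out = bytearray()
    while True:
        low = z & 0x7F
        z >>= 7
        if z:
            out.append(low | 0x80)    # more bytes follow
        else:
            out.append(low)
            return bytes(out)

def avro_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return avro_long(len(data)) + data

print(fixed_int(3).hex(" "))        # 00 00 00 03
print(avro_long(3).hex(" "))        # 06
print(raw_string("abc").hex(" "))   # 61 62 63
print(avro_string("abc").hex(" "))  # 06 61 62 63

# Byte-wise ordering: preserved for non-negative fixed-width ints,
# but not for zig-zag varints (65 -> 82 01, 128 -> 80 02).
print(fixed_int(65) < fixed_int(128))   # True
print(avro_long(65) < avro_long(128))   # False
```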

Regarding the Gora Hadoop job parts of your comment, I have not checked
the details yet,
but Nutch 2.0 seems to work with the current stable version,
gora-cassandra 0.2, which uses Hector's serializers.
Also, I cannot find any Hadoop-related code in gora-cassandra, so I
assume that it is handled in gora-core.
I will check that part, but if I am wrong, please correct me.

In summary,
* gora-cassandra 0.2 uses Hector's serialization; and
* gora-cassandra 0.3 would not be compatible with 0.2 if Avro's
serialization were introduced in 0.3,

so my recommendation is to keep using Hector's serializers in
gora-cassandra 0.3 and later.

Again, I am not so familiar with other parts of Gora, such as gora-hbase
or the Hadoop-related code,
so it would be very helpful to hear other perspectives on this issue.

Regards,
-Kaz


On 7/8/12 1:38 PM, Alexis wrote:
> Hi,
>
> Thanks for improving the serialization of Avro types into Cassandra. I
> have not looked in the code how it's currently done for complex Avro
> types. If it now works for arrays, how does it work for complex hashes?
> Maybe a page in the wiki would help...
>
> I think the mistake I made when I first wrote the module was to map a
> string to a Cassandra column:
>
>> get["myRowKey"]["myStringName"];
> would return the actual string. Then in the same spirit I extended the
> storage of a hash with string values into a super column. Mapping
> tangible objects to Cassandra available datatypes is nice to have but
> in the end it's useless, since we need to serialize back to Avro in
> the Hadoop job anyway. Besides, it adds overhead.
>
> This has obvious limitations since we can not store complex types
> (starting with arrays) like in MongoDB, such as multiple level (>2)
> nested hash, like { "a" : { "b" : { "c" : "d" }}}.
>
> Gora Hadoop jobs rely on Avro "serialization protocol" to manipulate
> data. I was thinking we could simply store the serialized Avro object
> in binary format into a Standard Cassandra column.
>
> Then we would just need to put together an extension of cassandra-cli
> that deserializes the raw content into a human readable format, so
> that people can look at what's being stored with the get and get_slice
> Thrift calls.
>
> To summarize:
> - Stick with Avro to store object in binary format in Cassandra column
> - No super column families
> - A client that displays columns in Human readable way: Avro
> deserialization then some pretty print of the object if it's of
> complex type
>
> Alexis
>
> On Fri, Jul 6, 2012 at 11:40 AM, Kazuomi Kashii <kazuomi@kashii.net> wrote:
>> I wrote GORA-142-v3.patch that supports several new types of
>> serialization for gora-cassandra.
>> https://issues.apache.org/jira/browse/GORA-142
>>
>> Since I am not familiar with other implementations such as gora-hbase,
>> I'd like to hear your opinions on the serialization spec, especially for
>> variable-length arrays.
>>
>> Gora uses Avro for schema definition,
>> but I noticed that gora-cassandra uses its own serialization based on Hector.
>> For instance, serialization of integer is totally different between Avro
>> (zig-zag) and gora-cassandra (Hector).
>> Considering Cassandra's pre-defined validation classes and comparators,
>> I think Hector's serialization is better than Avro's one at gora-cassandra,
>> so, my implementation of GORA-142 patch is based on Hector's serializers.
>>
>> For ARRAY support, I implemented GORA-138 patch first with Super CF in
>> the same way as RECORD or MAP.
>> https://issues.apache.org/jira/browse/GORA-138
>> As Enis mentioned at GORA-138, we may want another implementation with
>> single column for reasonably short arrays,
>> so GORA-142 patch supports ARRAY with single column implementation.
>>
>> For fixed length array, single column can store multiple elements just
>> adding them sequentially.
>> However, for variable length array such as STRING or BYTES,
>> it is impossible to retrieve each value if just values are stored
>> sequentially,
>> so GORA-142 patch implementation contains the size of element as INTEGER
>> before each actual value.
>> For instance, ["ABCDE", "abc", "1234"] is stored as
>> 00 00 00 05 41 42 43 44 45 00 00 00 03 61 62 63 00 00 00 04 31 32 33 34
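
A minimal Python sketch of the length-prefixed layout quoted above,
assuming UTF-8 elements and a 4-byte big-endian size before each value.
The function names are hypothetical and this is not the GORA-142 code
itself. It only shows how the size prefix makes each element recoverable.

```python
import struct

def encode_string_array(values):
    """Write each element as a 4-byte big-endian size plus its UTF-8 bytes."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        out += struct.pack(">i", len(data))
        out += data
    return bytes(out)

def decode_string_array(buf):
    """Walk the buffer, reading a size and then that many bytes each time."""
    values, pos = [], 0
    while pos < len(buf):
        (size,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        values.append(buf[pos:pos + size].decode("utf-8"))
        pos += size
    return values

encoded = encode_string_array(["ABCDE", "abc", "1234"])
print(encoded.hex(" "))
print(decode_string_array(encoded))
```

Without the size prefix the decoder would see only the concatenated bytes
of "ABCDEabc1234" and could not tell where one element ends.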
>>
>> If there is no objection, I will commit the GORA-142 patch with the above
>> serialization spec once it is ready.
>>
>> Regards,
>> -Kaz
>>
>>
>> On 7/6/12 11:05 AM, Kazuomi Kashii wrote:
>>> +1 for new release
>>>
>>> 1) I committed GoraCompiler.java for GORA-143-v2.patch last night.
>>>   Since this is my first svn commit, I think it should be reviewed.
>>>
>>> 2) I have not committed GORA-142 patch for gora-cassandra yet.
>>>   Before that, I'd like to ask the team about serialization spec, so I
>>> will send another e-mail on this matter.
>>>   I am OK for new release without GORA-142 patch.
>>>   Depending on the discussion, I will commit GORA-142 before or after
>>> new release.
>>>
>>> -Kaz
>>>
>>>
>>> On 7/6/12 10:29 AM, Henry Saputra wrote:
>>>> +1 for new release, there are a lot of fixes for Cassandra support so
>>>> it should be a good time for a new release.
>>>>
>>>> 0.3 or 0.2.1?
>>>>
>>>> - Henry
>>>>
>>>> On Thu, Jul 5, 2012 at 11:06 PM, Mattmann, Chris A (388J)
>>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>>> +1 to roll release and I'll also throw my name into the hat to release it.
>>>>>
>>>>> Let me know.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> On Jul 5, 2012, at 12:30 PM, Lewis John Mcgibbney wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> As the GSoC project is moving along nicely and it's been some 3 or so
>>>>>> months since the 0.2 release I was thinking about drumming up support
>>>>>> for another (possibly even 0.2.1) release?
>>>>>>
>>>>>> We have some 15 issues which have been addressed in the development
>>>>>> drive since 0.2 was released and I for one have not had quite as much
>>>>>> time as I would have liked recently to put serious time into Gora.
>>>>>>
>>>>>> What do you guys think? I am more than happy to work as RM again if required.
>>>>>>
>>>>>> Thank you in advance
>>>>>>
>>>>>> Lewis
>>>>>>
>>>>>> --
>>>>>> Lewis
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Senior Computer Scientist
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 171-266B, Mailstop: 171-246
>>>>> Email: chris.a.mattmann@nasa.gov
>>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>>> Phone: +1 (818) 354-8810
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>
