gora-dev mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: accumulo backend for gora
Date Fri, 02 Dec 2011 15:36:03 GMT
On Thu, Dec 1, 2011 at 5:16 PM, Enis Söztutar <enis.soz@gmail.com> wrote:
> Wow, this is great news. If you upload the patch, I am sure there will be
> interest for review and we can add it to the code base.
> Coming to the array storage, one of the strengths of Gora is that it
> delegates the mapping to the data store, since each one has its own data
> model. In HBase, and I believe in Accumulo as well, you can store arrays in
> at least three ways:
>  (1) serialize the array and store it in one cell
>  - Adding or deleting items means reading and reserializing the whole array.
> This is perfect for small, mostly read-only arrays.
>  (2) serialize each item in its own cell, with all cells sharing the same
> column family and having consecutive column numbers. Like family:0 -> arr[0],
> family:1 -> arr[1], ...
>  (3) serialize each item as a column qualifier within the same column family,
> with empty cell values. Like family:arr[0] -> 'dummy', family:arr[1], ... .
>  - The array elements will be stored in sorted order.
> So, the question is what to choose? It turns out that depending on how you
> want to access data and the characteristics of the data (like read-only,
> size, etc), you should be able to choose either of them for your fields.
> And depending on how you do the data layout in your storage, the semantics
> and/or the performance for the use case you mentioned can change. In gora-hbase,
> we have only option (2), but ideally gora-hbase and gora-accumulo should be
> able to work with all 3. And if you think about the semantics of deleting
> an item from an array, it gets a little more involved. For example, in
> gora-hbase, your use case will probably print d4,d5,d3 (since d1 and d2
> will be overridden, but d3 won't be deleted). I think the correct
> semantics there should be to print only d4 and d5. If you go with (3),
> however, I think the correct semantics is to print d1,d2,d3,d4,d5.
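
To make the three layouts concrete, here is a minimal Java sketch that replays the put/put/get scenario from the original question against each of them. It assumes a toy model in which one row is a sorted map from column qualifier to cell value. The class and method names (ArrayLayoutSketch, putSerialized, putIndexed, putAsQualifiers, and the matching getters) are hypothetical and are not part of the Gora, HBase, or Accumulo APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model: one row is a sorted map from column qualifier to cell value.
// This is an illustration only, not the Gora, HBase, or Accumulo API.
class ArrayLayoutSketch {

    // Option (1): the whole array lives in one cell, so a put replaces it.
    // Simplification: items are joined with ',' and must not contain commas.
    static void putSerialized(Map<String, String> row, List<String> items) {
        row.put("data", String.join(",", items));
    }

    static List<String> getSerialized(Map<String, String> row) {
        String cell = row.get("data");
        if (cell == null || cell.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(List.of(cell.split(",")));
    }

    // Option (2): one cell per index. A put overwrites indexes 0..n-1 and
    // leaves any higher indexes from an earlier, longer array in place.
    static void putIndexed(Map<String, String> row, List<String> items) {
        for (int i = 0; i < items.size(); i++) {
            // Zero-padding keeps lexicographic order equal to index order.
            row.put(String.format("data:%010d", i), items.get(i));
        }
    }

    static List<String> getIndexed(Map<String, String> row) {
        return new ArrayList<>(row.values());
    }

    // Option (3): each element is itself a qualifier with an empty value,
    // so puts accumulate and elements come back in sorted order.
    static void putAsQualifiers(Map<String, String> row, List<String> items) {
        for (String item : items) {
            row.put("data:" + item, "");
        }
    }

    static List<String> getAsQualifiers(Map<String, String> row) {
        List<String> out = new ArrayList<>();
        for (String qualifier : row.keySet()) {
            out.add(qualifier.substring("data:".length()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = List.of("d1", "d2", "d3");
        List<String> second = List.of("d4", "d5");

        Map<String, String> row1 = new TreeMap<>();
        putSerialized(row1, first);
        putSerialized(row1, second);
        System.out.println("option 1: " + getSerialized(row1));

        Map<String, String> row2 = new TreeMap<>();
        putIndexed(row2, first);
        putIndexed(row2, second);
        System.out.println("option 2: " + getIndexed(row2));

        Map<String, String> row3 = new TreeMap<>();
        putAsQualifiers(row3, first);
        putAsQualifiers(row3, second);
        System.out.println("option 3: " + getAsQualifiers(row3));
    }
}
```

Under these assumptions, option (1) returns d4,d5, option (2) returns d4,d5,d3, and option (3) returns d1,d2,d3,d4,d5, which matches the outcomes described in the quoted message.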

Looking at the current HBase implementation, I thought it might yield
d4,d5,d3, but I was not sure. I think with option 2 you could also
store a length or an end-of-array marker, and then just d4 and d5 would
be returned. I was thinking of doing this for the Accumulo datastore,
but then its behavior would differ from the HBase store. So what
should the behavior be? Should different Gora stores have the same
behavior even if they have different implementations? That seems
good for the Gora user, since it makes it easier to switch between
implementations. The behavior could be specified in the interfaces
and enforced via tests. It seems there are already some tests that
check for some behaviors across implementations.
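
The length-marker idea and a behavior check shared across stores could be sketched roughly as follows. Everything here is an assumption made for illustration: ArrayStore is a hypothetical stand-in for a Gora data store, IndexedStore models option 2 as it is described above, MarkedStore adds a stored length, and replacesArrayOnPut is the kind of check a shared test could run against every implementation. None of these names come from Gora itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LengthMarkerSketch {

    // Hypothetical minimal store interface, standing in for a Gora data store.
    interface ArrayStore {
        void put(long key, List<String> data);
        List<String> get(long key);
    }

    // Option 2 without a marker: one cell per index, stale cells survive.
    static class IndexedStore implements ArrayStore {
        protected final Map<Long, TreeMap<Integer, String>> rows = new HashMap<>();

        @Override
        public void put(long key, List<String> data) {
            TreeMap<Integer, String> row = rows.computeIfAbsent(key, k -> new TreeMap<>());
            for (int i = 0; i < data.size(); i++) {
                row.put(i, data.get(i));
            }
        }

        @Override
        public List<String> get(long key) {
            TreeMap<Integer, String> row = rows.get(key);
            return row == null ? new ArrayList<>() : new ArrayList<>(row.values());
        }
    }

    // Option 2 plus a stored length. In a real store the length would be one
    // more cell in the same row; a separate map keeps the sketch short.
    static class MarkedStore extends IndexedStore {
        private final Map<Long, Integer> lengths = new HashMap<>();

        @Override
        public void put(long key, List<String> data) {
            super.put(key, data);
            lengths.put(key, data.size());
        }

        @Override
        public List<String> get(long key) {
            List<String> all = super.get(key);
            int length = Math.min(lengths.getOrDefault(key, 0), all.size());
            // Cells at or beyond the recorded length are stale and ignored.
            return new ArrayList<>(all.subList(0, length));
        }
    }

    // A behavior check that could be run against every store implementation.
    static boolean replacesArrayOnPut(ArrayStore store) {
        store.put(42L, List.of("d1", "d2", "d3"));
        store.put(42L, List.of("d4", "d5"));
        return store.get(42L).equals(List.of("d4", "d5"));
    }

    public static void main(String[] args) {
        System.out.println("indexed store: " + replacesArrayOnPut(new IndexedStore()));
        System.out.println("marked store:  " + replacesArrayOnPut(new MarkedStore()));
    }
}
```

In this sketch the plain indexed store fails the check because d3 is still present after the second put, while the store with a length marker passes. Stale cells are only hidden on read here; a real implementation might also delete them.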

> So, as I said, the "correct" semantics depends on the data model, and gora
> should be flexible enough so that we can utilize different models suitable
> for the job.
> Thanks,
> Enis
> On Thu, Dec 1, 2011 at 1:07 PM, Keith Turner <keith@deenlo.com> wrote:
>> I have been writing an Accumulo [1] backend for Gora. I am pretty
>> far along, but not finished. When I am finished, I plan to post a
>> patch on a JIRA ticket. If anyone would like to review it, let me
>> know.
>> I have a question about storing arrays.  I am wondering what the
>> expected behavior is given the following?
>> {
>>   "type": "record",
>>   "name": "Foo",
>>   "namespace": "test",
>>   "fields": [
>>     {"name": "data", "type": {"type": "array", "items": "string"}}
>>   ]
>> }
>> Foo foo1 = new test.Foo();
>> foo1.addToData("d1");
>> foo1.addToData("d2");
>> foo1.addToData("d3");
>> datastore.put(42L, foo1);
>> datastore.flush();
>> Foo foo2 = new test.Foo();
>> foo2.addToData("d4");
>> foo2.addToData("d5");
>> datastore.put(42L, foo2);  // same key as foo1
>> datastore.flush();
>> Foo foo3 = datastore.get(42L);
>> // What would you expect this to print for the data array? d4,d5?
>> System.out.println(foo3);
>> [1]: http://incubator.apache.org/accumulo
