james-server-dev mailing list archives

From Eric Charles <e...@apache.org>
Subject Re: GSoC: Avro Serialization over HBase
Date Tue, 12 Jun 2012 08:30:34 GMT
Hi Mihai,

Glad to hear your exams are over (I hope they went fine) :)

As Ioan said, Avro serialization in HBase will be deprecated in favor of 
Protobuf (if I understand correctly...).

I also like Avro because it gives you a serialization and storage format 
in one box, but is that what we want? The key point here is rather 
efficient access to the persisted data.

There have been a few attempts so far to marry HBase and Lucene (see 
[1], [2], [3] and [4] for example; see also [5] for a more recent article).

The questions I am wondering about:

1. Will you focus on a 'generic' solution (reusable outside James), or 
on a very specific one tuned/optimized only for James mailbox needs?

2. What strategy will you take (custom Directory or custom 
IndexReader/Writer, usage of Coprocessor or not...)?
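On the custom-Directory option: a recurring design in the projects linked below is to map each Lucene index file onto HBase rows, splitting the file into fixed-size blocks so a read can seek without fetching the whole file. A minimal sketch of that key layout, with a plain HashMap standing in for the HBase table (all class and method names here are hypothetical, not taken from any of the linked projects):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: store Lucene index files as fixed-size blocks keyed by
// "<fileName>/<blockNumber>", the way an HBase-backed Directory might
// lay out its rows. A HashMap stands in for the HBase table.
public class BlockStoreSketch {
    static final int BLOCK_SIZE = 4096; // bytes per row/block (assumption)
    private final Map<String, byte[]> table = new HashMap<>();

    // Row key for a given block of a given index file.
    static String rowKey(String fileName, long blockNumber) {
        return String.format("%s/%08d", fileName, blockNumber);
    }

    // Write a file by slicing it into BLOCK_SIZE chunks.
    public void writeFile(String fileName, byte[] contents) {
        for (long block = 0; block * BLOCK_SIZE < contents.length; block++) {
            int from = (int) (block * BLOCK_SIZE);
            int to = Math.min(from + BLOCK_SIZE, contents.length);
            byte[] chunk = new byte[to - from];
            System.arraycopy(contents, from, chunk, 0, to - from);
            table.put(rowKey(fileName, block), chunk);
        }
    }

    // Random-access read: fetch only the block containing `position`.
    public byte readByteAt(String fileName, long position) {
        byte[] block = table.get(rowKey(fileName, position / BLOCK_SIZE));
        return block[(int) (position % BLOCK_SIZE)];
    }

    public static void main(String[] args) {
        BlockStoreSketch store = new BlockStoreSketch();
        byte[] data = new byte[10000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 128);
        store.writeFile("_0.cfs", data);
        // Reading byte 5000 touches only block 1, not the whole file.
        System.out.println(store.readByteAt("_0.cfs", 5000)); // prints 8
    }
}
```

A real implementation would subclass Lucene's Directory/IndexInput abstractions; a coprocessor-based design would instead push part of the search work down to the region servers.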

It would be good if you sketched the answers on MAILBOX-173 with a 
little architecture diagram (and also copied into MAILBOX-173 the 
useful information I read at 
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/mihaisoloi/1).

Thx,
Eric

[1] https://github.com/akkumar/hbasene
[2] https://github.com/thkoch2001/lucehbase
[3] https://github.com/jasonrutherglen/HBASE-SEARCH
[4] https://github.com/jasonrutherglen/LUCENE-FOR-HBASE
[5] http://www.infoq.com/articles/LuceneHbase

On 06/11/2012 08:01 PM, Mihai Soloi wrote:
> On 11.06.2012 20:49, Ioan Eugen Stan wrote:
>> Hi Mihai,
>>
>> After a quick look...
>>
>> 2012/6/11 Mihai Soloi <mihai.soloi@gmail.com>:
>>> Hello Eugen and everybody on the list,
>>>
>>> I've completed my exams but I've also done some work on the project,
>>> lately
>>> I've been reading up on the HBase API and AVRO API specifications[1]
>>> so that
>>> I can get to know them better.
>>>
>>> If you need to store Avro objects (basically arrays of bytes) in
>>> HBase, then you need to store a schema with the data, for example in
>>> the header of the file, so that you can still read it later if the
>>> schema changes radically over time. Of course, Avro does support some
>>> degree of schema evolution; if you look at my test code[0] you'll see
>>> that I was able to extend an existing schema and prove that it works
>>> with backward compatibility. I followed Boris Lublinsky's article[4]
>>> on using Avro to get more familiar with it.
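The backward-compatible extension described above relies on Avro schema resolution: a new (reader) schema can add fields to an old (writer) schema as long as each new field declares a default value. A toy illustration (record and field names are made up, not taken from the linked test code) — first a v1 writer schema, then a v2 reader schema that can still decode v1 data:

```json
{"type": "record", "name": "Mail", "fields": [
    {"name": "subject", "type": "string"}
]}

{"type": "record", "name": "Mail", "fields": [
    {"name": "subject", "type": "string"},
    {"name": "flags", "type": {"type": "array", "items": "string"},
     "default": []}
]}
```

When a field is removed instead, readers still using the old schema can decode the new data only if the field they expect carries a default.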
>> Great. It's nice to experiment.
>>
>>> I've encountered a situation in which I do want to store my data
>>> through Avro on HBase (due to lower memory use, a structured format,
>>> and HBase integration), and I see that there is a class in
>>> "org.apache.hadoop.hbase.avro" called AvroServer which basically
>>> starts up a server through which all sorts of clients can interact
>>> with the data store, along with generated classes (e.g.
>>> AColumnValues, APut, AGet, etc.). As far as I can tell, these
>>> classes are used to translate requests to the server into HBase Puts
>>> and Gets, also using AvroUtils, but I don't know if this is the way
>>> to go.
>> AvroServer is deprecated in 0.94 and scheduled to be removed in 0.96
>> (https://issues.apache.org/jira/browse/HBASE-5948). AvroServer
>> handles the RPC service to use Avro instead of Writables.
>>
>> Serialization = save an object to disk/file/network and load it into
>> memory again in the same form (deserialization). We need to
>> serialize/de-serialize a Lucene index into HBase in an efficient way
>> (we care about indexing speed, search speed, and how much disk/RAM it
>> is going to cost us).
>>
>> Please read
>> http://stackoverflow.com/questions/2486721/what-is-a-data-serialization-system
>>
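The round-trip described above is just object -> bytes -> object. A stdlib-only illustration of the idea (real code would use Avro's generated readers/writers rather than a hand-rolled byte layout):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal round-trip: write a record's fields to bytes (as you would
// into an HBase cell), then read the same fields back. The byte layout
// here (UTF string followed by a long) is the implicit "schema" --
// exactly what a system like Avro makes explicit and evolvable.
public class RoundTripSketch {
    static byte[] serialize(String subject, long uid) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(subject);
            out.writeLong(uid);
        }
        return buf.toByteArray();
    }

    static String[] deserialize(byte[] bytes) throws IOException {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes))) {
            return new String[] { in.readUTF(), Long.toString(in.readLong()) };
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] cell = serialize("Re: GSoC", 42L);
        String[] back = deserialize(cell);
        System.out.println(back[0] + " / " + back[1]); // Re: GSoC / 42
    }
}
```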
>>
>>> Another thing I've been considering is using Sam Pullara's HAvroBase
>>> implementation[2], with code on GitHub[3]. Sam proposes storing only
>>> a hash of the schema with the data, keeping the schemas themselves
>>> stored separately. HAvroBase is much more than I would need, as it
>>> also supports MySQL, MongoDB, etc., so I could use only the storage
>>> part for the Lucene IndexWriter.
>> I think HAvroBase does a bit more than we need. It's a bit generic,
>> and I think we can do without adding it as a dependency. The Lucene
>> index format is not likely to change that much.
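The schema-hashing trick discussed above can be sketched without any dependency: fingerprint the schema text, store the full schemas once in a registry, and prefix each record's bytes with the fingerprint so a reader can look up the right schema. (Class and method names below are illustrative; HAvroBase's actual code differs.)

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of schema fingerprinting: each stored value carries only a
// 16-byte MD5 of its schema; the full schema text lives once in a
// registry (in HAvroBase, a separate schema table).
public class SchemaRegistrySketch {
    private final Map<String, String> registry = new HashMap<>(); // hex -> schema JSON

    static byte[] fingerprint(String schemaJson) throws Exception {
        return MessageDigest.getInstance("MD5")
                .digest(schemaJson.getBytes(StandardCharsets.UTF_8));
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Register a schema; returns the 16-byte prefix to store with records.
    public byte[] register(String schemaJson) throws Exception {
        byte[] fp = fingerprint(schemaJson);
        registry.put(hex(fp), schemaJson);
        return fp;
    }

    // Given a stored cell (fingerprint + payload), recover the schema.
    public String schemaFor(byte[] cell) {
        return registry.get(hex(Arrays.copyOfRange(cell, 0, 16)));
    }

    public static void main(String[] args) throws Exception {
        SchemaRegistrySketch reg = new SchemaRegistrySketch();
        String schema = "{\"type\":\"record\",\"name\":\"Mail\"}";
        byte[] fp = reg.register(schema);
        byte[] payload = "...encoded bytes...".getBytes(StandardCharsets.UTF_8);
        byte[] cell = new byte[fp.length + payload.length];
        System.arraycopy(fp, 0, cell, 0, fp.length);
        System.arraycopy(payload, 0, cell, fp.length, payload.length);
        System.out.println(reg.schemaFor(cell).equals(schema)); // prints true
    }
}
```

The per-record cost is a fixed 16 bytes instead of the full schema, which is what makes storing a schema "with" every value affordable.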
>>
>>> Another way to go is to assume that the object schemas will never
>>> change and just store the data as-is. This is dangerous because if
>>> there is a change, we would have to change code instead of just a
>>> JSON schema.
>> The way Lucene stores the postings list is pretty standard and will
>> probably not change that much. I think using Avro is enough.
>>
>>> [0]
>>> http://code.google.com/a/apache-extras.org/p/mailbox-lucene-index-hbase/source/browse/LuceneTest/src/test/java/org/apache/james/mailbox/lucene/avro/AvroInheritanceTest.java
>>>
>>> [1] http://avro.apache.org/docs/current/spec.html
>>> [2]
>>> http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/
>>>
>>> [3] https://github.com/spullara/havrobase
>>> [4]
>>> http://www.infoq.com/articles/ApacheAvro
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
> For additional commands, e-mail: server-dev-help@james.apache.org
>

-- 
eric | http://about.echarles.net | @echarles


