james-server-dev mailing list archives

From Mihai Soloi <mihai.so...@gmail.com>
Subject GSoC: Avro Serialization over HBase
Date Mon, 11 Jun 2012 16:43:12 GMT
Hello Eugen and everybody on the list,

I've finished my exams and have also made some progress on the project. 
Lately I've been reading up on the HBase API and the Avro API and 
specification[1] so that I can get to know them better.

If you need to store Avro objects (basically arrays of bytes) in HBase, 
you need to store a schema with the data, for example in the header of 
the file, so that you can still read the data later if the schema 
changes radically over time. Of course, Avro does support some schema 
evolution: in my test code[0] I was able to extend an existing schema 
and show that it stays backward compatible. I followed Boris 
Lublinsky's article[4] on using Avro to get more familiar with it.
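
A minimal sketch of the schema-in-header idea, using only the Java 
standard library. The class name SchemaFramedValue and the byte layout 
(a 4-byte schema length, the schema JSON in UTF-8, then the payload) are 
assumptions made for illustration; this is not Avro's container file 
format, and the payload is treated as an opaque byte array that some 
Avro encoder produced elsewhere.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: keeps the writer schema next to the encoded bytes,
// so a single HBase cell value is self-describing.
final class SchemaFramedValue {
    final String schemaJson; // writer schema as JSON text
    final byte[] payload;    // Avro-encoded body, opaque to this class

    SchemaFramedValue(String schemaJson, byte[] payload) {
        this.schemaJson = schemaJson;
        this.payload = payload;
    }

    // Assumed layout: [int schemaLength][schema JSON, UTF-8][payload]
    byte[] toBytes() {
        byte[] schemaBytes = schemaJson.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + schemaBytes.length + payload.length);
        buf.putInt(schemaBytes.length);
        buf.put(schemaBytes);
        buf.put(payload);
        return buf.array();
    }

    // Splits a stored cell value back into schema text and payload.
    static SchemaFramedValue fromBytes(byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        byte[] schemaBytes = new byte[buf.getInt()];
        buf.get(schemaBytes);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new SchemaFramedValue(
                new String(schemaBytes, StandardCharsets.UTF_8), payload);
    }
}
```

The value returned by toBytes() is what would go into an HBase Put. On 
the read side, fromBytes() gives back the writer schema, which can then 
be handed to Avro together with the current reader schema. The cost is 
that the full schema text is repeated in every cell.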

I do want to store my data in HBase through Avro (because of the 
smaller memory footprint, the structured format, and the HBase 
integration). I see that the package "org.apache.hadoop.hbase.avro" 
contains a class, AvroServer, which starts up a server through which all 
sorts of clients can interact with the data store, along with generated 
classes (e.g. AColumnValues, APut, AGet, etc.). As far as I can tell, 
these classes translate requests to the server into HBase Puts and Gets 
with the help of AvroUtils, but I don't know if this is the way to go.

Another thing I've been considering is using Sam Pullara's HAvroBase 
implementation[2], whose code is on GitHub[3]. Sam proposes storing only 
a hash code of the schema with the data and keeping the schemas 
themselves in a separate place. HAvroBase does much more than I need, as 
it also supports MySQL, MongoDB, etc., so I could use only the storage 
part for the Lucene IndexWriter.
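
A rough sketch of the "hash with the data, schemas stored separately" 
idea, again with only the Java standard library. The class 
SchemaRegistrySketch, the in-memory map standing in for a separate schema 
table, and the choice of SHA-256 are all assumptions for illustration; 
this is not HAvroBase's code, and HAvroBase may hash and store schemas 
differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a separate schema table.
final class SchemaRegistrySketch {
    private final Map<String, String> schemasByHash = new HashMap<>();

    // Stores the schema once and returns the short key that each
    // data cell would carry instead of the full schema text.
    String register(String schemaJson) {
        String hash = sha256Hex(schemaJson);
        schemasByHash.put(hash, schemaJson);
        return hash;
    }

    // Resolves a key read from a data cell; returns null if unknown.
    String lookup(String hash) {
        return schemasByHash.get(hash);
    }

    // Hex-encoded SHA-256 of the schema text (assumed hash function).
    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

A writer would call register() once per schema version and store the 
returned key alongside each Avro-encoded value. A reader would call 
lookup() with that key to recover the writer schema. Note that the hash 
is taken over the raw JSON text, so two schemas that differ only in 
whitespace get different keys.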

Another way to go is to assume that the object schemas will never change 
and to store the data just as it is. This is dangerous because, if a 
schema does change, we would have to change code instead of just a 
JSON schema.

[1] http://avro.apache.org/docs/current/spec.html
[3] https://github.com/spullara/havrobase
