james-server-dev mailing list archives

From Eric Charles <e...@apache.org>
Subject Re: Project status and integration into James Mailbox
Date Thu, 16 Aug 2012 14:02:00 GMT
Hi Mihai,

I found the URL: https://github.com/mihaisoloi/Hbaluin.git
Seems like you are now using HBaseMiniCluster :)

I tried to install it and ran into HBASE-5711 (I'm on Ubuntu).
You can easily patch it, as I did for MAILBOX-185 (in the
mailbox-hbase module).

After rereading the README, I think I now get the idea of the
test with Lucene. I will try it.

Thx, Eric

On 08/16/2012 02:17 PM, Eric Charles wrote:
> Thx Mihai,
> Can you give us the github url?
> Do tests need an external hbase or does it run with HBaseMiniCluster?
> Btw, I tried to understand how to run the HBase Lucene index in the
> Lucene tests following your README, but I still don't get the idea... Any
> further comments?
> Thx, Eric
> On 08/16/2012 12:59 PM, Mihai Soloi wrote:
>> Hi all,
>> I've refactored the project on the apache-extras repository, and based it
>> on the James Mailbox Lucene implementation in order to parse the emails.
>> Not all the capabilities of the Lucene implementation are complete, and I
>> don't think I will finish implementing all of them by the end of GSoC,
>> but I will definitely continue to work on the project as I find it very
>> attractive.
>> Right now I am using what is basically an inverted index in an HBase
>> table.
>> The structure of the index is as follows.
>>     - mailboxID is a java.util.UUID
>>     - the fields are now Enums, and what is stored is a byte that
>>     identifies that enum field.
>>     - each of the terms in the fields is tokenized using the Lucene
>>     org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer, but
>>     some fields are not tokenized due to their nature (SENT_DATE, for
>>     example)
>> The row key is composed of all the above byte arrays concatenated, so
>> that searching can be done very fast through the HBase table, as well as
>> lookup on the specific mailbox and field in the mail. The mailID is the
>> qualifier in the static column family (only one column family) so that
>> mail IDs are found with relative ease.
>> This is for the mail document itself; the flags are stored in a single
>> row in the table (one row for each mailbox) and can be found easily by a
>> scan. Each of the rows currently has an empty value, where in the future
>> we may be able to store data related to the term frequency in the
>> document.
>> What works currently are the searches based on the text, flags, headers,
>> and all criteria. These are implemented using Filters, but I will be
>> switching to Coprocessors by next Monday because they reduce data
>> transfer over the network and allow distributed processing on each
>> region.
>> I created a local branch on the main james-mailbox and I am currently
>> integrating the MailboxSearchIndexListener into the project to see it
>> run.
>> I have been working on the github repository because it makes code
>> reviews by my mentor easier.
>> Please take a look at the project and tell me what you think. Any input
>> is appreciated.
>> Mihai
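
For readers following the thread, here is a minimal Java sketch of the row layout Mihai describes above. It is not code from the Hbaluin repository. It assumes the row key is a 16-byte mailbox UUID, then a one-byte field id, then the UTF-8 bytes of the term, in that order. The class name IndexRowSketch, the Field constants and their byte values, and the in-memory TreeMap standing in for the HBase table are all hypothetical and chosen only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.UUID;

public class IndexRowSketch {

    // Hypothetical field enum: each constant is stored as one identifying byte.
    enum Field {
        BODY((byte) 1), SUBJECT((byte) 2), SENT_DATE((byte) 3);

        final byte id;

        Field(byte id) {
            this.id = id;
        }
    }

    // Assumed row key layout: 16 bytes of mailbox UUID + 1 field byte + term bytes.
    static byte[] rowKey(UUID mailboxId, Field field, String term) {
        byte[] termBytes = term.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(16 + 1 + termBytes.length)
                .putLong(mailboxId.getMostSignificantBits())
                .putLong(mailboxId.getLeastSignificantBits())
                .put(field.id)
                .put(termBytes)
                .array();
    }

    // In-memory stand-in for the index table: row key -> mail ids.
    // In the real table the mail ids would be column qualifiers with empty values.
    // Arrays.compareUnsigned (Java 9+) sorts keys as unsigned bytes, lexicographically.
    static TreeMap<byte[], List<Long>> newTable() {
        return new TreeMap<byte[], List<Long>>(Arrays::compareUnsigned);
    }

    // Record that the given mail contains the term in the given field.
    static void index(TreeMap<byte[], List<Long>> table, UUID mailboxId,
                      Field field, String term, long mailId) {
        table.computeIfAbsent(rowKey(mailboxId, field, term), k -> new ArrayList<>())
             .add(mailId);
    }

    // Exact lookup on mailbox + field + term: a single row read.
    static List<Long> lookup(TreeMap<byte[], List<Long>> table, UUID mailboxId,
                             Field field, String term) {
        return table.getOrDefault(rowKey(mailboxId, field, term), List.of());
    }

    public static void main(String[] args) {
        UUID mailbox = UUID.randomUUID();
        TreeMap<byte[], List<Long>> table = newTable();
        index(table, mailbox, Field.SUBJECT, "hbase", 42L);
        index(table, mailbox, Field.SUBJECT, "hbase", 43L);
        index(table, mailbox, Field.BODY, "lucene", 42L);
        System.out.println(lookup(table, mailbox, Field.SUBJECT, "hbase"));
    }
}
```

Because the mailbox id and field byte come first in the key, all terms of one field in one mailbox sort next to each other, which is what would make a lookup on a specific mailbox and field cheap under this assumed layout.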

eric | http://about.echarles.net | @echarles
