james-server-dev mailing list archives

From "Norman Maurer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAILBOX-44) [gsoc2011] Design and implement a distributed mailbox using Hadoop
Date Wed, 15 Jun 2011 05:51:48 GMT

    [ https://issues.apache.org/jira/browse/MAILBOX-44?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049618#comment-13049618 ]

Norman Maurer commented on MAILBOX-44:


First off, welcome :)

I wrote a few of the other mailbox implementations in JAMES, so maybe I can answer your questions
(concerns) ;) I also wrote a prototype for a mailbox on top of Cassandra, which is not too different
in terms of "limitations".

So here we go:

I think putting all the mail for a mailbox in one row will not work, as really big mailboxes
are quite common these days. It would just limit the distribution a lot (as you already pointed
out). So let me try to explain how I did the schema for Cassandra; maybe it also fits HBase
(I have not had the time to dig deeper into it).

* one row for the mailbox metadata (mailboxId, uidvalidity, namespace, username ...).
* one row for the message metadata (mailboxId, uid, size, headers, flags, messagecontentId ...).

* one row per message content, where I split the message content into 1 MB parts and put each
"raw" byte[] in a new column. This makes sure we don't get too big a column (not sure if this
is also needed for HBase; in Cassandra big columns are a problem).
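
To make the third row type concrete, here is a minimal sketch of splitting a message into
1 MB parts and reading it back. It is not the prototype's code: the column names
("part-00000", ...) and the function names are assumptions made up for illustration, and a
plain dict stands in for the columns of one content row.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MB parts, as described above


def split_content(raw, chunk_size=CHUNK_SIZE):
    """Return a dict of hypothetical column names to chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    columns = {}
    for index, offset in enumerate(range(0, len(raw), chunk_size)):
        # Zero-padded names keep the columns in order when sorted by name.
        columns["part-%05d" % index] = raw[offset:offset + chunk_size]
    return columns


def join_content(columns):
    """Reassemble the raw message by concatenating the columns in name order."""
    return b"".join(columns[name] for name in sorted(columns))
```

Reading a message back then means fetching one row and concatenating its columns in order,
so no single column grows beyond the chosen part size.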

For queries there are the following:
* retrieve all messages which have the recent flag set
* retrieve all messages which have the sent flag set
* retrieve all messages with uid <=> X
* retrieve all messages with the deleted flag set
* retrieve all mailboxes with name like '%X%'
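
As a rough illustration of how the message queries above map onto a key-sorted store, the
sketch below keeps the metadata rows in memory, keyed by (mailboxId, uid). The class and
method names are hypothetical and this is not how the Cassandra prototype is written; it only
shows that a uid comparison becomes a range scan over sorted keys, while a flag query without
an index has to look at every row of the mailbox.

```python
from bisect import bisect_left, bisect_right, insort


class MetadataStore:
    """Hypothetical in-memory stand-in for the message metadata rows."""

    def __init__(self):
        self._keys = []  # sorted list of (mailbox_id, uid)
        self._rows = {}  # (mailbox_id, uid) -> metadata dict

    def put(self, mailbox_id, uid, flags, size=0):
        key = (mailbox_id, uid)
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = {"uid": uid, "flags": set(flags), "size": size}

    def uid_range(self, mailbox_id, low, high):
        """Messages with low <= uid <= high, found by a range scan over the sorted keys."""
        start = bisect_left(self._keys, (mailbox_id, low))
        end = bisect_right(self._keys, (mailbox_id, high))
        return [self._rows[k] for k in self._keys[start:end]]

    def with_flag(self, mailbox_id, flag):
        """No index on flags here, so scan all rows and filter."""
        return [self._rows[k] for k in self._keys
                if k[0] == mailbox_id and flag in self._rows[k]["flags"]]
```

For example, `store.uid_range("inbox", 2, 3)` touches only the matching keys, whereas
`store.with_flag("inbox", "\\Deleted")` walks the whole key list.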

IMAP also allows clients to build their own search queries, which is really problematic with NoSQL
stores, or even with SQL stores. It basically allows the user to do any kind of filtering, which
in fact just sucks when you don't have the indexes set. So we have a Lucene index for that
atm. I plan to write one in SOLR too.
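
To show why a separate index helps with free-form searches, here is a toy inverted index.
It is only a sketch of the general idea; it is not Lucene and not the James index code, and
the class and method names are made up for the example.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the uids of messages containing it."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of uids

    def add(self, uid, text):
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(uid)

    def search(self, query):
        """Return the uids of messages that contain every term of the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A search then only touches the posting sets of the query terms instead of scanning every
stored message.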

Threading is not supported atm but is on my todo list.

Hope this helps, just ask if you need more info.

> [gsoc2011] Design and implement a distributed mailbox using Hadoop
> ------------------------------------------------------------------
>                 Key: MAILBOX-44
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-44
>             Project: James Mailbox
>          Issue Type: New Feature
>            Reporter: Eric Charles
>            Assignee: Norman Maurer
>              Labels: gsoc2011
>             Fix For: 0.3
> Context: The mailbox subproject (http://james.apache.org/mailbox/) supports maildir,
SQL database (via JPA) and Java Content Repository (JCR) as technology for mail storage. This
flexibility is achieved thanks to an API design that abstracts mail storage from the mail protocols.
> Task: We need to implement mailbox storage as a distributed system on top of Hadoop HDFS.
The James mailbox API will be used. A first step is to design how to interact with Hadoop
(native api, gora incubator at apache,...) and deal with specific performance questions related
to mail loading/parsing in a distributed system (use map/reduce or not, use existing local
lucene indexes for search,...). The second step is to implement the HDFS mailbox (the maildir
mailbox is similar because it stores mails as files and can be an inspiration). A single
James server will still be deployed because we don't have any distributed UID generation.
> Mentor: eric at apache dot org
> Complexity: medium 

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


