james-server-dev mailing list archives

From "Ioan Eugen Stan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAILBOX-170) Store mailboxes in HDFS SequenceFile
Date Thu, 01 Mar 2012 12:52:00 GMT

    [ https://issues.apache.org/jira/browse/MAILBOX-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220006#comment-13220006 ]

Ioan Eugen Stan commented on MAILBOX-170:

Hello Eric, long post ahead :)

First, could you please explain what you meant by efficiently querying the mailbox? I
don't follow.

Second, I don't believe a pure HBase implementation is the best option. Let me explain why: HBase
can't handle large emails, and storing them inside HBase will lead to performance issues (I
have some experience with this from working for my current employer). That's why I'm planning
to move message storage to HDFS.

Basically I wish to create an mbox on steroids: a replicated mbox that can provide indexed
access to messages. I plan to store mailboxes as SequenceFiles and store in HBase the offset
of the key-value pair that holds each message.
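
To make the idea concrete, here is a rough sketch of "append-only file plus offset index". It is
not Hadoop code: a local RandomAccessFile stands in for the SequenceFile in HDFS, a TreeMap stands
in for the HBase table, and the class name, method names and record layout (uid, length, body) are
made up for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.TreeMap;

/** Local-file stand-in for one mailbox file plus its offset index (hypothetical names). */
final class OffsetIndexedMailbox implements AutoCloseable {
    private final RandomAccessFile file;
    // Stand-in for the HBase table: UID -> offset of the record in the mailbox file.
    private final TreeMap<Long, Long> offsetByUid = new TreeMap<>();

    OffsetIndexedMailbox(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    /** Appends one record (uid, length, body) at the end and remembers where it starts. */
    long append(long uid, byte[] body) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.writeLong(uid);
        file.writeInt(body.length);
        file.write(body);
        offsetByUid.put(uid, offset);
        return offset;
    }

    /** Looks up the offset, seeks to it and reads exactly one record; null if the UID is unknown. */
    byte[] read(long uid) throws IOException {
        Long offset = offsetByUid.get(uid);
        if (offset == null) {
            return null;
        }
        file.seek(offset);
        if (file.readLong() != uid) {
            throw new IOException("index points at the wrong record for uid " + uid);
        }
        byte[] body = new byte[file.readInt()];
        file.readFully(body);
        return body;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A read costs one index lookup plus one seek, which is the property I want to get from HBase + HDFS.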

Message additions will be appends, and we will use ZK (ZooKeeper) locking to synchronize write access between
multiple instances of James. Deletes will be instant markers plus MR (MapReduce) jobs that do the permanent clean-up:
create a copy of the old file with just the messages that are not deleted, then update the references
in HBase. Reads will be done by opening the file, seeking to the stored offset and retrieving the message. I plan
to mimic the Hadoop MapFile in HBase. I don't wish to use MapFile directly because it
uses two files instead of one (each file uses 150 bytes of RAM + one block, so not good with
millions of mailboxes, especially when we have HBase). All the metadata will be stored in
HBase like it is now, for fast access; the same will (maybe) apply to message headers.
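
The clean-up step could look roughly like the sketch below. Again this is plain Java on local
files, not a MapReduce job: it assumes a made-up record layout (uid, length, body), takes the set
of UIDs marked as deleted, copies the surviving records to a new file and returns the new offsets
that would then be written back to HBase. All names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Copy-and-reindex clean-up for one mailbox file (illustrative stand-in, hypothetical names). */
final class MailboxCompaction {
    // Each record starts with an 8-byte UID and a 4-byte body length.
    static final int HEADER_BYTES = Long.BYTES + Integer.BYTES;

    static void writeRecord(DataOutputStream out, long uid, byte[] body) throws IOException {
        out.writeLong(uid);
        out.writeInt(body.length);
        out.write(body);
    }

    /** Copies every record whose UID is not marked deleted; returns UID -> offset in the new file. */
    static Map<Long, Long> compact(Path oldFile, Set<Long> deletedUids, Path newFile)
            throws IOException {
        Map<Long, Long> newOffsets = new TreeMap<>();
        long remaining = Files.size(oldFile);
        long writeOffset = 0;
        try (DataInputStream in = new DataInputStream(
                     new BufferedInputStream(Files.newInputStream(oldFile)));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(newFile)))) {
            while (remaining > 0) {
                long uid = in.readLong();
                // Simplification: the whole body is buffered in memory here.
                byte[] body = new byte[in.readInt()];
                in.readFully(body);
                remaining -= HEADER_BYTES + body.length;
                if (deletedUids.contains(uid)) {
                    continue; // marked deleted: drop it from the new file
                }
                newOffsets.put(uid, writeOffset);
                writeRecord(out, uid, body);
                writeOffset += HEADER_BYTES + body.length;
            }
        }
        return newOffsets;
    }
}
```

The old file stays untouched until the new offsets are in place, so readers can keep using it
while the clean-up runs.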

Messages will be stored with the UID as key (UIDs are ascending), which means we can also iterate
over them for bulk loads.
Also, because a file is stored in HDFS and replicated, we can have good performance since
readers can access it from many nodes. I have to see the message access pattern to optimize
this. Replication is done per file, so we can replicate frequently accessed mailboxes more times
than usual => good performance on reads because we can read in parallel => they are
immutable ;).

I plan to implement a special type of Writable that will allow us to stream the message from
HBase and avoid loading the whole message in memory. BytesWritable is fine for a start, but it uses
readFully to load the whole value of a sequence file record (== our message), so big messages will cause
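
One possible shape for the streaming part, as a sketch only: instead of reading the whole value
into a byte array, hand the caller an InputStream that is limited to the length of one record and
let it copy the body in small chunks. This is plain java.io, not a Hadoop Writable; the class name
and the length-prefixed layout are assumptions for illustration.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Exposes exactly one record body of a larger stream without buffering it (hypothetical name). */
final class BoundedRecordStream extends FilterInputStream {
    private long remaining; // bytes of this record not yet handed to the caller

    BoundedRecordStream(InputStream in, long length) {
        super(in);
        this.remaining = length;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int b = in.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            return -1; // end of this record, even if the file continues
        }
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(in.available(), remaining);
    }

    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public void close() {
        // Deliberately leaves the underlying stream open so the next record can be read.
    }

    /** Copies a stream in fixed-size chunks; memory use is bounded by bufferSize. */
    static long copy(InputStream from, OutputStream to, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = from.read(buffer, 0, buffer.length)) != -1) {
            to.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```

With something like this, memory use depends on the copy buffer and not on the message size.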

I plan to use the Hadoop FileSystem class, so we will use the distributed filesystem HBase
will use => this means the implementation could run on any distributed fs supported by

I also think HBase is intimately tied to Hadoop and things will not change in the near future,
so not taking advantage of that is kind of a dumb thing to do.

Basically that's all. With enough free time I think we can make James run in a cluster.


> Store mailboxes in HDFS SequenceFile
> ------------------------------------
>                 Key: MAILBOX-170
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-170
>             Project: James Mailbox
>          Issue Type: Improvement
>          Components: hbase
>    Affects Versions: 0.4
>            Reporter: Ioan Eugen Stan
>            Assignee: Ioan Eugen Stan
>             Fix For: 0.5
> The current implementation stores messages directly in HBase. I believe a better approach
is to store the messages in SequenceFiles as <mail_ID>: <message_data> pairs. HBase
will store the offsets into the SequenceFile for each mailbox, for fast access similar
to a Hadoop MapFile.
