gora-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Fwd: some guidance needed
Date Thu, 19 May 2011 19:02:10 GMT
CC to gora-dev...

Begin forwarded message:

> From: Robert Burrell Donkin <robertburrelldonkin@gmail.com>
> Date: May 19, 2011 11:53:16 AM PDT
> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> Subject: Re: some guidance needed
> Reply-To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
> On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan <stan.ieugen@gmail.com> wrote:
>> I have forwarded this discussion to my mentors so they are informed
> (I've hopped onto this list so no need to remember to copy me into the
> thread ;-)
> <snip>
>> Eric, one of my mentors, suggested I use Gora for
>> this and after a quick look at Gora I saw that it is an ORM for HBase
>> and Cassandra which will allow me to switch between them. The downside
>> with this is that Gora is still incubating, so advice about whether
>> to use it would be welcome. I will also ask on the Gora mailing list
>> to see how things are there.
> (I suspect there will be a measure of experimentation required in this
> project, so don't be afraid to try a spike or two)
>>>> I would encourage you to look at a system like HBase for your mail
>>>> backend. HDFS doesn't work well with lots of little files, and also
>>>> doesn't support random update, so existing formats like Maildir
>>>> wouldn't be a good fit.
> (Apache James is closer to the Microsoft Exchange space than traditional
> *nix mail user agents)
>> I don't think I understand correctly what you mean by random updates.
>> E-mails are immutable so once written they are not going to be
>> updated. But if you are referring to the fact that lots of (small)
>> files will be written in a directory and that this can be a problem
>> then I get it. This will also mean that mailbox format (all emails in
>> one file) will be even less appropriate than Maildir. But since e-mails
>> are immutable and adding a mail to the mailbox means appending a small
>> piece of data to the file this should not be a problem if Hadoop has
>> append.
> Essentially, there are two classes of data that mail storage requires:
> 1. read only MIME documents (mail messages) embedding meta-data (headers)
> 2. read-write meta-data sets about each document including flags for
> each (virtual) mail directory containing the document
> The documents are searched rarely. The meta-data sets are read often
> but written rarely.
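
The two classes of data described above can be pictured with a small sketch. This is only an illustration, not Apache James or Gora code: the class names `MimeDocument` and `MailboxEntry`, their fields, and the sample message are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MimeDocument:
    """Class 1: a read-only MIME message, written once and never updated."""
    doc_id: str
    raw: bytes  # full MIME text, headers included


@dataclass
class MailboxEntry:
    """Class 2: read-write metadata for one document in one (virtual) mailbox."""
    doc_id: str
    mailbox: str
    flags: set = field(default_factory=set)


# One stored document can appear in several mailboxes, each with its own flags.
doc = MimeDocument("msg-1", b"Subject: hello\r\n\r\nbody")
inbox = MailboxEntry(doc.doc_id, "INBOX")
archive = MailboxEntry(doc.doc_id, "Archive")
inbox.flags.add("\\Seen")  # updates metadata only; the document is untouched
```

Keeping the document immutable and the flags in a separate record is what allows the two classes to live in different stores, as the message goes on to suggest.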
> I suspect that emails are relatively small in Hadoop terms, and are
> often numerous. Might be interesting to see how a tuned HDFS instance
> performs when storing large numbers of small MIME documents. Should be
> easy enough to set up an experiment to benchmark. (I wonder whether a
> RESTful distributed storage solution might end up working better.)
> I suspect that the read-write meta-data sets will need HBase (or
> Cassandra). Would need to think carefully about design, I think.
>> The presentation on Vimeo stated that HDFS 0.19 did not have append.
>> I don't know yet what the status of that is, but things look a little
>> brighter. You could have a mailbox file that grows to a very
>> large size. All of a user's emails would then go into one big file
>> that is easy to manage; the only thing missing is a way of
>> fetching the emails. Since emails are appended to the file (inbox) as
>> they come, and you are usually interested in the latest emails
>> received, you could just read the tail of the file and do some indexing
>> based on that.
> I'm not hopeful about adopting an append-based approach. (Might be
> made to work but I suspect that the locking required for IMAP or POP3
> is likely to kill performance.)
> Robert
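
For readers following the append discussion above, here is a minimal sketch of the "append, then read the tail" idea. It rests on several assumptions: a local file stands in for an HDFS file with append support, the length-prefixed record layout and the in-memory offset list are invented for illustration, and the helper names `append_message` and `read_message` are hypothetical. It does not address the locking concern Robert raises for IMAP or POP3.

```python
import os
import struct
import tempfile


def append_message(path, raw):
    """Append one length-prefixed message and return its starting offset."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        # 4-byte big-endian length, then the message bytes
        f.write(struct.pack(">I", len(raw)) + raw)
    return offset


def read_message(path, offset):
    """Read back the message that starts at the given offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        (size,) = struct.unpack(">I", f.read(4))
        return f.read(size)


# Existing messages are never rewritten; the index only records where each starts.
path = os.path.join(tempfile.mkdtemp(), "inbox.dat")
index = [append_message(path, m) for m in (b"first", b"second", b"third")]

# The newest messages sit at the tail, so the last index entries find them.
latest = [read_message(path, off) for off in index[-2:]]
```

The offset index is itself small, frequently read metadata, which is the kind of data the thread suggests would need HBase or Cassandra rather than a flat file.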

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
