james-server-dev mailing list archives

From "Markus Wiederkehr" <markus.wiederk...@gmail.com>
Subject Re: [mime4j] Simple benchmark for testing performance of the MIME stream parser
Date Sun, 04 Jan 2009 12:56:13 GMT
On Sun, Jan 4, 2009 at 11:14 AM, Robert Burrell Donkin
<robertburrelldonkin@gmail.com> wrote:
> On Sat, Jan 3, 2009 at 8:15 PM, Markus Wiederkehr
> <markus.wiederkehr@gmail.com> wrote:
>> On Wed, Dec 24, 2008 at 7:21 PM, Oleg Kalnichevski <olegk@apache.org> wrote:
>>> Folks
>>>
>>> I took liberty to commit an ultra-simple benchmark I use for testing
>>> performance of the MIME stream parser.
>>>
>>> http://svn.apache.org/viewvc?view=rev&revision=729347
>>>
>>> Feel free to improve / extend / remove if useless.
>>
>> I have extended the class a bit. It is now possible to choose from
>> four different tests.
>>
>> Test 0 is the one Oleg wrote. It reads from a MimeTokenStream until
>> its end is reached.
>> Test 1 uses a MimeStreamParser and reports to an empty AbstractContentHandler.
>> Test 2 uses a MimeStreamParser and reports to an empty SimpleContentHandler.
>> Test 3 creates Message objects in memory.
>>
>> On my machine the results are:
>> Test 0: ~ 8 sec
>> Test 1: ~ 8 sec
>> Test 2: ~ 41 sec
>> Test 3: ~ 47 sec
>>
>> So it looks like parsing the header fields consumes about 80 percent of test 2.
>>
>> The difference between #2 and 3 is probably caused by copying the
>> message bodies into Storage objects.
>>
>> Maybe the header fields should be parsed lazily?
>
> IIRC there are a few wrinkles with this (at least some need to be
> parsed and some care need to be taken with folded values) but i think
> only structural headers really need to be parsed on the first pass.
>
>> Does anybody have a better idea?
>
> (this one isn't really a better idea but it's a little different so
> i'll throw it out there and see what happens...)
>
> the minimal useful MIME parser would read just the structural headers
> and the boundaries: dividing the stream into header lines and body
> parts without unnecessary parsing of the contents.

I think this already happens in a way. Look into DefaultBodyDescriptor
for example. There is a method parseContentType which determines the
boundary string from the content-type field. Note that this is used by
MimeTokenStream and has nothing to do with building a DOM.
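
To illustrate how little work this structural step needs, here is a minimal sketch of pulling the boundary parameter out of a Content-Type value. It is not the code in DefaultBodyDescriptor; the class BoundarySketch and the method boundaryOf are hypothetical names, and the regular expression ignores escaped quotes and comments that a full parser would handle.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Mime4j: extracts only the boundary
// parameter and leaves everything else in the field unparsed.
final class BoundarySketch {
    // Matches ;boundary=token or ;boundary="quoted", case-insensitively.
    // Assumption: no backslash-escaped quotes inside the quoted form.
    private static final Pattern BOUNDARY = Pattern.compile(
            ";\\s*boundary\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
            Pattern.CASE_INSENSITIVE);

    static Optional<String> boundaryOf(String contentType) {
        // Unfold first: a line break followed by whitespace continues the field.
        String unfolded = contentType.replaceAll("\\r?\\n[ \\t]+", " ");
        Matcher m = BOUNDARY.matcher(unfolded);
        if (!m.find()) {
            return Optional.empty();
        }
        // Group 1 is the quoted form, group 2 the bare token form.
        return Optional.of(m.group(1) != null ? m.group(1) : m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(boundaryOf("multipart/mixed;\r\n boundary=\"abc123\""));
        System.out.println(boundaryOf("text/plain; charset=us-ascii"));
    }
}
```

A scan like this is cheap compared with a grammar-based parse of every header field, which is consistent with tests 0 and 1 taking about the same time.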

Later, when a Message object gets built, all header fields are parsed
_again_, only this time with a javacc-generated parser. This is where
things get slow, and this is what could be done lazily in my opinion.
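
A minimal sketch of what lazy parsing could look like, assuming the raw field body is kept and the expensive parse is deferred until the structured value is first requested. LazyField is a hypothetical class, not part of Mime4j, and the Function passed in merely stands in for the javacc-generated parser.

```java
import java.util.function.Function;

// Hypothetical illustration, NOT the Mime4j Field API: keeps the raw text
// and runs the expensive parser at most once, on first access.
final class LazyField<T> {
    private final String name;
    private final String rawBody;
    private final Function<String, T> parser; // stand-in for the real parser
    private T parsed;
    private boolean parsedYet;

    LazyField(String name, String rawBody, Function<String, T> parser) {
        this.name = name;
        this.rawBody = rawBody;
        this.parser = parser;
    }

    String getName() {
        return name;
    }

    // Cheap: available without any parsing.
    String getRawBody() {
        return rawBody;
    }

    // Expensive on the first call only; later calls return the cached value.
    synchronized T getParsed() {
        if (!parsedYet) {
            parsed = parser.apply(rawBody);
            parsedYet = true;
        }
        return parsed;
    }

    public static void main(String[] args) {
        LazyField<Integer> field = new LazyField<>(
                "Content-Length", " 42 ", s -> Integer.parseInt(s.trim()));
        System.out.println(field.getName() + " raw: '" + field.getRawBody() + "'");
        System.out.println("parsed: " + field.getParsed());
    }
}
```

With this shape, a message whose fields are only copied or stored would never pay the parsing cost, while callers that need the structured value still get it on demand.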

> the generalised use case i have in mind is streaming into storage.
> this use case occurs naturally when dealing with mail protocols but
> has other applications (for example, in CMRs).
>
> 1. a MIME message starts to be delivered to a socket
> 2. the protocol processor feeds the stream to a parser
> 3. the processor analyzes the boundaries and streams header lines and
> body parts to permanent storage without unnecessary semantic parsing
> of the meta-data
> 4. when the message is complete, the processor continues to parse the
> incoming stream
>
> one problem with full DOMs (as used by JavaMail) is that large MIME
> documents are too big to fit in memory. this causes problems for
> protocol servers. a structural DOM (maintaining at most meta-data in
> memory whilst allowing access to content through streams) backed by
> storage would be much more useful in this case.

But isn't that what we already have with Mime4j? The structure of the
message is kept in memory (including the header fields) whereas the
contents of text and binary parts are kept in Storage objects (on disk
or wherever).

The only thing is that base64 and quoted-printable parts are decoded
before they get stored. But I don't think this is necessarily the
number one performance problem.

The benchmark test message currently has 7-bit encoded body parts. I
have changed that to base64 to see what happens. It turns out that
test #3 now runs in about 55 seconds (47 before the change).

My interpretation (from the differences in runtime of the various
tests) is that of those 55 seconds, 8 are used up by MimeTokenStream,
about 6 go into copying the body parts to storage, 8 are for base64
decoding, and the remaining 33 are for parsing header fields. That's
about 60 percent.

And by parsing header fields I mean Field.parse(), not the parsing
already done in MimeTokenStream.
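
The breakdown follows from subtracting the reported runtimes from one another. The sketch below reproduces that arithmetic; the timings are the approximate figures from this thread, and the class and method names are made up for illustration.

```java
// Reproduces the runtime attribution from the reported benchmark figures.
// All timings are approximate seconds quoted in this thread.
final class TimingBreakdown {
    // Share of the total, in percent.
    static double percentOf(double part, double total) {
        return 100.0 * part / total;
    }

    // Header parsing cost: test 2 (fields parsed) minus test 1 (fields not parsed).
    static double headerParsingSeconds(double test1, double test2) {
        return test2 - test1;
    }

    public static void main(String[] args) {
        double test0 = 8;        // MimeTokenStream only
        double test1 = 8;        // MimeStreamParser, empty AbstractContentHandler
        double test2 = 41;       // MimeStreamParser, empty SimpleContentHandler
        double test3 = 47;       // Message objects, 7-bit bodies
        double test3Base64 = 55; // Message objects, base64 bodies

        double tokenStream = test0;
        double headers = headerParsingSeconds(test1, test2);
        double copying = test3 - test2;
        double base64 = test3Base64 - test3;

        System.out.printf("token stream: %.0f s%n", tokenStream);
        System.out.printf("header parsing: %.0f s%n", headers);
        System.out.printf("copying to storage: %.0f s%n", copying);
        System.out.printf("base64 decoding: %.0f s%n", base64);
        System.out.printf("header share: %.0f%%%n", percentOf(headers, test3Base64));
    }
}
```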

Of course this is only a rough estimate but I don't think I'm very far off.

Markus
