james-server-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ofir Gross" <ofir_...@hotmail.com>
Subject RE: Memory, Large messages and SharedInputStreams (Was: Apache James / JAMES-134)
Date Mon, 23 Jan 2006 06:21:39 GMT

>Ofir Gross wrote:
>>Hi,
>>I share my thoughts:
>
>I appreciate this. Sorry If it takes so much to reply but I've been busy 
>with the spool synchronization problems and with my day-job.
>
>I'd like to keep the conversation on the dev-list because it is archived 
>for future reference and somebody else could eventually partecipate to the 
>brainstorming.
>
>>  There is a problem with the constructor:
>>
>>    public MimeMessage(MimeMessage source) throws MessagingException {
>>         [...]
>>         source.writeTo(bos);       <<----------Problem is here
>>[...]
>>Because it __always__ saves into a ByteArrayInputStream which is in 
>>memory. The only way to avoid it is either __not__ to call that 
>>constructor, or to subclass it like I did, in order to overide this 
>>method. When I have overiden this constructor, I saved to a temporary file 
>>instead of a byte array.
>
>Ok, that's clear, we must remove calls to that constructor from our code, 
>and we should use new MimeMessage(Session s, InputStream is).
>Furthermore we should pass a SharedInputStream to that constructor.

How is it done technically? How to do the conversion from 
MimeMessage(MimeMessage source) to MimeMessage(Session s, InputStream is) ?

>
>>You suggested to use the existing repository instead, which is a good 
>>idea. But two questions:
>>1. how do I locate the message in the repository form the MimeMessage 
>>"source"?
>
>I think this is not possible. You can get the source starting from the Mail 
>object in the repository but not viceversa.
>Why do you need that?

I wanted to use it in a subclass_of_MimeMessage that would override that 
constructor and there, in that overriding constructor, it would use the 
message in the repository in order to construct the new 
subclass_of_MimeMessage. This is instead of using a temporary file like I 
did (or a byte array like in the original MimeMessage constructor).  If it 
isn't possible now, it could be made possible in a subclass by putting a key 
field in that subclass, that would contain data that would serve as a  
reference to that message in the repository. Then in places where James 
constructs a new MimeMessage from the repository, it would instead construct 
the subclass, and then add to it the key that references that message in the 
repository. This way when that object would be passed in the future to such 
a constructor, it would contain a reference to its place in the repository, 
and so it could be constructed from the repository.

>
>>2. Does every MimeMessage exist in the repository?
>
>Not every mimemessage, because when we create a new mimemessage from 
>scratch (like the bounce mailet) we create the MimeMessage and after that 
>we send it to James that store it in the spool.
>BTW most time when we deal with big messages we probably received them from 
>the SMTPServer and we already have them in streamrepositories or 
>mailrepositories.


So I understand that some times it may happen that there would be a big 
message not in the repository. If that message would have a 
SharedInputStream contentStream, then no problem. If not, then it may be a 
problem. One way to deal with it is: If the message is immediately stored at 
the spool, then  its repository place reference in the subclass can be 
updated, and so it will have its place in the repository. If the spool 
storing doesn't happen immediately then a temporary file should be used. I 
guess the user of that class would need to tell it, if it should use a 
temporary file. Or have the spool write the repository reference in, when it 
saves the message in the repository, and the user will tell the subclass if 
he/she is done with it. Then the subclass would check itself, and if it is 
too big, and don't have a SharedInputStream, and is not in the repository, 
it will make a temp file. If this scheme would be used then in most cases 
there would be no temp file. But it still leaves room for improvement since 
I want to eliminate the temp file. An other way to deal with it, is that the 
users will make sure, that if the message is big, it will have a 
SharedInputstream contentStream.


>
>When the SMTPServer receive a new message it currently create a new 
>MailImpl using the "public MailImpl(String name, MailAddress sender, 
>Collection recipients, InputStream messageIn)" costructor.
>That constructor create a new MimeMessageSource using the InputStream from 
>the socket (new MimeMessageInputStreamSource(name, messageIn);)
>MimeMessageInputStreamSource currently store the message in a .m64 file and 
>eventually provide the stream to the following users:
>
>public synchronized InputStream getInputStream() throws IOException {
>    return new BufferedInputStream(new FileInputStream(file));
>}
>
>This is the first step.
>Maybe we should start changing the MimeMessageInputStreamSource to provide 
>a Shared input stream?!?

Yes, it looks like a good idea. There is already in javaMail 1.3.2 a 
com.sun.mail.util.SharedFileInputStream class. Version 1.4 has also a 
javax.mail.util.SharedFileInputStream.


The preceding method could be re-written as:

public synchronized InputStream getInputStream() throws IOException
{
   return new  SharedFileInputStream(file);
}

Very easy.

>
>Then, when a message is stored in a dbfile repository then the header is 
>stored in the body field of the repository db table while the body of the 
>message is stored in a streamrepository.
>
>MimeMessageJDBCSource is the object that handle this behaviour. So every  
>time we read a message from the db repository we use this object.
>We call this method of that object to get the InputStream:
>---
>public synchronized InputStream getInputStream() throws IOException
>---
>and it create a new SequenceInputStream using a ByteArrayInputStream of the 
>header and a the input stream provided by the streamrepository.get().
>---
>InputStream in = new ByteArrayInputStream(headers);
>if (sr != null) {
>    in = new SequenceInputStream(in, sr.get(key));
>}
>---
>The implementation of the streamrepository.get used is in 
>File_Persistent_Stream_Repository and you can see it uses a
>---
>final ResettableFileInputStream stream = new ResettableFileInputStream( 
>getFile( key ) );
>---
>
>Maybe we should work on the above files to always be able to provide shared 
>inputstreams.

Yes, it looks possible to me.

>
>
>>The other constructor: "public MimeMessage(Session session, InputStream 
>>is)" will be OK if the InputStream provided to it will be a 
>>SharedInputStream, and that can be done by tracking down calls to this 
>>constractor (by grep "new MimeMessage"), and wrapping up the provided 
>>input stream with a SharedInputStream implementing wrapper. I did that 
>>grep and looked at the results, and this looks possible.
>
>The main one is on the MimeMessageWrapper.loadMessage()
>----
>in = source.getInputStream();
>headers = loadHeaders(in);
>
>ByteArrayInputStream headersIn = new 
>ByteArrayInputStream(headers.toByteArray());
>in = new SequenceInputStream(headersIn, in);
>
>message = new MimeMessage(session, in);
>----
>source is the MimeMessageJDBCSource we already seen before and we could 
>change it to provide a SharedInputStream (to avoid a copy to a new file)
>You see a similar pattern as before: we create a new SequenceInputStream 
>from a ByteArrayInputStream and the stream provided by the source.
>We should find a way to have a SharedInputStream at the end of all this 
>steps, without the need to copy the stream on a new file.
>
>We could introduce your proposed wrapped only in a few mailet but we should 
>avoid using it in our core because we already have 2 wrappers over the 
>mimemessage and we can get better/cleaner result following the path I 
>describe in this mail.

If there are too many wrappers already, maybe one of them 
(MimeMessageWrapper?) could  be modified so it would override 
MimeMessage(MimeMessage source) constructor, and save a key reference to the 
message in the repository, and so no new class would be introduced.


>
>>Maybe if all InputStreams provided to that constructor will be 
>>SharedInputStream, then all MimeMessage will have a SharedInputStream 
>>implementing contentStream as well? If it is true, then the overiding 
>>constructor of "MimeMessage(MimeMessage source)" could use that stream, 
>>and it will not need to locate the message in the repository. But I am not 
>>sure wether it is true, because the constructor "MimeMessage(Session 
>>session)" don't touch content, or contentStream, and so if it is 
>>constructed that way, it will not have either, and so there will be no 
>>InputStream for it, in case it is sent this way to the 
>>MimeMessage(MimeMessage source) constructor.
>
>When the message is constructed from someone else using the new 
>MimeMessage(session) we have no power on how it is handled.
>IMHO we should start optimizing our own operations then we could write a 
>few docs on how to write optimized mailets.
>
>A further step would be to use streaming operations also for "db" only 
>repositories. We currently use blob operations that write and read full 
>bytearrays but most new dbs/jdbc drivers correctly supports the use of 
>streams for write and read operations of large contents. Here there would 
>be one more issue because they don't provide (obviously) SharedInputStreams 
>but their own InputStreams and I don't know how we could implement it. We 
>probably should use your proposed wrapper if we need that, or otherwise 
>create a wrapper over the previous InputStream implementing the 
>SharedInputStream and simply retreaving a new InputStream at every call to 
>newStream (returning, in turn, a new SharedInputStream).

I have seen that usage of byte arrays in the retrieval from the DB, and 
wondered why getInputStream was not used. Now I know. Well, if today 
DBs/drivers are ok, it is better to use it, since it will save RAM. As of 
the SharedInputStream version for it, a wrapper class can be made that would 
save the stuff that is needed for the stream retrieval from the DB, i.e. it 
will do the same thing that is done by the original method when it retrieves 
the message from the DB. Thus a new stream would be created, and the wrapper 
can save it in a member object. The start/end pair of newStream(start,end) 
can also be saved and be considered when the wrapper class forwards reads to 
the underlying InputStream member object. (So it will not allow reading out 
of the start/end boundaries)


>
>The worst part of this work is that it is very difficult to unit-test 
>memory issues: any idea on how to test it?

I don't know exactly what you mean, so what I write is probably not the 
right answer to it, but anyway here goes: if we have two versions, an old 
one that works and a new one that we test, we can make both versions produce 
an output file, and then compare the files (diff -s file1 file2). This 
output file may be constructed by adding calls to various methods of the 
tested class, and writing the results to the file. If the only thing 
different is one class, the difference would suggest an error in the new 
class. Since I don't know what you meant, maybe you could refer me to some 
existing unit-tests that you have, so I will have a better idea of what you 
meant.


>
>Stefano

_________________________________________________________________
FREE pop-up blocking with the new MSN Toolbar  get it now! 
http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org


Mime
View raw message