mina-dev mailing list archives

From Julien Vermillard <jvermill...@gmail.com>
Subject Re: Minackathon feedback
Date Sat, 08 Jun 2013 19:00:57 GMT
Thanks for the long summary :)
Comments inline.

On Sat, Jun 8, 2013 at 10:40 AM, Emmanuel Lécharny <elecharny@gmail.com> wrote:
> Hi guys,
> we have spent a couple of days this week with Julien and Jeff during the
> EclipseCon working on MINA 3. We have experimented with some things, ran
> some benchmarks, and studied them. This is a short sum-up of what we did
> and the results we've got.
> 1) Performances
> We have done some tests with MINA 3 and Netty 3 TCP. Basically, we ran
> the benchmark code we have either locally (the client and the server on
> one machine) or with two machines (the server and the client on two
> machines). What it shows is that the difference between MINA 3 (M3) and
> Netty 3 (N3) varies with the size of the exchanged messages. M3 is
> slightly faster up to 100Kb messages, then N3 is faster up to 1Gb
> messages, then N3 is clearly having some problems.

You mean 1MB I think :)
Anyway, sending such a big message is a weird use case for me.
On my Linux machine, N3 is slower than M3 by 20% on small TCP messages.
And M3 provides idle detection, but I'm not sure it was activated on N3.

> When we conduct tests with the server on one machine and the client on
> another machine, we are CPU bound. On my machine, we can reach roughly
> 65 000 1kb messages per second (either with M3 or N3). There is no
> statistically relevant difference. The CPU is at 90%, with roughly 85%
> system, which means the CPU is busy processing the sockets; the impact
> of our own code is insignificant. Note that we have measured reads, not
> writes.
> 2) Analysis
> One of the major differences between M3 and N3 is the buffer usage. There
> are two kinds of buffers: direct and heap. The direct buffers are
> allocated outside the JVM, the heap buffers are allocated within the JVM
> memory. It's important to understand that only direct buffers will be
> written to a socket, so at some point, we must move the data into a
> direct buffer.
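For reference, the two kinds of buffers you describe are visible directly in plain NIO (a minimal sketch, nothing MINA-specific):

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] inside the JVM heap.
        ByteBuffer heap = ByteBuffer.allocate(1024);
        // Direct buffer: native memory outside the heap; this is the kind
        // the OS can hand to the socket without an extra copy by the JDK.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
        System.out.println(heap.hasArray());   // true: accessible backing array
        System.out.println(direct.hasArray()); // false on HotSpot
    }
}
```

When you hand a heap buffer to a SocketChannel, the JDK copies it into an internal direct buffer behind your back, which is exactly the hidden cost discussed below.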
> So basically, we would like to push the message into a direct buffer as
> soon as possible, like in the encoder. That means we have to allocate a
> direct buffer to do the job. It seems to be a smart idea, at first, but...
> There is a bug in the JVM: http://bugs.sun.com/view_bug.do?bug_id=4469299
> It says "In some cases, particularly with applications with large heaps
> and light to moderate loads where collections happen infrequently, the
> Java process can consume memory to the point of process address space
> exhaustion." Bottom line, as soon as you have heavy allocations, you
> might get an OOM, even for direct buffers.
> One more problem is that there is a hard limit on the size you can
> allocate, and it's defined by a parameter: -XX:MaxDirectMemorySize=<size>.
> It defaults to 64 MB in Java 6, or otherwise to the size you have set
> with the -Xmx parameter. You can't get any farther. All in all, it's
> pretty much the same thing as for the heap buffers. Assuming that
> allocating a direct buffer is twice as expensive as allocating a heap
> buffer (again, it depends on the Java version you are using), it's quite
> important not to allocate too many direct buffers.
> In order to work around the JVM bug, Sun suggests three possibilities:
> 1) Insert occasional explicit System.gc() invocations

We could do that only when a direct memory allocation hits an OOM; that's
ugly, but it could work.

> 2) Reduce the size of the young generation to force more frequent GCs.
> 3) Explicitly pool direct buffers at the application level.
> N3 has implemented the third approach, which is expensive, and creates a
> problem as soon as you send big messages, thus leading to the bad
> performance we see in this case in N3.
> We have a possible different approach: never allocate a direct buffer,
> always use a heap buffer. This will lead to a penalty of 3% in
> performance, but it eliminates the problem.
> Calling the GC is simply not an option.
Perhaps we can call System.gc() at the first OOM during a direct
buffer allocation?
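Something like this, I mean (an illustrative sketch, not MINA code; the helper name and the retry delay are mine). The trick is that the native memory behind a direct buffer is only released when its Java wrapper is collected, so a System.gc() can actually free address space before we retry:

```java
import java.nio.ByteBuffer;

public class DirectAllocator {
    // Hypothetical helper: try a direct allocation, and on the first
    // OutOfMemoryError ask the GC to reclaim unreachable direct buffers,
    // then retry once before giving up.
    public static ByteBuffer allocateDirect(int size) {
        try {
            return ByteBuffer.allocateDirect(size);
        } catch (OutOfMemoryError first) {
            System.gc();
            try {
                Thread.sleep(100); // give the GC a moment to do its work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // A second failure propagates to the caller.
            return ByteBuffer.allocateDirect(size);
        }
    }
}
```

Still ugly, as you say: an explicit GC in the middle of the I/O path is a latency hiccup, but only in the case where we would have crashed anyway.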
> 3) Write performances
> Writing data into a socket is tricky: we never know in advance how many
> bytes we will be able to write, and the data must be injected into a
> direct buffer before it can be written into the socket. There are a few
> possible strategies:
> 1) write the heap buffer into the channel
> 2) write a direct buffer into the channel
> 3) get a chunk of the heap buffer, copy it into a direct buffer, and
> write it into the channel.
> In case (1), we delegate the copy of the buffer to the channel. If the
> heap buffer is huge, we might copy it many times, as
> channel.write(buffer) will return the number of bytes written.
> Hopefully, channel.write() will not copy the whole heap buffer into
> a huge direct buffer, but we have no way to control what it does.
> In case (2), that means we allocate a huge direct buffer and put
> everything into it. It has the advantage of being done only once, and we
> don't have to take care of what's going on in the write() method. But
> the main issue is that we will potentially hit the JVM bug.
> In case (3), we can have an approach that tries to deal with both
> issues: we allocate a direct buffer that is associated with each thread -
> so only a few of them will be allocated - and we copy a maximum number
> of bytes that is determined by the socket sendBufferSize (roughly 64kb).
> We will then copy the data from the heap buffer to the direct buffer at
> each round, and if everything goes well, we will just do the minimal
> number of copies. However, we may perfectly well have to copy the data
> many times, as the direct buffer might be shared with many other sessions.
> All in all, there is no perfect strategy here. We can improve the third
> strategy by using an adaptive copy: as we know how many bytes were
> written, we can limit the number of bytes we copy into the direct buffer
> to the size the socket was able to send in the last few rounds.
> The important thing to remember is that we *have* to keep the buffer to
> send in a stack until it has been fully written, which may lead to some
> problems when the clients are slow readers and the server has many
> clients to serve.
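Strategy (3) as I understand it would look roughly like this (a sketch under my own assumptions; the class and method names are illustrative, not MINA API):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

public class ChunkedWriter {
    // One direct buffer per thread, sized near the socket send buffer
    // (~64 KB), reused for every session this thread serves.
    private static final int CHUNK = 64 * 1024;
    private static final ThreadLocal<ByteBuffer> DIRECT =
        ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(CHUNK));

    // Copies as much of the heap buffer as fits into the per-thread
    // direct buffer, writes it, and reports how many bytes the channel
    // accepted. The caller must keep the heap buffer queued until it is
    // fully drained (heap.remaining() == 0).
    public static int writeChunk(WritableByteChannel ch, ByteBuffer heap)
            throws IOException {
        ByteBuffer direct = DIRECT.get();
        direct.clear();
        int n = Math.min(heap.remaining(), direct.remaining());
        int savedLimit = heap.limit();
        heap.limit(heap.position() + n);
        direct.put(heap);          // copy one chunk, heap -> direct
        heap.limit(savedLimit);
        direct.flip();
        int written = ch.write(direct);
        // If the socket took fewer bytes than we copied, rewind the heap
        // buffer so the unsent tail is retried on the next round.
        heap.position(heap.position() - (n - written));
        return written;
    }
}
```

The adaptive variant would just shrink `n` toward whatever `written` came back as on the previous rounds, so we stop copying bytes the socket will refuse anyway.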
> 4) Selectors
> There is no measurable difference on the server whether we use one
> single selector or many. It seems that most of the time is consumed in
> the select() method, no matter what. The original design, where we
> created many selectors (as many as we have processors, plus one), seems
> to be based on some urban legend, or at least, based on Java 4. We have
> to reassess this design.

I think you were saturating the network. Adding more selector loops would
help saturate more CPU cores with more clients.
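The "processors + 1" design boils down to something like this (an illustrative sketch, not the actual MINA code; names are mine):

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicInteger;

// A pool of selectors, each meant to be driven by its own thread's
// select() loop, so the I/O processing can spread over several cores.
public class SelectorPool {
    private final Selector[] selectors;
    private final AtomicInteger next = new AtomicInteger();

    public SelectorPool() throws IOException {
        int n = Runtime.getRuntime().availableProcessors() + 1;
        selectors = new Selector[n];
        for (int i = 0; i < n; i++) {
            selectors[i] = Selector.open();
        }
    }

    // Pick the next selector round-robin for a freshly accepted channel.
    public Selector nextSelector() {
        return selectors[Math.abs(next.getAndIncrement() % selectors.length)];
    }

    public int size() {
        return selectors.length;
    }

    public void close() throws IOException {
        for (Selector s : selectors) {
            s.close();
        }
    }
}
```

Note the usual subtlety: registering a channel has to be coordinated with the thread blocked in that selector's select() (wakeup() plus a registration queue), otherwise the register call can block. With one client on a saturated link, more loops buy nothing, which would explain your numbers.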

> 5) Conclusion
> We have more tests to conduct. This is not simple: it all depends on the
> JVM we are running the server on, and many of the aspects may be
> configured. The next steps would be to conduct tests with the various
> scenarios, on different JVMs, with different sizes. We may need to
> design a pluggable system for handling the reads and the writes; we can
> use a factory for that.
> Bottom line, we would also like to compare a NIO-based server with a
> BIO-based server. I'm not sure that we have a big performance penalty
> with Java 7.
> Java 7 is way better than Java 6 in the way it handles buffers too.
> There is no reason to use Java 6 these days; it's dead anyway. It would
> be interesting to benchmark Java 8 to see what it brings.

Yep, JDK 7 proved to be really faster for buffer allocation. Do you have
the byte buffer allocation micro-benchmark somewhere? I would like to run
it on Linux.

