jclouds-notifications mailing list archives

From "Aled Sage (JIRA)" <j...@apache.org>
Subject [jira] [Created] (JCLOUDS-1187) Avoid excessive memory usage when processing massive http response message
Date Sun, 16 Oct 2016 12:18:20 GMT
Aled Sage created JCLOUDS-1187:

             Summary: Avoid excessive memory usage when processing massive http response message
                 Key: JCLOUDS-1187
                 URL: https://issues.apache.org/jira/browse/JCLOUDS-1187
             Project: jclouds
          Issue Type: Improvement
            Reporter: Aled Sage

With jclouds 1.9.3-SNAPSHOT and 2.0.0-SNAPSHOT (and all GA versions)...

If jclouds receives a crazily big http response (e.g. see https://github.com/jclouds/jclouds/pull/1020),
then jclouds consumes a huge amount of memory while processing that response. It holds multiple
copies of the response in-memory at the same time.

As reported in https://issues.apache.org/jira/browse/BROOKLYN-364, the memory usage (according
to {{jmap -histo:live <pid>}}) is most char arrays (which is what backs the StringBuilder
and String). When processing a 154MB response payload, the memory used by char arrays goes
something like this: 66MB -> 104MB -> 323MB -> 625MB -> 939MB -> 461MB ->
633MB -> 327MB -> 21MB.
(I don't know how much to believe the {{:live}} figures: is that definitely the size of char arrays
that could not be GC'ed?)

There are two main areas where this memory is consumed.

*Wire Log*

In the jclouds wire log, the following happens (approximately?!):

* We read the response input stream, writing it to a {{FileBackedOutputStream}} - nice!
* In {{Wire.wire(String, InputStream)}}, we read the {{FileBackedOutputStream}} into a {{StringBuilder}},
and then call {{getWireLog().debug(buffer.toString())}}:  
  The StringBuilder holds the {{char[]}}; the toString() duplicates it - so two copies in memory.  
  Unfortunately in the Softlayer example, it's all one huge line, so we logged it all in one log message.  
  I think (but need to dig into it more) that the logging framework (slf4j -> log4j in my case)
ends up with multiple copies as well, while processing the call to {{log.debug(String)}}
(hence the peak of 939MB of char arrays in memory).  
  When the method returns, all these copies can be GC'ed.
* The response payload has now been switched to the {{FileBackedOutputStream}}, so that will
be used subsequently.
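To illustrate why buffering the response in a {{FileBackedOutputStream}} is the right idea, here is a minimal sketch of that pattern in plain Java (the class name {{SpillingOutputStream}} and the threshold handling are my own simplification, not Guava's actual implementation): bytes stay in memory up to a threshold, then spill to a temp file so the heap stays bounded no matter how big the response is.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical minimal analogue of Guava's FileBackedOutputStream:
// buffer in memory up to a threshold, then spill to a temp file.
public class SpillingOutputStream extends OutputStream {
    private final int threshold;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream file;   // non-null once we have spilled
    private File backingFile;

    public SpillingOutputStream(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void write(int b) throws IOException {
        if (file == null && memory.size() + 1 > threshold) {
            // spill: copy the in-memory bytes to a temp file, append there from now on
            backingFile = File.createTempFile("payload", ".tmp");
            file = new FileOutputStream(backingFile);
            memory.writeTo(file);
            memory = null;
        }
        (file != null ? file : memory).write(b);
    }

    @Override
    public void close() throws IOException {
        if (file != null) file.close();
    }

    public boolean spilled() {
        return file != null;
    }

    public static void main(String[] args) throws IOException {
        SpillingOutputStream out = new SpillingOutputStream(4);
        out.write(new byte[]{1, 2, 3});
        System.out.println(out.spilled());   // false: still in memory
        out.write(new byte[]{4, 5, 6});
        System.out.println(out.spilled());   // true: spilled past the 4-byte threshold
        out.close();
    }
}
```

The heap cost of this stage is bounded by the threshold; the problem described above only appears once the buffered content is read back into a single {{StringBuilder}} for logging.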

*Json Parsing*

To parse the HTTP response:

* The response is passed to {{org.jclouds.http.functions.ParseJson.apply(HttpResponse)}}
* This calls {{json.fromJson(Strings2.toStringAndClose(stream), type)}}.  
* The {{Strings2.toStringAndClose}} calls {{CharStreams.toString(new InputStreamReader(input, Charsets.UTF_8))}}.  
  This reads the stream into a StringBuilder, then calls toString - so holds two copies in memory.  
  This explains the second spike in memory usage (though I'm surprised it went as high as
633MB of char arrays in memory).  
  When the method returns, we have our one String.  
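The buffering done here can be sketched roughly as follows (a simplification; the real {{Strings2.toStringAndClose}} delegates to Guava's {{CharStreams}}). Note that a Java {{char}} is 2 bytes, so a 154MB mostly-ASCII payload is roughly 308MB as one {{char[]}}, and about twice that while the StringBuilder copy and the String copy briefly coexist - which is consistent with the spikes observed above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ToStringAndClose {

    // Simplified stand-in for Strings2.toStringAndClose: reads the whole
    // stream into a StringBuilder, then toString() allocates a second char[].
    static String toStringAndClose(InputStream input) throws IOException {
        try (Reader reader = new InputStreamReader(input, StandardCharsets.UTF_8)) {
            StringBuilder out = new StringBuilder();     // copy #1: backing char[]
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();                       // copy #2: new char[] for the String
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("[1,2,3]".getBytes(StandardCharsets.UTF_8));
        System.out.println(toStringAndClose(in));        // prints [1,2,3]
    }
}
```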

*Possible Improvements to Wire Log*

{{org.jclouds.logging.internal.Wire}} could be configurable to only log the first X bytes
of a response (so crazily long messages would be truncated in the log).

Alternatively/additionally, {{Wire.wire}} could force splitting a huge line into multiple
log messages when calling {{getWireLog().debug(buffer.toString())}}. Again this could be configurable.

In production usage, I personally would always configure it to truncate: better to miss the
end of the response rather than risk an {{OutOfMemoryError}}. Note this particular part isn't
an issue if jclouds.wire is set to INFO or higher.
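The truncation idea could look something like this (a sketch only; the method name {{headForLogging}} and the {{maxLoggedChars}} setting are hypothetical, not existing jclouds API): read at most a configured number of characters for the log message, and mark the output if anything was cut off.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class TruncatingWireLog {

    // Sketch of the proposed behaviour: read at most maxLoggedChars from the
    // stream for logging, so a crazily long payload is truncated in the log
    // instead of being materialized in full on the heap.
    static String headForLogging(InputStream in, int maxLoggedChars) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] buf = new char[maxLoggedChars];
        int total = 0;
        int n;
        while (total < maxLoggedChars
                && (n = reader.read(buf, total, maxLoggedChars - total)) != -1) {
            total += n;
        }
        String head = new String(buf, 0, total);
        // one more read tells us whether anything was cut off
        return reader.read() == -1 ? head : head + "...[truncated]";
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("abcdefghij".getBytes(StandardCharsets.UTF_8));
        System.out.println(headForLogging(in, 4));   // prints abcd...[truncated]
    }
}
```

In {{Wire.wire}} this would replace reading the entire {{FileBackedOutputStream}} into one StringBuilder; the full payload would still be available downstream via the file-backed stream.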

*Possible Improvements to Json Parsing*

I think {{ParseJson.apply(InputStream stream, Type type)}} should pass the {{new InputStreamReader(inputStream)}}
to {{org.jclouds.json.Json.fromJson()}} (in a try-finally to close the stream, obviously).

This would require adding a new method to the interface {{org.jclouds.json.Json}}.

The implementation of {{GsonWrapper}} can handle that easily: it can call {{gson.fromJson(Reader,
Type)}} instead of {{gson.fromJson(String, Type)}}. It looks to me like that will parse the
stream as it reads it, rather than having to hold the whole string in memory at once.
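The proposed addition might look like the following (an API sketch of the hypothetical new shape, not the actual jclouds interface):

```java
import java.io.Reader;
import java.lang.reflect.Type;

// Sketch of the proposed org.jclouds.json.Json extension: a Reader-based
// overload lets GsonWrapper delegate to gson.fromJson(Reader, Type), which
// parses incrementally instead of requiring the whole payload as a String.
public interface Json {
    <T> T fromJson(String json, Type type);   // existing String-based entry point
    <T> T fromJson(Reader json, Type type);   // proposed: caller closes the Reader in a try-finally
}
```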

If we do these improvements, I don't think we'll ever hold the full 154MB char array in memory.

This message was sent by Atlassian JIRA
