tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down
Date Thu, 08 Dec 2016 13:21:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732196#comment-15732196 ]

Tim Allison commented on TIKA-2180:
-----------------------------------

If I don't read from the response, after a few files, I get this:
{noformat}
SEVERE: Problem with writing the data, class org.apache.tika.server.resource.TikaResource$4, ContentType: text/plain
Dec 08, 2016 8:13:56 AM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}TikaResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
	at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
	at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
	at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
	at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
	at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
	at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
	at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
	at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:366)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:957)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jetty.io.EofException
	at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
	at org.eclipse.jetty.server.AbstractHttpConnection.flushResponse(AbstractHttpConnection.java:686)
	at org.eclipse.jetty.server.AbstractHttpConnection$Output.close(AbstractHttpConnection.java:1108)
	at org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.close(JettyHTTPDestination.java:332)
	at org.apache.cxf.transport.http.AbstractHTTPDestination$WrappedOutputStream.close(AbstractHTTPDestination.java:790)
	at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
	at org.apache.cxf.transport.http.AbstractHTTPDestination$BackChannelConduit.close(AbstractHTTPDestination.java:720)
	at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
	... 24 more
Caused by: java.io.IOException: An established connection was aborted by the software in your host machine
	at sun.nio.ch.SocketDispatcher.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:51)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
	at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:293)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:404)
	at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:341)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:378)
	at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:841)
	... 31 more
{noformat}

However, if I do something like this:
        // Send the same file 20 times, reading each response fully before the next request.
        for (int i = 0; i < 20; i++) {
            try (InputStream is = TikaInputStream.get(new File("C:/data/test_in/docx/Document (1) - Copy.docx"))) {
                Response response = WebClient.create(endPoint + TIKA_PATH)
                        //.type("application/rtf")
                        .accept("text/plain")
                        .put(is);

                // Remove any output left over from a previous run; Files.copy fails if the target exists.
                Path outFile = Paths.get("C:/data/test_out/out_"+i+".txt");
                if (Files.isRegularFile(outFile)) {
                    Files.delete(outFile);
                }
                // Copying the entity to disk consumes the whole response body.
                Files.copy((InputStream)response.getEntity(), outFile);
                System.out.println("RESULT: " + response.getStatus());
                if (response.getStatus() != 200) {
                    ex++;
                }
            }
            fileCount++;
        }

I don't get any exceptions.

I'm not sure whether the delay introduced by copying the bytes out is what prevents the exceptions,
or whether reading the content is itself necessary to prevent them.
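
One way to separate those two possibilities might be to read the response body without writing it anywhere, so the content is consumed with as little extra delay as possible. The following is only a sketch under that assumption, using plain java.io: the class name DrainResponse and the method drain are hypothetical and are not part of Tika or CXF, and a ByteArrayInputStream stands in for the stream that response.getEntity() returns in the loop above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or CXF.
class DrainResponse {

    // Reads the stream to end-of-stream, discards the bytes,
    // and returns how many bytes were read.
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response entity stream from the server.
        byte[] fakeBody = "extracted text".getBytes(StandardCharsets.UTF_8);
        try (InputStream body = new ByteArrayInputStream(fakeBody)) {
            System.out.println("drained " + drain(body) + " bytes");
        }
    }
}
```

In the loop above, the Files.copy call could be swapped for drain((InputStream) response.getEntity()). If the exceptions stay away, that would suggest that reading the content is what matters rather than the time spent writing to disk, although this has not been tested here.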


Also, my memory usage with the new parser never goes above 300MB.  Are you sure you're using
the new parser?

> Multiple requests on Tika to extract text slows down
> ----------------------------------------------------
>
>                 Key: TIKA-2180
>                 URL: https://issues.apache.org/jira/browse/TIKA-2180
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.13, 1.14
>         Environment: Windows OS, Open JDK, 4 core 32 GB RAM
>            Reporter: Ashish Basran
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, with new experimental SAX docx parser.png
>
>
> I observed that if I send multiple requests to Tika (e.g. http://localhost:8080/tika) with files of around 5MB, Tika is very slow to complete the action. I tried with ~20 random files; it took 170 seconds to process all the files in sequence. When I passed all the files in parallel, it took around 780 seconds to process the same set of files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
