tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2725) Make tika-server robust against ooms/infinite loops/memory leaks
Date Thu, 06 Sep 2018 15:29:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605925#comment-16605925 ]

Tim Allison commented on TIKA-2725:

bq. What is tika-server typical env? stand-alone, distributed ... like replicas in cluster?

It varies, I'm sure.  I'm not sure what the most common use case is.  I would hope distributed:
swarm or similar.

bq. Are there some time limitation for recovery?

I think whoever starts the server should be able to set the per-file timeout threshold, although
I may be misunderstanding your question.

bq. How do we know what point to start processing from?

That wouldn't be tika-server's problem.  Clients calling tika-server would get an error message,
or potentially no response within a socket/http timeout range.  They should not reprocess
those docs.

bq. Do we mark documents which were processed?

Same as above, that's a client concern.

bq. For example, if tika-server had run on Docker swarm/K8S then orchestrator would have restarted a failed replica itself

To confirm that I understand this correctly: currently, if the tika-server process dies, swarm/k8s
will automatically restart it?  That's good to hear.  However, we don't currently have the
watcher thread within tika-server to kill its own process on oom/timeout, so as it is now,
only something catastrophic (the operating system, perhaps?) would take down tika-server.
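
The watcher thread mentioned above did not exist in tika-server at the time of this comment. A minimal sketch of what such a thread might look like follows, assuming a single in-flight parse; it covers the timeout case only, not oom. The names `ParseWatchdog`, `taskStarted`, `taskFinished` and `isTimedOut` are hypothetical and not taken from the Tika code base.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not tika-server code: a watcher that fires a callback
// when the current task has been running longer than a configured timeout.
class ParseWatchdog implements Runnable {
    private static final long IDLE = -1L;

    // Start time of the task in flight, or IDLE when nothing is being parsed.
    private final AtomicLong taskStartMillis = new AtomicLong(IDLE);
    private final long timeoutMillis;
    private final long pollMillis;
    private final Runnable onTimeout;

    ParseWatchdog(long timeoutMillis, long pollMillis, Runnable onTimeout) {
        this.timeoutMillis = timeoutMillis;
        this.pollMillis = pollMillis;
        this.onTimeout = onTimeout;
    }

    // Called by the request handler just before it starts parsing a file.
    void taskStarted() {
        taskStartMillis.set(System.currentTimeMillis());
    }

    // Called by the request handler when parsing ends, successfully or not.
    void taskFinished() {
        taskStartMillis.set(IDLE);
    }

    // True if a task is in flight and has exceeded the timeout at nowMillis.
    boolean isTimedOut(long nowMillis) {
        long start = taskStartMillis.get();
        return start != IDLE && nowMillis - start > timeoutMillis;
    }

    @Override
    public void run() {
        try {
            while (true) {
                if (isTimedOut(System.currentTimeMillis())) {
                    onTimeout.run();
                    return;
                }
                Thread.sleep(pollMillis);
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and let the watcher thread end.
            Thread.currentThread().interrupt();
        }
    }
}
```

To make the process kill itself, the callback could be `() -> Runtime.getRuntime().halt(1)`, with the watchdog started on a daemon thread. An orchestrator such as swarm/k8s, or a parent process, would then see the process die and could restart it. Handling concurrent requests would need per-request start times rather than the single timestamp assumed here.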

> Make tika-server robust against ooms/infinite loops/memory leaks
> ----------------------------------------------------------------
>                 Key: TIKA-2725
>                 URL: https://issues.apache.org/jira/browse/TIKA-2725
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
> Currently, tika-server is vulnerable to ooms, infinite loops and memory leaks.  I see
> two ways of making it robust:
> 1) use the ForkParser
> 2) have tika-server spawn a child process that actually runs the server, put a watcher
> thread in the child that will kill the child on oom/timeout/after x files.  The parent process
> can then restart the child if it dies.
> I somewhat prefer 2) so that we don't have to doubly pass the inputstream.  I propose
> 2), and I propose making it optional in Tika 1.x, but then the default in Tika 2.x.  We could
> also add a status ping from parent to child in case the child gets caught up in stop-the-world
> gc (h/t [~bleskes]).
> Other options/recommendations?
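
The parent side of option 2) in the issue description could be sketched roughly as below. This is an illustration under stated assumptions, not the implementation adopted in Tika: `ChildSupervisor`, `supervise` and `processLauncher` are hypothetical names, a zero exit code is assumed to mean "clean shutdown, do not restart", and the status ping from parent to child is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.IntSupplier;

// Hypothetical sketch, not tika-server code: a parent that keeps
// relaunching a child until it exits cleanly or restarts run out.
class ChildSupervisor {

    // Runs launchAndWait (which starts a child and returns its exit code)
    // until the child exits with 0 or maxRestarts restarts have been used.
    // Returns the total number of launches, including the first one.
    static int supervise(IntSupplier launchAndWait, int maxRestarts) {
        int launches = 0;
        while (true) {
            int exitCode = launchAndWait.getAsInt();
            launches++;
            if (exitCode == 0 || launches > maxRestarts) {
                return launches;
            }
        }
    }

    // Builds a launcher that starts the given command as a child process,
    // shares the parent's stdin/stdout/stderr, and blocks until it exits.
    static IntSupplier processLauncher(List<String> command) {
        return () -> {
            try {
                Process child = new ProcessBuilder(command).inheritIO().start();
                return child.waitFor();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for child", e);
            }
        };
    }
}
```

A call such as `ChildSupervisor.supervise(ChildSupervisor.processLauncher(command), 5)` would start the child and relaunch it up to five times, where `command` is whatever command line starts the child server (left unspecified here). Separating the restart loop from the process launch keeps the loop easy to exercise without spawning real processes.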

This message was sent by Atlassian JIRA