tika-dev mailing list archives

From "Tilman Hausherr (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2963) Tika hits an OOM error when extracting large .xlsx files
Date Fri, 11 Oct 2019 15:28:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949551#comment-16949551 ]

Tilman Hausherr edited comment on TIKA-2963 at 10/11/19 3:27 PM:
-----------------------------------------------------------------

"For docx and pptx type files, Tika can configure the SAX parser to improve decimation performance.
However, Tika still has an OOM error when extracting large files of type .xlsx. I have not
found a solution from the official. I have attached my own code below. It is also a solution
based on SAX parser. The code can be adjusted according to the actual situation. Excellent,
there are many shortcomings, everyone criticizes and corrects, thank you"

I wonder what is meant by "decimation performance". After deleting single words in Google
Translate, I suspect it means "extraction". So what she/he means is that the current solution
uses too much memory and that the proposed SAX-based solution is better.


was (Author: tilman):
"For docx and pptx type files, Tika can configure the SAX parser to improve decimation performance.
However, Tika still has an OOM error when extracting large files of type .xlsx. I have not
found a solution from the official. I have attached my own code below. It is also a solution
based on SAX parser. The code can be adjusted according to the actual situation. Excellent,
there are many shortcomings, everyone criticizes and corrects, thank you"

I wonder what is meant by "decimation performance". After deleting single words, I suspect
it means "extraction". So what she/he means is that the current solution uses too much memory
and that the proposed SAX-based solution is better.

> Tika hits an OOM error when extracting large .xlsx files
> ---------------------------------------------------------
>
>                 Key: TIKA-2963
>                 URL: https://issues.apache.org/jira/browse/TIKA-2963
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.20
>            Reporter: Feng Jiao Jiang
>            Priority: Major
>         Attachments: demo.java
>
>
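> For docx and pptx files, Tika can be configured to use a SAX parser to improve extraction
> performance. However, Tika still runs into OOM errors when extracting large .xlsx files.
> I have not yet found an official solution, so I am attaching my own code, which is also a
> SAX-based solution. Its parameters can be tuned to the actual situation. It surely has many
> shortcomings; criticism and corrections are welcome. Thank you.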
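
For readers unfamiliar with the idea behind a SAX-based approach, here is a minimal sketch.
It is not the attached demo.java (which is not reproduced in this message) and it is not
Tika's own code. It only assumes a simplified worksheet XML with <row>, <c> and <v> elements,
and uses the JDK's built-in SAX parser. The class and method names (SheetSaxSketch,
CellHandler, parse) are made up for illustration. A real .xlsx file is a zip archive, and
string cells usually point into a separate shared-strings part, which this sketch ignores.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: stream over simplified worksheet XML without building a DOM.
public class SheetSaxSketch {

    /** Receives parser events one at a time; only the current cell value is buffered. */
    static class CellHandler extends DefaultHandler {
        // For the demo the text is collected in memory; a real extractor would
        // write to a Writer or a downstream ContentHandler instead.
        final StringBuilder out = new StringBuilder();
        private final StringBuilder value = new StringBuilder();
        private boolean inValue = false;
        int cellCount = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("c".equals(qName)) {
                cellCount++;
            } else if ("v".equals(qName)) {
                inValue = true;
                value.setLength(0); // reuse the buffer for each cell value
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            if (inValue) {
                value.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("v".equals(qName)) {
                inValue = false;
                out.append(value).append('\t');
            } else if ("row".equals(qName)) {
                out.append('\n');
            }
        }
    }

    /** Parses the stream with the JDK's default (non-namespace-aware) SAX parser. */
    static CellHandler parse(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CellHandler handler = new CellHandler();
        parser.parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<worksheet><sheetData>"
                + "<row><c r=\"A1\"><v>1</v></c><c r=\"B1\"><v>2</v></c></row>"
                + "<row><c r=\"A2\"><v>3</v></c></row>"
                + "</sheetData></worksheet>";
        CellHandler h = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.print(h.out);
        System.out.println("cells=" + h.cellCount);
    }
}
```

Because the handler keeps only the current cell value, the parser's own memory use does not
grow with the number of rows, which is the property the reporter is after for large .xlsx files.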



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
