tika-dev mailing list archives

From "Tim Allison (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2963) Tika hits an OOM error when extracting large .xlsx files
Date Mon, 21 Oct 2019 16:30:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956240#comment-16956240 ]

Tim Allison commented on TIKA-2963:
-----------------------------------

Sorry for my delay.  I _think_ I've had time to look at the code.  I'm not sure how it is
substantially different from our current XLSX wrapper.

If you could point out how yours is different, that might help.  If you're able to share your
doc that triggers an OOM, I can take a look at it...or, if you can unzip the file and let
us know how big the sheets are and how big the shared strings table is, that might help.
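
In case it helps with gathering those numbers without unzipping by hand, here is a minimal sketch that reads them straight from the zip container using only the JDK. It assumes the conventional OOXML part names (worksheets under xl/worksheets/ and the shared strings table at xl/sharedStrings.xml); a file produced by an unusual writer may name its parts differently. The class and method names are made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika: reports the uncompressed size of
// each worksheet part and of the shared strings table in an .xlsx file.
public class XlsxPartSizes {

    // Assumption: worksheets live under xl/worksheets/ and the shared strings
    // table is xl/sharedStrings.xml, which is the usual layout but not guaranteed.
    public static Map<String, Long> partSizes(String xlsxPath) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                boolean isSheet = name.startsWith("xl/worksheets/") && name.endsWith(".xml");
                boolean isSharedStrings = name.equals("xl/sharedStrings.xml");
                if (isSheet || isSharedStrings) {
                    // getSize() is the uncompressed size in bytes, or -1 if unknown.
                    sizes.put(name, entry.getSize());
                }
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java XlsxPartSizes <file.xlsx>");
            return;
        }
        partSizes(args[0]).forEach(
            (name, size) -> System.out.printf("%-40s %,d bytes%n", name, size));
    }
}
```

Running it as `java XlsxPartSizes big.xlsx` (file name is a placeholder) prints one line per matching part. The uncompressed sizes are the interesting ones here, since that is what a parser has to work through.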


> Tika hits an OOM error when extracting large .xlsx files
> --------------------------------------------------------
>
>                 Key: TIKA-2963
>                 URL: https://issues.apache.org/jira/browse/TIKA-2963
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.20
>            Reporter: Feng Jiao Jiang
>            Priority: Major
>         Attachments: demo.java
>
>
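> For docx and pptx files, Tika can be configured to use a SAX parser to improve extraction performance. However, Tika still hits OOM errors when extracting large .xlsx files, and I have not yet found an official solution. I am attaching my own code below, which is also a SAX-parser-based solution; its parameters can be tuned to fit the actual situation. It has many shortcomings, so criticism and corrections are welcome. Thanks.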



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
