nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2703) parse-tika: Boilerpipe should not run for non-(X)HTML pages
Date Thu, 11 Apr 2019 10:08:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815285#comment-16815285 ]

ASF GitHub Bot commented on NUTCH-2703:
---------------------------------------

sebastian-nagel commented on pull request #449: NUTCH-2703 parse-tika: Boilerpipe should not run for non-(X)HTML pages
URL: https://github.com/apache/nutch/pull/449

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> parse-tika: Boilerpipe should not run for non-(X)HTML pages
> -----------------------------------------------------------
>
>                 Key: NUTCH-2703
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2703
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser, plugin
>    Affects Versions: 1.15
>            Reporter: Hany Shehata
>            Priority: Critical
>             Fix For: 1.16
>
>         Attachments: NUTCH-2703.patch
>
>
> Boilerpipe runs on non-(X)HTML pages, which requires more resources.
> In my test scenario, my websites contain large PDFs, and with Boilerpipe enabled I have to assign 8500 MB of Java heap for the crawl job to finish without issues.
> Disabling Boilerpipe lets me reduce the JVM heap to 500 MB with no issues.
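
The behaviour requested in the issue title could be sketched roughly as below. This is only an illustration of the idea, not the attached NUTCH-2703.patch or the code in pull request #449: the class name BoilerpipeGate, the method shouldUseBoilerpipe, and the exact set of media types treated as (X)HTML are all assumptions made for this example.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Nutch): decides whether boilerplate
// removal should be applied to a document, based on its content type.
final class BoilerpipeGate {

    // Assumption: these two media types are what "(X)HTML" means here.
    private static final Set<String> HTML_TYPES =
            Set.of("text/html", "application/xhtml+xml");

    // Returns true only if Boilerpipe is enabled AND the document is (X)HTML.
    static boolean shouldUseBoilerpipe(boolean boilerpipeEnabled, String contentType) {
        if (!boilerpipeEnabled || contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = contentType.indexOf(';');
        String base = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return HTML_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(shouldUseBoilerpipe(true, "text/html; charset=UTF-8")); // prints true
        System.out.println(shouldUseBoilerpipe(true, "application/pdf"));          // prints false
    }
}
```

With a check of this kind, large PDFs such as those in the reporter's scenario would go through the normal parse path and skip boilerplate extraction.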



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
