nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17
Date Fri, 15 Dec 2017 12:44:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292450#comment-16292450 ]

Sebastian Nagel commented on NUTCH-2439:
----------------------------------------

Really? I'm almost done with a PR for the upgrade (I had to resolve a dependency conflict which
breaks multiple parse-tika tests), but the number of errors written to stderr is still hardly
acceptable:
{noformat}
$ bin/nutch parsechecker -Dplugin.includes="protocol-http|parse-tika" http://localhost/nutch/test.pdf >/dev/null
Dec 15, 2017 1:37:59 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Dec 15, 2017 1:37:59 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Dec 15, 2017 1:37:59 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
{noformat}



> Upgrade to Apache Tika 1.17
> ---------------------------
>
>                 Key: NUTCH-2439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2439
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
