nutch-dev mailing list archives

From "Ferdy Galema (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions
Date Mon, 05 Mar 2012 15:37:56 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222404#comment-13222404 ]

Ferdy Galema commented on NUTCH-1253:
-------------------------------------

It does indeed seem to be broken on trunk. When running with default options in local mode,
every parse simply fails, which is pretty surprising. With the help of Dennis' instructions it
becomes clearer what the error is about. Note that nutchgora is not affected, even though at
first sight both branches seem to use the same library versions.
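
For anyone else trying to reproduce this: Dennis' instructions boil down to catching Throwable
rather than Exception, because AbstractMethodError is a java.lang.Error and slips past a plain
catch(Exception). Below is a minimal standalone sketch of that idea. It is not the actual
HtmlParser code; ParseStep and runAndReport are hypothetical names used only for illustration,
and the error is thrown by hand to simulate what the JVM raises on a real binary mismatch.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class ThrowableLoggingSketch {

    /** Hypothetical stand-in for the parser call made inside getParse. */
    interface ParseStep {
        String run() throws Exception;
    }

    /** Runs the step and returns its result, or the full stack trace as text on failure. */
    static String runAndReport(ParseStep step) {
        try {
            return step.run();
        } catch (Throwable t) {
            // Throwable, not Exception: AbstractMethodError extends Error,
            // so catch(Exception) would never see it.
            StringWriter buffer = new StringWriter();
            t.printStackTrace(new PrintWriter(buffer, true));
            return "parse failed: " + buffer;
        }
    }

    public static void main(String[] args) {
        // Simulated failure; in the real case the JVM raises this error itself.
        String report = runAndReport(() -> {
            throw new AbstractMethodError("simulated neko/xerces mismatch");
        });
        System.out.println(report);
    }
}
```

In the real plugin the trace would go to the plugin's logger instead of being returned as a
string; the point is only that the catch clause has to be wide enough to see an Error.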

I'm amazed that this error has not been noticed earlier. I cannot speak for users/devs that
are on 1.x, so I kindly ask if one of them is able to pick this issue up (or at least provide
some insight). My guess is that they use either tagsoup (instead of neko) or parse-tika for
HTML parsing. Then again, if that's the case, I don't know why the defaults are the way they
are now. Because of this I have not yet tested any of your patches, sorry Lewis.
                
> Incompatible neko and xerces versions
> -------------------------------------
>
>                 Key: NUTCH-1253
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1253
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>         Environment: Ubuntu 10.04
>            Reporter: Dennis Spathis
>         Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch
>
>
> The Nutch 1.4 distribution includes
>  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
>  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser (configured to
> use neko) is invoked during a local-mode crawl, the parse fails due to an AbstractMethodError.
> (Note: To see the AbstractMethodError, rebuild the HtmlParser plugin and add a
> catch(Throwable) clause in the getParse method to log the stacktrace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
> <plugin
>    id="lib-nekohtml"
>    name="CyberNeko HTML Parser"
>    version="1.9.11"
>    provider-name="org.cyberneko">
>    <runtime>
>        <library name="nekohtml-0.9.5.jar">
>            <export name="*"/>
>        </library>
>    </runtime>
> </plugin>
> Note the conflicting version numbers (version tag is "1.9.11" but the
> specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?
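
The mismatch Dennis points out between the plugin's version attribute and the bundled jar name
can also be checked mechanically. The following is a rough sketch, not part of Nutch:
PluginVersionCheck and mismatchedLibraries are hypothetical names, and the check assumes that
a library jar is named "<artifact>-<version>.jar", which is a naming convention and not
something the plugin descriptor format guarantees.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PluginVersionCheck {

    /** Returns the library jar names that do not end in "-<plugin version>.jar". */
    static List<String> mismatchedLibraries(String pluginXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(pluginXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse plugin descriptor", e);
        }
        Element plugin = doc.getDocumentElement();
        String declared = plugin.getAttribute("version");

        List<String> mismatches = new ArrayList<>();
        NodeList libraries = plugin.getElementsByTagName("library");
        for (int i = 0; i < libraries.getLength(); i++) {
            String jar = ((Element) libraries.item(i)).getAttribute("name");
            // Assumed convention: the jar name carries the declared version as a suffix.
            if (!jar.endsWith("-" + declared + ".jar")) {
                mismatches.add(jar);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // The descriptor quoted in the issue description.
        String xml =
              "<plugin id=\"lib-nekohtml\" name=\"CyberNeko HTML Parser\""
            + " version=\"1.9.11\" provider-name=\"org.cyberneko\">"
            + "<runtime><library name=\"nekohtml-0.9.5.jar\">"
            + "<export name=\"*\"/></library></runtime></plugin>";
        System.out.println(mismatchedLibraries(xml)); // prints [nekohtml-0.9.5.jar]
    }
}
```

A check like this only flags inconsistent metadata. It says nothing about whether a given
nekohtml version is actually binary-compatible with the xercesImpl jar on the classpath.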

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
