nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] Created: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument
Date Fri, 19 Nov 2010 17:56:15 GMT
LanguageIdentifier should not set empty lang field on NutchDocument

                 Key: NUTCH-936
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.2
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.3, 2.0

For reasons not yet determined, the language identifier plugin sometimes sets an empty value
for the lang field. This is confirmed in 1.2 when parsing a scanned PDF file that cannot be
OCR'd to proper text. Whether or not the parser is at fault, the plugin itself should not add
an empty value. The plugin already checks for a null value and sets the lang field to
`unknown` in that case, which is fine. When the lang string is empty, it should also be set
to `unknown`.
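
As a minimal sketch of the proposed check (not the plugin's actual code), the null and empty
cases could be folded into one guard. The class name `LangFieldSketch` and the method
`normalizeLang` are hypothetical names chosen for illustration, and treating whitespace-only
strings as empty is an extra assumption beyond what this issue asks for.

```java
// Hypothetical helper for illustration only; not Nutch's LanguageIdentifier code.
final class LangFieldSketch {
    static final String UNKNOWN = "unknown";

    // Returns the value that would be stored in the lang field.
    static String normalizeLang(String lang) {
        // Null is already mapped to "unknown" today; the proposal adds the empty case.
        // Assumption: whitespace-only strings are treated like empty ones.
        if (lang == null || lang.trim().isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }

    public static void main(String[] args) {
        System.out.println(normalizeLang(null)); // unknown
        System.out.println(normalizeLang(""));   // unknown
        System.out.println(normalizeLang("nl")); // nl
    }
}
```

With a guard like this, a detected language code passes through unchanged and only missing
or empty results fall back to `unknown`.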

This might break clients whose conditional logic handles the empty value but not the `unknown`
value: if `unknown` never occurred in their setup, they may not have added it to their
logic.

However, it might be overkill to put this change behind a configuration option and have
Nutch keep its current behaviour by default. Any thoughts on this one?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
