nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument
Date Fri, 19 Nov 2010 17:56:15 GMT
LanguageIdentifier should not set empty lang field on NutchDocument
-------------------------------------------------------------------

                 Key: NUTCH-936
                 URL: https://issues.apache.org/jira/browse/NUTCH-936
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.2
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.3, 2.0


For some reason the language identifier plugin sometimes sets an empty value for the lang
field. This is confirmed to occur in 1.2 when parsing a scanned PDF file that cannot be OCR'd
into proper text. Whether or not the underlying problem is in the parser, the plugin itself
should not add an empty value. The plugin already checks for a null value and sets the lang
field to `unknown` in that case, which is fine. When the lang string is empty, the field
should also be set to `unknown`.
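
The proposed behavior can be sketched roughly as below. This is only an illustration, not the
plugin's actual code: the class `LangFieldSketch`, the method `normalizeLang`, and the constant
`UNKNOWN` are hypothetical names, and the sketch assumes the detected language is available as
a plain `String` before it is added to the NutchDocument.

```java
// Hypothetical sketch; none of these names are taken from the Nutch code base.
public class LangFieldSketch {

    // Fallback value the plugin already uses when the language is null.
    public static final String UNKNOWN = "unknown";

    // Returns the value to store in the lang field:
    // the fallback for null or empty input, the input itself otherwise.
    public static String normalizeLang(String lang) {
        if (lang == null || lang.isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }

    public static void main(String[] args) {
        System.out.println(normalizeLang(null)); // null falls back to the default
        System.out.println(normalizeLang(""));   // empty string falls back as well
        System.out.println(normalizeLang("nl")); // a detected language passes through
    }
}
```

The sketch treats only null and the zero-length string as missing. Whether whitespace-only
values should also fall back to `unknown` is left open here.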

This might break clients that have conditional logic on the empty value but not on the
`unknown` value: `unknown` may never have occurred in their setup, so they might not have
added it to their logic.

However, it might be overkill to put this change behind a configuration option and have
Nutch keep its current behavior by default. Any thoughts on this one?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

