nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument
Date Mon, 22 Nov 2010 13:09:14 GMT


Markus Jelsma updated NUTCH-936:

    Attachment: NUTCH-936-v13-1.patch

Here are patches for the current 1.2 stable, branch 1.3 and trunk. 

> LanguageIdentifier should not set empty lang field on NutchDocument
> -------------------------------------------------------------------
>                 Key: NUTCH-936
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.3, 2.0
>         Attachments: NUTCH-936-v12-1.patch, NUTCH-936-v13-1.patch, NUTCH-936-v13-1.patch
> For some reason the language identifier plugin sometimes sets an empty value for the
> lang field. It is confirmed to occur in 1.2 when parsing a scanned PDF file which cannot be
> OCR'd to proper text, resulting in an empty content field. Anyway, whether it's a problem
> with the parser or not, the plugin itself should not add an empty value because the content
> field can always be empty. The plugin already checks for a null value and then sets the lang
> field to `unknown`, which is fine. But when the lang string is empty, it should also be set
> to `unknown`.
> This might break clients that have conditional logic on the empty value but not on the
> `unknown` value, because `unknown` may never have occurred in their setup and they might
> therefore not have added it to their logic. However, it seems like overkill to put
> this proposal behind a configuration option and let Nutch by default continue to behave as
> it currently does. Any thoughts on this one?
> Here's the troublesome URL:
> that returns an empty content field and an empty lang string in 1.2 and presumably in trunk
> and other versions as well.
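
The behaviour proposed in the description can be illustrated with a minimal Java sketch. This is not the content of the attached patches and it does not use the plugin's real classes: `LangFieldSketch`, `normalizeLang` and `addLangField` are hypothetical names, and a plain `Map` stands in for a NutchDocument. It only shows the idea of treating an empty lang string the same way as a null one.

```java
import java.util.Map;

/**
 * Hypothetical illustration only; not the real language identifier plugin code.
 */
final class LangFieldSketch {

    /** Fallback value named in the issue description. */
    static final String UNKNOWN = "unknown";

    /**
     * Returns "unknown" when the detected language is null or an empty
     * string, otherwise returns the detected value unchanged.
     */
    static String normalizeLang(String lang) {
        if (lang == null || lang.isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }

    /**
     * Stores the normalized value under the "lang" key of a map that stands
     * in for a NutchDocument (an assumption made for this sketch).
     */
    static Map<String, String> addLangField(Map<String, String> doc, String detected) {
        doc.put("lang", normalizeLang(detected));
        return doc;
    }
}
```

With this sketch, a document whose content field is empty would end up with `lang` set to `unknown` rather than an empty string, while a detected value such as `nl` would pass through untouched. Clients that branch on the empty value would need to branch on `unknown` instead, which is the compatibility concern raised above.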

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
