nutch-dev mailing list archives

From "Yossi Tamari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.
Date Mon, 28 Aug 2017 20:56:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144330#comment-16144330 ]

Yossi Tamari commented on NUTCH-2414:
-------------------------------------

Markus, if I understand correctly, there are two ways to implement what you suggest:
1. Add the functionality to every indexer plugin, so that it is applied after all the IndexingFilters have run.
2. Write an additional IndexingFilter plugin that returns null if the JEXL expression evaluates to false. It would have to be configured to run after the other plugins that enrich the data.
Which one are you suggesting?
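
To make option 2 concrete, here is a minimal, self-contained sketch of the idea. It does not use the real Nutch or JEXL APIs: DocFilter, runChain, rejectUnless and addLang are hypothetical stand-ins for an IndexingFilter, the filter chain, the proposed rejecting filter and an enriching plugin, and a plain Java Predicate stands in for the JEXL expression. It only illustrates why the rejecting filter has to run after the filters that add the fields it inspects.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RejectingFilterSketch {

    /** Hypothetical stand-in for an indexing filter; NOT the real Nutch interface. */
    interface DocFilter {
        /** Returns the (possibly enriched) document, or null to drop it. */
        Map<String, Object> filter(Map<String, Object> doc);
    }

    /** Runs the filters in order and stops as soon as one of them returns null. */
    static Map<String, Object> runChain(List<DocFilter> chain, Map<String, Object> doc) {
        for (DocFilter f : chain) {
            doc = f.filter(doc);
            if (doc == null) {
                return null; // document rejected, nothing gets indexed
            }
        }
        return doc;
    }

    /** The proposed filter: keeps the document only if the expression holds. */
    static DocFilter rejectUnless(Predicate<Map<String, Object>> keep) {
        return doc -> keep.test(doc) ? doc : null;
    }

    /** Stand-in for an enriching plugin that adds a "lang" field. */
    static DocFilter addLang(String lang) {
        return doc -> {
            doc.put("lang", lang);
            return doc;
        };
    }

    /** True if a document detected as the given language survives the chain. */
    static boolean survives(String lang, boolean enrichFirst) {
        DocFilter enrich = addLang(lang);
        DocFilter reject = rejectUnless(d -> "en".equals(d.get("lang")));
        List<DocFilter> chain = enrichFirst
                ? Arrays.asList(enrich, reject)
                : Arrays.asList(reject, enrich);
        return runChain(chain, new HashMap<>()) != null;
    }

    public static void main(String[] args) {
        System.out.println(survives("en", true));  // true: enriched, then accepted
        System.out.println(survives("de", true));  // false: enriched, then rejected
        System.out.println(survives("en", false)); // false: "lang" not set yet when checked
    }
}
```

The last case is the ordering concern from point 2: if the rejecting filter runs before the enriching ones, it never sees the fields the expression refers to. In Nutch I would expect the order to be controlled through the indexingfilter.order property, but I have not verified that this is sufficient for this use case.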

> Allow LanguageIndexingFilter to actually filter documents by language.
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2414
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those languages that we intend to search in). At first glance it seems that this is done by LanguageIndexingFilter, but currently all the filter does is add the language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow it to only index languages specified in this property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
