nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter
Date Thu, 23 Apr 2015 21:40:38 GMT


Lewis John McGibbney commented on NUTCH-1985:

[~jorgelbg] +1 please commit against trunk :)

> Adding a main() method to the MimeTypeIndexingFilter
> ----------------------------------------------------
>                 Key: NUTCH-1985
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, metadata, plugin
>    Affects Versions: 1.10
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: features, patch, test
>             Fix For: 1.10
>         Attachments: NUTCH-1985.patch
> This makes it easy to test different rules files and check the expressions used
to filter content based on the detected MIME type. Until now, the only way to check this
was to run test crawls and inspect the data stored in Solr/Elasticsearch.
> This allows calling the class using the {{bin/nutch plugin}} command, something like:
> {{bin/nutch plugin mimetype-filter org.apache.nutch.indexer.filter.MimeTypeIndexingFilter}}
> Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to specify
a rules file to use. This makes it easy to experiment with different rules files until you get the
desired behavior.
> After invoking the class, enter one valid MIME type per line. The output
is the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether the
given MIME type is allowed or denied, respectively.
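
The interactive behavior described above (one MIME type per input line, echoed back with a {{+}} or {{-}} prefix) can be illustrated with a small standalone sketch. This is not the code from NUTCH-1985.patch and does not use any Nutch classes. The class name MimeRuleSketch, the hard-coded regular-expression rules, and the deny-by-default policy are all assumptions made for illustration; the real filter reads its rules from the file given with {{-rules}}, whose format is not shown in this message.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch only; this is NOT the Nutch MimeTypeIndexingFilter.
class MimeRuleSketch {

  // Hypothetical rules (pattern -> allowed?), checked in insertion order.
  // The real plugin loads its rules from a file instead.
  private static final Map<Pattern, Boolean> RULES = new LinkedHashMap<>();
  static {
    RULES.put(Pattern.compile("text/.*"), true);
    RULES.put(Pattern.compile("image/.*"), false);
  }

  // Assumed policy for MIME types that match no rule.
  private static final boolean DEFAULT_ALLOW = false;

  // Returns the MIME type prefixed with "+" (allowed) or "-" (denied).
  static String check(String mimeType) {
    String trimmed = mimeType.trim();
    for (Map.Entry<Pattern, Boolean> rule : RULES.entrySet()) {
      if (rule.getKey().matcher(trimmed).matches()) {
        return (rule.getValue() ? "+" : "-") + trimmed;
      }
    }
    return (DEFAULT_ALLOW ? "+" : "-") + trimmed;
  }

  // Reads one MIME type per line from standard input and prints the verdict.
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8));
    String line;
    while ((line = in.readLine()) != null) {
      if (line.trim().isEmpty()) {
        continue; // skip blank lines
      }
      System.out.println(check(line));
    }
  }
}
```

With Java 11 or later, the sketch can be tried without a separate compile step by saving it as MimeRuleSketch.java and piping MIME types into it, for example `echo text/html | java MimeRuleSketch.java`.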

This message was sent by Atlassian JIRA
