nutch-dev mailing list archives

From "Jorge Luis Betancourt Gonzalez (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1928) Indexing filter of documents by the MIME type
Date Fri, 30 Jan 2015 22:14:34 GMT
Jorge Luis Betancourt Gonzalez created NUTCH-1928:
-----------------------------------------------------

             Summary: Indexing filter of documents by the MIME type
                 Key: NUTCH-1928
                 URL: https://issues.apache.org/jira/browse/NUTCH-1928
             Project: Nutch
          Issue Type: Improvement
          Components: indexer, plugin
            Reporter: Jorge Luis Betancourt Gonzalez
             Fix For: 1.10


This adds an indexing filter that selects which documents get indexed based on the MIME type
property of the crawled content. It lets you restrict the MIME types of the content stored
in the Solr/Elasticsearch index without restricting the crawling/parsing process, so there
is no need to use the URLFilter plugin family for this. It also addresses one particular corner
case: some URLs, such as certain RSS feeds (http://www.awesomesite.com/feed), have no recognizable
format to filter on, so they end up in your index mixed with all your HTML content.

A configuration file can be specified via the {{mimetype.filter.file}} property in {{nutch-site.xml}}.
This file uses the same format as the {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}}
key is found, an {{allow all}} policy is used instead, so all your crawled documents will be
indexed.
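
As a rough illustration, the property could be set in {{nutch-site.xml}} along these lines. The property name comes from the description above. The file name {{mimetype-filter.txt}} and the description text are only placeholders assumed for this sketch.

{code:xml}
<!-- Sketch only: the value below is a hypothetical file name, not a confirmed default. -->
<property>
  <name>mimetype.filter.file</name>
  <value>mimetype-filter.txt</value>
  <description>
    Rules file for the MIME type indexing filter, written in the same
    format as the urlfilter-suffix plugin configuration.
  </description>
</property>
{code}

Leaving the property out entirely should fall back to the {{allow all}} policy described above.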



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
