nutch-dev mailing list archives

From "Luis Lopez (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2032) Plugin to index the raw content of a readable document.
Date Wed, 03 Jun 2015 19:01:39 GMT
Luis Lopez created NUTCH-2032:
---------------------------------

             Summary: Plugin to index the raw content of a readable document. 
                 Key: NUTCH-2032
                 URL: https://issues.apache.org/jira/browse/NUTCH-2032
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, parser
    Affects Versions: 1.10
            Reporter: Luis Lopez
             Fix For: 1.11


This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and 
https://issues.apache.org/jira/browse/NUTCH-1458

We created a couple of plugins to index the raw content of readable documents. When these
plugins are included in the plugin chain, Nutch indexes the raw content of readable documents
such as XML, HTML, CSV, and TXT files. The index-rawcontent plugin is not designed to index
binary files; however, having the full content of an HTML, XML, or CSV document is critical
for some of us.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
