nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2032) Plugin to index the raw content of a readable document.
Date Wed, 03 Jun 2015 21:38:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571694#comment-14571694 ]

Sebastian Nagel commented on NUTCH-2032:
----------------------------------------

Hi [~betolink], your solution/patch already adds the raw HTML to the parse metadata. That needs
no changes to the indexer (an advantage), but at the price of storing the raw content twice in
the segment. Since we can change the indexer, doesn't the solution from NUTCH-1785 seem favourable?

> Plugin to index the raw content of a readable document. 
> --------------------------------------------------------
>
>                 Key: NUTCH-2032
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2032
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: content, index, index-rawcontent, parser, raw
>             Fix For: 1.11
>
>
> This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and 
> https://issues.apache.org/jira/browse/NUTCH-1458
> We created a couple of plugins to index the raw content of readable documents. If we include
> these plugins in the plugin chain, we'll index the raw content of a readable document, e.g.
> XML, HTML, CSV or TXT. The index-rawcontent plugin is not designed to index binary files;
> however, having the full content of an HTML/XML or CSV document is really critical for some
> of us.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
