nutch-dev mailing list archives

From "Luis Lopez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2032) Plugin to index the raw content of a readable document.
Date Thu, 04 Jun 2015 17:06:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573148#comment-14573148 ]

Luis Lopez commented on NUTCH-2032:
-----------------------------------

Hi [~wastl-nagel], could you elaborate on what seems favourable? Yes, this will increase the
size of the segments, which is non-trivial. I think this plugin approach is less intrusive
to the current class signatures. It works well for our use case, in which we don't need
the segments once they are indexed.

> Plugin to index the raw content of a readable document. 
> --------------------------------------------------------
>
>                 Key: NUTCH-2032
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2032
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: content, index, index-rawcontent, parser, raw
>             Fix For: 1.11
>
>
> This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and 
> https://issues.apache.org/jira/browse/NUTCH-1458
> We created a couple of plugins to index the raw content of readable documents. If we include
> these plugins in the plugin chain, we'll index the raw content of a readable document, e.g.
> XML, HTML, CSV, TXT, etc. The index-rawcontent plugin is not designed to index binary files;
> however, having the full content of an HTML/XML or CSV document is really critical for some
> of us.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
