nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1785) Ability to index raw content
Date Tue, 27 May 2014 13:50:02 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1785:
---------------------------------

    Attachment: NUTCH-1785-trunk.patch

* the command-line option is -addBinaryContent
* the index field is binaryContent
* for Solr, the field is passed through stripNonCharCodePoint for obvious reasons
* added a -base64 option so users can index real binary content: not just plain (X)HTML but also images, MP3s, whatever

This seems to work fine now: no weird exceptions, not enabled by default (I had a wrong boolean),
and both the Base64 and non-Base64 modes work.
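
To illustrate the two modes described above, here is a minimal sketch of how a raw byte array might be turned into a field value. It is not taken from the patch: the class RawContentFieldSketch, the methods toFieldValue and isIndexable, the use of java.util.Base64, and the UTF-8 decoding are all assumptions made for illustration, and stripNonCharCodepoints here is a hypothetical stand-in for the helper named in the list.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative only: names and behavior are assumptions, not the patch's code.
class RawContentFieldSketch {

    // Hypothetical stand-in for the stripping helper: drops Unicode
    // non-characters and control characters that are not legal in XML 1.0.
    public static String stripNonCharCodepoints(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (isIndexable(cp)) {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static boolean isIndexable(int cp) {
        // U+FFFE/U+FFFF of every plane, plus the U+FDD0..U+FDEF block.
        boolean nonCharacter =
            (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);
        // Control characters other than tab, line feed and carriage return.
        boolean badControl = cp < 0x20 && cp != 0x09 && cp != 0x0A && cp != 0x0D;
        return !nonCharacter && !badControl;
    }

    // With base64 set, the bytes are encoded as-is so any binary payload
    // survives; otherwise they are decoded as text (UTF-8 is an assumption,
    // real content may use another charset) and then stripped.
    public static String toFieldValue(byte[] raw, boolean base64) {
        if (base64) {
            return Base64.getEncoder().encodeToString(raw);
        }
        return stripNonCharCodepoints(new String(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] html = "<p>ok\u0000\uFFFF</p>".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {0x00, 0x01, (byte) 0xFF};

        System.out.println(toFieldValue(html, false));  // prints <p>ok</p>
        System.out.println(toFieldValue(binary, true)); // prints AAH/
    }
}
```

Base64 is the safer choice for anything that is not text, since decoding arbitrary bytes as a string and stripping characters from it is lossy.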

> Ability to index raw content
> ----------------------------
>
>                 Key: NUTCH-1785
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1785
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured indexing
> back-end. Since Content is never read, a plugin is out of the question and therefore we need
> to force IndexJob to process Content as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
