nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1129) Any23 Nutch plugin
Date Mon, 28 Apr 2014 08:15:16 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982827#comment-13982827
] 

Sebastian Nagel commented on NUTCH-1129:
----------------------------------------

Hi [~lewismc], not yet. But I had a look at the patch. Looks good, in general! Some comments:
* The dependency on the any23 jar is also declared in ivy/ivy.xml. Is a global dependency required?
We recently had a discussion about that topic [@user|http://mail-archives.apache.org/mod_mbox/nutch-user/201404.mbox/%3C535615BA.3050601%40raytion.com%3E].
* All extracted triples are finally stored in one multi-valued field, each triple represented
as a string. That's not an optimal representation for at least two possible use cases (are there more?):
extracting and indexing key-value pairs as structured content (cf. [@dev|http://mail-archives.apache.org/mod_mbox/nutch-dev/201204.mbox/%3C4F8DEC5B.8070705%40googlemail.com%3E]),
and indexing into some triple store (as a new indexer back-end).
* Similarly: isn't there a more efficient way to pass triples from parse to the indexing filter
than tab-separated in one huge string? (There may be many triples in one document!)

The latter two points are by no means blockers. But we should think about evolving the
plugin to make it really usable.
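
To make the second and third points more concrete, here is a minimal sketch of what a more structured hand-over could look like. It is not taken from the patch: the class `TripleFields`, its nested `Triple` type, and the assumed line layout `subject<TAB>predicate<TAB>object` are all hypothetical and only serve as illustration. The idea is to split each serialized triple once and group the object values by predicate, so that every predicate could become its own index field instead of all triples sharing one field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the NUTCH-1129 patch. */
public class TripleFields {

    /** A single RDF statement; all three parts are kept as plain strings for brevity. */
    public static final class Triple {
        public final String subject;
        public final String predicate;
        public final String object;

        public Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /**
     * Parses one line in the ASSUMED layout "subject<TAB>predicate<TAB>object".
     * The limit of 3 keeps any tabs inside the object value intact.
     */
    public static Triple parse(String line) {
        String[] parts = line.split("\t", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-separated parts: " + line);
        }
        return new Triple(parts[0], parts[1], parts[2]);
    }

    /**
     * Groups object values by predicate, preserving first-seen predicate order.
     * Each map entry could then be added to the index as its own multi-valued field.
     */
    public static Map<String, List<String>> groupByPredicate(List<String> lines) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : lines) {
            Triple t = parse(line);
            fields.computeIfAbsent(t.predicate, k -> new ArrayList<>()).add(t.object);
        }
        return fields;
    }
}
```

With such a grouping, a document carrying one `dcterms:title` triple and two `dcterms:subject` triples would yield two fields, one with a single value and one with two. Whether predicates make good field names, and how subjects and blank nodes should be handled, are open design questions this sketch does not address.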

> Any23 Nutch plugin
> ------------------
>
>                 Key: NUTCH-1129
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1129
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.3, 1.9
>
>         Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin which extracts
> RDF data from HTTP and file resources. Although as of writing Any23 is not part of the ASF, the
> project is working towards integration into the Apache Incubator. Once the project proves
> its value, this would be an excellent addition to the Nutch 1.X codebase.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
