nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1537) Legacy metadata package needs to take advantage of Apache Tika metadata package more.
Date Fri, 01 Mar 2013 22:51:13 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591045#comment-13591045 ]

Sebastian Nagel commented on NUTCH-1537:
----------------------------------------

Removing code from the metadata package could be done in a few ways:
# let [o.a.nutch.metadata.Metadata|http://nutch.apache.org/apidocs-1.6/org/apache/nutch/metadata/Metadata.html]
implement all interfaces in [o.a.tika.metadata|http://tika.apache.org/1.3/api/org/apache/tika/metadata/package-summary.html]:
there are many, because providing metadata is what Tika is about. But Nutch is mostly used to fill
an index with content and a few meta fields (those most useful to the user). So do we really
need all those predefined meta fields? If some users want them, this can still be done in a
plugin.
# {{nutch.metadata.Metadata extends tika.Metadata implements nutch.metadata.Nutch}}: that
would mean replacing the Nutch implementation of the Metadata class with Tika's. Metadata
is frequently used in Nutch simply as a key-multiple-value store. A dependency on Tika may
cause trouble if Tika decides to change this class.
# keep nutch.metadata.Metadata and the classes holding the string constants related to crawling
(metadata.Nutch and HttpHeaders). References from plugins (e.g., feed or creativecommons) can
be removed if the plugins refer directly to tika-core (a minor drawback: each of these plugins will
then contain tika-core.jar).
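
To make the second option concrete, here is a rough sketch in plain Java. It is only an illustration under assumptions: {{TikaStyleMetadata}} and {{CrawlKeys}} are hypothetical stand-ins for tika.Metadata and nutch.metadata.Nutch, written so that the example compiles without Tika or Nutch on the classpath, and the key string is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Tika metadata class, reduced to the
// key-multiple-value behaviour discussed above. NOT the real Tika class.
class TikaStyleMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value to the key, keeping any earlier values. */
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values of the key with a single value. */
    public void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Returns the first value of the key, or null if there is none. */
    public String get(String name) {
        List<String> v = values.get(name);
        return (v == null || v.isEmpty()) ? null : v.get(0);
    }

    /** Returns all values of the key (empty array if there are none). */
    public String[] getValues(String name) {
        List<String> v = values.get(name);
        return v == null ? new String[0] : v.toArray(new String[0]);
    }
}

// Hypothetical stand-in for the crawler-specific string constants.
// The key string below is invented for this sketch.
interface CrawlKeys {
    String SEGMENT_NAME_KEY = "example.segment.name";
}

// Option 2: storage is inherited, only the crawler constants are added.
class CrawlMetadata extends TikaStyleMetadata implements CrawlKeys {
}

class MetadataSketch {
    public static void main(String[] args) {
        CrawlMetadata meta = new CrawlMetadata();
        meta.add("Content-Type", "text/html");
        meta.add("Content-Type", "text/html; charset=UTF-8");
        meta.set(CrawlKeys.SEGMENT_NAME_KEY, "20130301225113");

        System.out.println(String.join(", ", meta.getValues("Content-Type")));
        System.out.println(meta.get(CrawlKeys.SEGMENT_NAME_KEY));
    }
}
```

The subclass holds no storage of its own. That is the appeal of this option and also its risk: any change in the parent class changes the behaviour of the crawler's metadata store.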

These possibilities are not mutually exclusive, and there are surely more. I would vote
to keep the metadata package as legacy code but try to make it smaller and more crawler-specific
by removing the most obvious shared classes (@[~lewismc]: the amount of duplicated code is
striking).
                
> Legacy metadata package needs to take advantage of Apache Tika metadata package more.
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1537
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1537
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.6, 2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.7, 2.2
>
>
> In Nutch, classes from the metadata package are used in quite a number of places.
> The package does not currently reflect the work going on in Apache Tika, and we need to
> better leverage the vocabularies available to us through the dependency on Apache Tika.
> The introduction of TikaCoreProperties in Tika 1.2 is not currently leveraged in Nutch.
> This is just one example of an improved way for us to add metadata to Nutch documents.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
