nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1537) Legacy metadata package needs to take advantage of Apache Tika metadata package more.
Date Sat, 02 Mar 2013 22:55:13 GMT


Lewis John McGibbney commented on NUTCH-1537:

Yeah, there is a fair bit of duplication, which was actually the initial driver for me to improve
this aspect of Nutch. Over time we can work to reduce the duplication and certainly improve the code.
Regarding the above
1. I agree here. We don't need to implement everything from Apache Tika, and we would be
wasting our time if we tried. If/when one of us moves on, it becomes a pain for new and
existing developers to manage the code.
2. Well, it is not as if we are moving away from Apache Tika any time soon. There was a huge
effort to move the Tika stuff out of Nutch, which means that we have a direct dependency upon
the project. Though some can see this dependency as a limitation on the Nutch side, Tika is
making releases and the community seems to be in a healthy state, so I don't personally consider
this a limitation. If things change in Tika, then we change them in Nutch if and when we
can. Until then we make the best use of the code. I would not disagree with your suggestion on
this one.
3. I don't see the additional tika-core libraries as an issue here. If we use the code in
a more inclusive (dependency-rich) way, then I think overall it is better for Nutch.

Thanks for providing the explicit options above, Sebastian. I think for the time being we
should try to get consensus on which one(s) to progress with.
> Legacy metadata package needs to take advantage of Apache Tika metadata package more.
> -------------------------------------------------------------------------------------
>                 Key: NUTCH-1537
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.6, 2.1
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.7, 2.2
> In Nutch, classes from the metadata package are being used in quite a number of places.
> It is not currently being used to reflect the work going on in Apache Tika and we need to
> better leverage the vocabularies available to us from the dependency on Apache Tika.
> The introduction of TikaCoreProperties in Tika 1.2 is not currently leveraged in Nutch.
> This is just one example of an improved way for us to add metadata to Nutch documents.
