nutch-dev mailing list archives

From Dennis Kubes <ku...@apache.org>
Subject Re: Tika API
Date Wed, 07 Nov 2007 03:25:26 GMT


Chris Mattmann wrote:
> Hi Ned,
> 
>  Glad to see you're poking around with the Tika software and its use in
> Nutch. To start, you probably want to go to the website for Tika:
> 
>  http://incubator.apache.org/tika/
> 
>  On that website, you should see the links to the SVN repository. The
> version of Tika that was used was a version that I built the same day I
> committed the fix for NUTCH-562:
> 
>  http://issues.apache.org/jira/browse/NUTCH-562
> 
>  Which appears to be a version of Tika built on October 8th. The API for the
> mime framework has changed a bit since then (to its betterment), however, I
> neglected to upgrade the Nutch API because of the strong objection I
> received from Andrzej and input from Dennis Kubes regarding the use of the
> Tika API in Nutch. I stand by my email I sent in reply to the objections:
> 
>  http://www.nabble.com/forum/ViewPost.jtp?post=13142174&framed=y
> 
>  However, out of respect for the other committers, I neglected to make any
> updates to the Nutch use of the Tika API since I never heard back from
> anyone after my response.

I looked back at the messages. I tried to respond as best I could.  If I 
missed something I apologize.

My only concern was the documentation for Tika (of course Nutch has the 
same problem :)) as I figured we would have this situation here, where 
someone was asking questions about Tika and didn't know where to turn.
But since you and Sami were both committers for Tika and Nutch and are 
active here I thought it would be fine.

> 
>  That said, could you be a bit more specific Ned as to the exact problem
> you're having, e.g., "I tried visiting this site (URL here), the content
> type was (content/type here), and then it got into Content.java, and on line
> XXX it seems that the MimeType is getting set to null when it tries to...".
> With that info, I could probably help you quite a bit more. Also, depending
> upon how the rest of the Nutch committers want to handle the use of Tika
> (revert and remain stagnant, or use Tika and leverage the updates we're
> making to the Mime framework there), then we could come up with a strategy
> to help you out with the issue you're having.

The previous patches seem to work well; we have fetched well over 100M
pages without any problems.  I would say let's try to move things forward
if you feel the Tika code is ready.  Maybe I am missing something.  I
have not delved into Tika deeply, but if it is better we should use it.
Is it poised to break something?

Dennis

> 
> Thanks!
> 
> Cheers,
>   Chris
> 
> 
> 
> On 11/6/07 3:47 PM, "Ned Rockson" <ned@discoveryengine.com> wrote:
> 
>> I think there may be a bug in the Content.java when it tries to convert
>> the textual representation of the type to a MimeType.  It always returns
>> null.  I'm trying to fix it but I can't find an API for Tika (or even
>> src).  Can someone point me in the right direction?
>>
>> Thanks,
>> Ned
> 
> ______________________________________________
> Chris Mattmann, Ph.D.
> Chris.Mattmann@jpl.nasa.gov
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop:  171-246
> _______________________________________________________
> 
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
> 
> 
> 
