nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: Tika API
Date Wed, 07 Nov 2007 03:05:45 GMT
[..snip..]

>     return type.getName();
>   }
> 
> 
> The NPE was being thrown on the last line, so I did some tracing and
> found out that the call to MimeType.clean(typeName) [typeName <-
> "text/html"] worked fine, but the next line caused a problem.  The
> this.mimeTypes.getRepository.forName(cleanedMimeType) was returning
> null.  My problem was that I downloaded the trunk and it didn't have a
> MimeUtils anymore so I had no way to trace this.

Yes, this class was removed as part of NUTCH-562. Its usage was replaced
with the class of the same name within the Tika API, which is based on the
Nutch API for mime types.
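
To illustrate the failure mode described above, here is a minimal sketch of
a lookup that returns null when the repository has no definitions loaded.
The Registry class and the clean/resolve methods are hypothetical stand-ins
written for this example; they are not the Tika or Nutch API, and the real
clean() may behave differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MimeLookupSketch {

    // Hypothetical stand-in for a mime type repository (not the Tika class).
    static final class Registry {
        private final Map<String, String> types = new HashMap<>();

        void register(String name) {
            types.put(name, name);
        }

        // Returns null for names that were never registered, which is the
        // behavior reported in this thread when no definitions are loaded.
        String forName(String name) {
            return types.get(name);
        }
    }

    // Assumed cleaning step: drop parameters such as "; charset=UTF-8"
    // and lower-case what remains.
    static String clean(String typeName) {
        if (typeName == null) {
            return null;
        }
        int semi = typeName.indexOf(';');
        String base = semi >= 0 ? typeName.substring(0, semi) : typeName;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Null-safe lookup: returns the registered name, or null if unknown,
    // instead of dereferencing a null result.
    static String resolve(Registry registry, String typeName) {
        String cleaned = clean(typeName);
        return cleaned == null ? null : registry.forName(cleaned);
    }

    public static void main(String[] args) {
        Registry empty = new Registry();   // no definitions loaded
        Registry loaded = new Registry();
        loaded.register("text/html");

        System.out.println(resolve(empty, "text/html"));
        System.out.println(resolve(loaded, "Text/HTML; charset=UTF-8"));
    }
}
```

With an empty registry the cleaned name is fine but the lookup still comes
back null, so any caller that goes straight to getName() on the result will
throw.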

> 
> Anyway, after an hour or so of banging my head against the wall I
> realized the update to Nutch didn't have the correct .xml file
> describing mime types in the conf/ directory.  Thus, I unzipped the Tika
> jar, grabbed the .xml file and changed nutch-default.xml to point to
> that xml for mime types and it started working.

This is strange: as part of the patch for NUTCH-562, a file called
tika-mimetypes.xml was committed to the conf/ folder within the trunk.
Do you not have this file? The nutch-default.xml file within the conf/
folder in the nutch trunk points to tika-mimetypes.xml, so that should
have worked. I'm wondering if you had an old version of the conf/ directory
and neglected to run svn up on it?
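
A quick way to answer that question is to check whether the file is actually
present in the working copy. The sketch below assumes the conf/ directory is
passed as the first argument (defaulting to "conf" relative to the current
directory); the class and method names are made up for this example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConfFileCheck {

    // True if fileName exists as a regular file directly under confDir.
    static boolean hasMimeTypesFile(Path confDir, String fileName) {
        return Files.isRegularFile(confDir.resolve(fileName));
    }

    public static void main(String[] args) {
        // Assumption: run from the Nutch checkout root unless a path is given.
        Path confDir = Paths.get(args.length > 0 ? args[0] : "conf");
        String fileName = "tika-mimetypes.xml";

        if (hasMimeTypesFile(confDir, fileName)) {
            System.out.println("Found " + confDir.resolve(fileName));
        } else {
            System.out.println("Missing " + confDir.resolve(fileName)
                    + "; the conf/ directory may be out of date.");
        }
    }
}
```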

> 
> Sorry again for being so vague.  I'm not sure if I should submit a JIRA
> issue for this, but I'm happy to do so if anyone else has seen this issue.

No problem: let's discuss the JIRA issue once we get an answer to the above
questions.

Thanks for being more descriptive and looking forward to your response.

Cheers,
  Chris

> 
> Thanks,
> Ned
> 
> 
> Chris Mattmann wrote:
>> Hi Ned,
>> 
>>  Glad to see you're poking around with the Tika software and its use in
>> Nutch. To start, you probably want to go to the website for Tika:
>> 
>>  http://incubator.apache.org/tika/
>> 
>>  On that website, you should see the links to the SVN repository. The
>> version of Tika that was used was a version that I built the same day I
>> committed the fix for NUTCH-562:
>> 
>>  http://issues.apache.org/jira/browse/NUTCH-562
>> 
>>  Which appears to be a version of Tika built on October 8th. The API for the
>> mime framework has changed a bit since then (to its betterment), however, I
>> neglected to upgrade the Nutch API because of the strong objection I
>> received from Andrzej and input from Dennis Kubes regarding the use of the
>> Tika API in Nutch. I stand by my email I sent in reply to the objections:
>> 
>>  http://www.nabble.com/forum/ViewPost.jtp?post=13142174&framed=y
>> 
>>  However, out of respect for the other committers, I neglected to make any
>> updates to the Nutch use of the Tika API since I never heard back from
>> anyone after my response.
>> 
>>  That said, could you be a bit more specific Ned as to the exact problem
>> you're having, e.g., "I tried visiting this site (URL here), the content
>> type was (content/type here), and then it got into Content.java, and on line
>> XXX it seems that the MimeType is getting set to null when it tries to...".
>> With that info, I could probably help you quite a bit more. Also, depending
>> upon how the rest of the Nutch committers want to handle the use of Tika
>> (revert and remain stagnant, or use Tika and leverage the updates we're
>> making to the Mime framework there), then we could come up with a strategy
>> to help you out with the issue you're having.
>> 
>> Thanks!
>> 
>> Cheers,
>>   Chris
>> 
>> 
>> 
>> On 11/6/07 3:47 PM, "Ned Rockson" <ned@discoveryengine.com> wrote:
>> 
>>   
>>> I think there may be a bug in the Content.java when it tries to convert
>>> the textual representation of the type to a MimeType.  It always returns
>>> null.  I'm trying to fix it but I can't find an API for Tika (or even
>>> src).  Can someone point me in the right direction?
>>> 
>>> Thanks,
>>> Ned
>>>     
>> 
>> ______________________________________________
>> Chris Mattmann, Ph.D.
>> Chris.Mattmann@jpl.nasa.gov
>> _________________________________________________
>> Jet Propulsion Laboratory            Pasadena, CA
>> Office: 171-266B                     Mailstop:  171-246
>> _______________________________________________________
>> 
>> Disclaimer:  The opinions presented within are my own and do not reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>> 
>> 
>> 
>>   
> 

______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



