nutch-dev mailing list archives

From Ned Rockson <...@discoveryengine.com>
Subject Re: Tika API
Date Wed, 07 Nov 2007 19:13:09 GMT
That's strange - I did an svn up at the root of the nutch-trunk  
directory and merged all of the changes with my code base.  I must  
have missed the changes to the conf directory when merging, as I was  
only diffing the src directory.

--Ned
On Nov 6, 2007, at 7:05 PM, Chris Mattmann wrote:

> [..snip..]
>
>>     return type.getName();
>>   }
>>
>>
>> The NPE was being thrown on the last line, so I did some tracing and
>> found out that the call to MimeType.clean(typeName) [typeName <-
>> "text/html"] worked fine, but the next line caused a problem.  The
>> call to this.mimeTypes.getRepository.forName(cleanedMimeType) was
>> returning null.  My problem was that I downloaded the trunk and it
>> didn't have a MimeUtils anymore, so I had no way to trace this.
>
> Yes, this class was removed as part of NUTCH-562. Its usage was replaced
> with the class of the same name within the Tika API, which is based on the
> Nutch API for mime types.
>
>>
>> Anyway, after an hour or so of banging my head against the wall I
>> realized the update to Nutch didn't have the correct .xml file
>> describing mime types in the conf/ directory.  Thus, I unzipped the Tika
>> jar, grabbed the .xml file and changed nutch-default.xml to point to
>> that xml for mime types and it started working.
>
> This is strange: as part of the patch for NUTCH-562, a file called
> tika-mimetypes.xml was committed to the conf/ folder within the trunk.
> Do you not have this file? The nutch-default.xml file within the conf/
> folder in the nutch trunk points to tika-mimetypes.xml, so that should
> have worked. I'm wondering if you had an old version of the conf/
> directory and neglected to run svn up on it?
>
>>
>> Sorry again for being so vague.  I'm not sure if I should submit a JIRA
>> issue for this, but I'm happy to do so if anyone else has seen this issue.
>
> No problem: let's discuss the JIRA issue once we get an answer to  
> the above
> questions.
>
> Thanks for being more descriptive and looking forward to your  
> response.
>
> Cheers,
>   Chris
>
>>
>> Thanks,
>> Ned
>>
>>
>> Chris Mattmann wrote:
>>> Hi Ned,
>>>
>>>  Glad to see you're poking around with the Tika software and its  
>>> use in
>>> Nutch. To start, you probably want to go to the website for Tika:
>>>
>>>  http://incubator.apache.org/tika/
>>>
>>>  On that website, you should see the links to the SVN repository.  
>>> The
>>> version of Tika that was used was a version that I built the same  
>>> day I
>>> committed the fix for NUTCH-562:
>>>
>>>  http://issues.apache.org/jira/browse/NUTCH-562
>>>
>>>  Which appears to be a version of Tika built on October 8th. The  
>>> API for the
>>> mime framework has changed a bit since then (to its betterment),  
>>> however, I
>>> neglected to upgrade the Nutch API because of the strong objection I
>>> received from Andrzej and input from Dennis Kubes regarding the  
>>> use of the
>>> Tika API in Nutch. I stand by my email I sent in reply to the  
>>> objections:
>>>
>>>  http://www.nabble.com/forum/ViewPost.jtp?post=13142174&framed=y
>>>
>>>  However, out of respect for the other committers, I neglected to make
>>> any updates to the Nutch use of the Tika API since I never heard back
>>> from anyone after my response.
>>>
>>>  That said, could you be a bit more specific Ned as to the exact  
>>> problem
>>> you're having, e.g., "I tried visiting this site (URL here), the  
>>> content
>>> type was (content/type here), and then it got into Content.java,  
>>> and on line
>>> XXX it seems that the MimeType is getting set to null when it  
>>> tries to...".
>>> With that info, I could probably help you quite a bit more. Also,  
>>> depending
>>> upon how the rest of the Nutch committers want to handle the use  
>>> of Tika
>>> (revert and remain stagnant, or use Tika and leverage the updates  
>>> we're
>>> making to the Mime framework there), then we could come up with a  
>>> strategy
>>> to help you out with the issue you're having.
>>>
>>> Thanks!
>>>
>>> Cheers,
>>>   Chris
>>>
>>>
>>>
>>> On 11/6/07 3:47 PM, "Ned Rockson" <ned@discoveryengine.com> wrote:
>>>
>>>
>>>> I think there may be a bug in the Content.java when it tries to  
>>>> convert
>>>> the textual representation of the type to a MimeType.  It always  
>>>> returns
>>>> null.  I'm trying to fix it but I can't find an API for Tika (or  
>>>> even
>>>> src).  Can someone point me in the right direction?
>>>>
>>>> Thanks,
>>>> Ned
>>>>
>>>
>>> ______________________________________________
>>> Chris Mattmann, Ph.D.
>>> Chris.Mattmann@jpl.nasa.gov
>>> _________________________________________________
>>> Jet Propulsion Laboratory            Pasadena, CA
>>> Office: 171-266B                     Mailstop:  171-246
>>> _______________________________________________________
>>>
>>> Disclaimer:  The opinions presented within are my own and do not  
>>> reflect
>>> those of either NASA, JPL, or the California Institute of  
>>> Technology.
>>>
>>>
>>>
>>>
>>
>
> ______________________________________________
> Chris Mattmann, Ph.D.
> Chris.Mattmann@jpl.nasa.gov
> Cognizant Development Engineer
> Early Detection Research Network Project
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop:  171-246
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not  
> reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
>
>
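
The failure discussed in this thread (a lookup by MIME type name returning null, followed by an NPE on type.getName()) can be sketched in isolation. The class below is a hypothetical stand-in, not the Tika or Nutch API: the names MimeLookupSketch, loadRepository, resolveName and DEFAULT_TYPE are all assumptions made for illustration, and the empty map only models a repository whose XML definitions file was never found.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a MIME type registry; NOT the Tika or Nutch API.
class MimeLookupSketch {

    // Fallback used when a name is not registered (an assumption of this sketch).
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Simulates a repository populated from an XML definitions file.
    // An empty map models the case where that file was missing from conf/.
    static Map<String, String> loadRepository(boolean definitionsFilePresent) {
        Map<String, String> repo = new HashMap<>();
        if (definitionsFilePresent) {
            repo.put("text/html", "text/html");
            repo.put("text/plain", "text/plain");
        }
        return repo;
    }

    // Map.get returns null for an unknown name, much like the forName(...)
    // call described above. Checking for null before using the result
    // avoids the NullPointerException.
    static String resolveName(Map<String, String> repo, String cleanedName) {
        String type = repo.get(cleanedName);
        return (type != null) ? type : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        // Definitions present: the registered name is returned.
        System.out.println(resolveName(loadRepository(true), "text/html"));
        // Definitions missing: the lookup falls back instead of throwing.
        System.out.println(resolveName(loadRepository(false), "text/html"));
    }
}
```

A fallback like this only hides the symptom. The fix reported in the thread was to make sure the mime types .xml file that nutch-default.xml points to is actually present in conf/.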

