tika-dev mailing list archives

From Arturo Beltran <arturo.belt...@uji.es>
Subject Re: Getting started
Date Tue, 13 Jul 2010 10:28:15 GMT
Hi Chris and all,

On 07/07/2010 16:04, Mattmann, Chris A (388J) wrote:
> Hi Arturo,
>
> How exactly are you calling your parser? Are you using the AutoDetectParser? If so, can
> you put some print statements in the public void parse(...) method of CompositeParser?
> Specifically, add a line right after:
>    
I'm calling my parser through the bundled tika-app, so I believe I'm using 
AutoDetectParser.

>
> Parser parser = getParser(metadata);
> // print out the returned parser
> System.out.println("Parser returned is: ["+parser.getClass().getName()+"]");
>
> What does that return? Also, have you done the work to map your incoming document type
> in the tika-mimetypes.xml file?
Yes, sure.
> That is, if you're using AutoDetectParser or anything that extends CompositeParser,
> the mime type of the incoming document is used to determine which parser gets called. Is the
> mime type being detected appropriately? You can check this by putting a println right before
> getParser in the parse(...) method:
>    
Yes, it returns "application/shp"
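
For context, the glob entry in tika-mimetypes.xml is what ties the .shp suffix to application/shp. The following is only a minimal Java sketch of that idea using the JDK's own glob matcher. GlobTypeSketch, addGlob and detect are hypothetical names, not Tika classes, and Tika's real detection does more than suffix matching.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for glob-based type detection; not Tika's implementation.
class GlobTypeSketch {
    // Insertion-ordered so that earlier globs win over later ones.
    private final Map<PathMatcher, String> globs = new LinkedHashMap<>();

    // Register a glob such as "*.shp" for a MIME type such as "application/shp".
    void addGlob(String pattern, String mimeType) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        globs.put(matcher, mimeType);
    }

    // Return the MIME type of the first matching glob, or a generic default.
    String detect(String resourceName) {
        Path fileName = Paths.get(resourceName).getFileName();
        if (fileName != null) {
            for (Map.Entry<PathMatcher, String> entry : globs.entrySet()) {
                if (entry.getKey().matches(fileName)) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        GlobTypeSketch sketch = new GlobTypeSketch();
        sketch.addGlob("*.shp", "application/shp");
        System.out.println(sketch.detect("comarques250.shp"));
    }
}
```

A correct result here only shows that detection works. It says nothing about whether a parser is registered for the detected type.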
> // print the mime type
> System.out.println("The MIME type is: [" + metadata.get(Metadata.CONTENT_TYPE) + "]");
> Parser parser = getParser(metadata);
>
> What does that print out?
>
> Finally, if both of these printlns check out, you should check and make sure that your
> new parser is correctly mapped to the media type it supports, in other words what Ken said
> below. Does your parser declare that it supports your expected MIME type?
>    
Yes, I declared this MIME type in my parser, but the 
/getSupportedTypes(context)/ function is never called.
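
To illustrate why a parser whose supported types are never queried is also never selected, here is a simplified model of media-type dispatch. It is a sketch under stated assumptions, not CompositeParser's actual code: SimpleParser, Registry and both parser classes are hypothetical. In this model only parsers that were handed to the registry are asked for their supported types, so a class that merely sits in the JAR is invisible to the lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified model of media-type dispatch; every name here is hypothetical.
class DispatchSketch {

    interface SimpleParser {
        Set<String> getSupportedTypes();
        String name();
    }

    static class GeoSketchParser implements SimpleParser {
        public Set<String> getSupportedTypes() {
            return Collections.singleton("application/shp");
        }
        public String name() { return "geo"; }
    }

    static class FallbackParser implements SimpleParser {
        public Set<String> getSupportedTypes() { return Collections.emptySet(); }
        public String name() { return "fallback"; }
    }

    static class Registry {
        private final Map<String, SimpleParser> byType = new HashMap<>();
        private final SimpleParser fallback = new FallbackParser();

        // Supported types are read only here, when a parser is registered.
        void register(SimpleParser parser) {
            for (String type : parser.getSupportedTypes()) {
                byType.put(type, parser);
            }
        }

        // Look up the parser for a detected type; unknown types get the fallback.
        SimpleParser getParser(String mimeType) {
            SimpleParser parser = byType.get(mimeType);
            return parser != null ? parser : fallback;
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        System.out.println(registry.getParser("application/shp").name()); // prints "fallback"
        registry.register(new GeoSketchParser());
        System.out.println(registry.getParser("application/shp").name()); // prints "geo"
    }
}
```

In this model, the first lookup matches what I see with Tika: the type is detected correctly, yet the new parser is never consulted because nothing registered it.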

I uploaded an archive of the Tika source code that includes my modified 
/tika-mimetypes.xml/ file and my new parser /GeoParser.java/. Perhaps 
one of you could try it and find out where I went wrong.
Here is the link: http://elcano.dlsi.uji.es/arturo/tika_geo.zip
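
Ken's classpath question can also be checked without a debugger. As a rough sketch using only JDK calls, the helper below tries to load a class by name and reports which JAR or directory it came from. ClasspathCheck and locate are hypothetical names, and the fully qualified name of GeoParser is not stated in this thread, so it has to be passed in as an argument.

```java
import java.security.CodeSource;

// Hypothetical helper: reports where a class is loaded from, if it can be loaded at all.
class ClasspathCheck {

    // Returns the code source location of the class, or null if it is not on the classpath.
    static String locate(String className) {
        try {
            // initialize=false: load the class without running its static initializers.
            Class<?> type = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null ? "(no code source, e.g. a JDK class)" : source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java ClasspathCheck <fully.qualified.ClassName>");
            return;
        }
        String location = locate(args[0]);
        System.out.println(location == null
                ? args[0] + " is NOT on the classpath"
                : args[0] + " loaded from " + location);
    }
}
```

This has to run with the same classpath as the Tika application to be meaningful. Finding the class only proves it is loadable, not that Tika has been told to use it.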


Greetings and thanks in advance for your help,
      Arturo
> Let me know and thanks!
>
> Cheers,
> Chris
>
>
>
>
> On 7/7/10 4:25 AM, "Arturo Beltran"<arturo.beltran@uji.es>  wrote:
>
> Hi,
>
> I'm still with the same problem.
> I think everything is in place: I run "mvn install" and my new class is
> included in the generated JAR, but it is never called.
> It should be very simple. I feel a little silly. I don't know how to
> make Tika find my new parser.
>
> Thanks in advance
>        Arturo
>
>
> On 21/06/2010 19:04, Ken Krugler wrote:
>    
>> Are you sure your new parser is on the classpath?
>>
>> E.g. put a break on getSupportedTypes() and make sure that's getting
>> called - if not, then the parser isn't being "found" by Tika.
>>
>> -- Ken
>>
>> On Jun 21, 2010, at 3:34am, Arturo Beltran wrote:
>>
>>      
>>> Hi Ken,
>>>
>>> First of all, thanks for your quick response.
>>> This is exactly what I'm doing, but although Tika recognizes the
>>> new MIME type, my new parser is not called.
>>>
>>> I added to tika-mimetypes.xml:
>>>
>>> <mime-type type="application/shp">
>>> <!--sub-class-of type="application/octet-stream"/-->
>>> <glob pattern="*.shp"/>
>>> </mime-type>
>>>
>>> I created a new class GeoParser:
>>>
>>> public class GeoParser implements Parser {
>>>
>>>     private static final Set<MediaType>  SUPPORTED_TYPES =
>>> Collections.singleton(MediaType.application("shp"));
>>>     public static final String SHP_MIME_TYPE = "application/shp";
>>>
>>>     public Set<MediaType>  getSupportedTypes(ParseContext context) {
>>>         return SUPPORTED_TYPES;
>>>     }
>>>
>>>     public void parse(
>>>             InputStream stream, ContentHandler handler,
>>>             Metadata metadata, ParseContext context)
>>>             throws IOException, SAXException, TikaException {
>>>
>>>         metadata.set(Metadata.CONTENT_TYPE, SHP_MIME_TYPE);
>>>         metadata.set("Hello", "World");
>>>
>>>         System.out.println("HELLO WORLD");
>>>         System.err.println("ERR Hello world");
>>>
>>>         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler,
>>> metadata);
>>>         xhtml.startDocument();
>>>         xhtml.endDocument();
>>>     }
>>> ...
>>> }
>>>
>>> And that's the result:
>>>
>>> Content-Length:  755072
>>> Content-Type:  application/shp
>>> resourceName:  comarques250.shp
>>>
>>> I don't know what exactly is failing, but I can't make it work.
>>>
>>> Greetings and thanks in advance for your help.
>>>      Arturo
>>>
>>>
>>> On 17/06/2010 18:25, Ken Krugler wrote:
>>>        
>>>> Hi Arturo,
>>>>
>>>>          
>>>>> Some of you already know that I'm working on a new parser
>>>>> (https://issues.apache.org/jira/browse/TIKA-443). After spending all
>>>>> day trying to set up a workspace in Eclipse, I implemented the typical
>>>>> "hello world" class, in its Tika Parser version. My problem now is
>>>>> how to configure Tika to call my new parser when a file
>>>>> with a specific extension (e.g. *.shp) is found. I read something
>>>>> about a configuration file (tika-config.xml) but I couldn't find it
>>>>> in the source code.
>>>>>            
>>>> You first need to modify
>>>> tika-core/src/main/resources/tika-mimetypes.xml.
>>>>
>>>> E.g. something like this was done for mailbox files.
>>>>
>>>> <mime-type type="application/mbox">
>>>> <sub-class-of type="text/plain"/>
>>>> <glob pattern="*.mbox"/>
>>>> </mime-type>
>>>>
>>>> That maps the suffix to the mime-type.
>>>>
>>>> Then you define the SUPPORTED_TYPES static class field in your
>>>> parser class that defines what mime-types it supports.
>>>>
>>>> E.g. for MboxParser:
>>>>
>>>> public class MboxParser implements Parser {
>>>>
>>>>     private static final Set<MediaType>  SUPPORTED_TYPES =
>>>>         Collections.singleton(MediaType.application("mbox"));
>>>>
>>>>
>>>> -- Ken
>>>>
>>>> --------------------------------------------
>>>> <http://ken-blog.krugler.org>
>>>> +1 530-265-2225
>>>>
>>>> --------------------------------------------
>>>> Ken Krugler
>>>> +1 530-210-6378
>>>> http://bixolabs.com
>>>> e l a s t i c   w e b   m i n i n g
>>>
>>> --
>>> Arturo Beltran Fonollosa
>>> Institute of New Imaging Technologies (INIT): http://www.init.uji.es
>>> Geographic Information research group: http://www.geoinfo.uji.es
>>> Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
>>> E-12071, Castellón, Spain
>>> mailto: arturo.beltran@uji.es
>>>
>>>        
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>    


-- 
Arturo Beltran Fonollosa
Institute of New Imaging Technologies (INIT): http://www.init.uji.es
Geographic Information research group: http://www.geoinfo.uji.es
Universitat Jaume I, Avda. de Vicente Sos Baynat s/n
E-12071, Castellón, Spain
mailto: arturo.beltran@uji.es

