nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-766) Tika parser
Date Thu, 11 Feb 2010 11:40:28 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454 ]

Julien Nioche commented on NUTCH-766:
-------------------------------------

@Chris : I just did a fresh checkout from svn, applied the v3 patch, unzipped sample.tar.gz into
 the parse-tika directory and ran the test just as you did, but could not reproduce the problem.
 Could there be a difference between your version and the trunk?

@Sami :  

{quote} was there a reason not to use AutoDetect parser?  {quote} 
I suppose we could, as long as we give it a clue about the MimeType obtained from the Content.
 As you pointed out, there could be a duplication with the detection done by Mime-Util. I
suppose one way to do it would be to add a new version of the method: getParse(Content content,
MimeType type). That's an interesting point.
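
To make the idea concrete, here is a minimal sketch of such an overload. It uses only the JDK; the class, the detect() helper and the string-based types are hypothetical stand-ins, not the Nutch or Tika API. The point is only that a caller which already knows the type can pass it in and skip a second detection.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a parser entry point; not the Nutch or Tika API.
class TypeHintSketch {
    // Counts how often the (pretend) detection step runs.
    static int detections = 0;

    // Stand-in for content sniffing: looks at the first bytes only.
    static String detect(byte[] data) {
        detections++;
        int n = Math.min(data.length, 5);
        String head = new String(data, 0, n, StandardCharsets.US_ASCII);
        return head.startsWith("%PDF") ? "application/pdf" : "text/plain";
    }

    // Existing-style entry point: no hint available, so detection has to run.
    static String getParse(byte[] data) {
        return getParse(data, null);
    }

    // Proposed-style overload: a non-null type hint skips detection entirely.
    static String getParse(byte[] data, String typeHint) {
        String type = (typeHint != null) ? typeHint : detect(data);
        return "parsed as " + type;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(getParse(pdf));              // detection runs
        System.out.println(getParse(pdf, "text/html")); // hint is trusted as given
        System.out.println("detections: " + detections);
    }
}
```

Whether the hint should be trusted blindly or only used to narrow the detection is a separate design question that this sketch does not settle.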

{quote} Also was there a reason not to parse html with tika?  {quote} 
It is supposed to do so; if it does not, then it's a bug which needs urgent fixing.

Regarding parsing package formats, I think the plan is that Tika will handle that in the future,
but we could try to do that now if we find a relatively clean mechanism for doing so. BTW,
could you please send a diff rather than the full code of the class you posted earlier? That would
make the comparison much easier.




> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz,
TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them nicely via
SAX events. What is described here is a tika-parser plugin which delegates the parsing to
Tika but can still coexist with the existing parsing plugins, which is useful for formats
only partially handled by Tika (or not at all). Some of the elements below have already been discussed
on the mailing lists. Note that this is work in progress; your feedback is welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as different
jar files (core and parsers); in the work described here we decided to put the libs in 2 different
places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since the core uses Tika only for its MimeType functionality, we only need to put
tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar
plus all the jars used internally by Tika.
> Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig
class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring
the mimetype part of Nutch using extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why we are using
"*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so
that it considers the tika parser as potentially suitable for all mime-types. In practice
this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml
are useful only for the cases where we want to handle a mime type with a different parser
than Tika. 
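
The selection rule described above can be sketched roughly as follows. This is not the actual ParserFactory code; the class and method names are hypothetical, and the plugin ids are used only as illustrative labels. An explicit association wins, and the parser registered under "*" is the fallback for every other mime type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the lookup rule; not the real ParserFactory.
class ParserSelectionSketch {
    static final String WILDCARD = "*";

    // mime type -> parser plugin id, as parse-plugins.xml would associate them
    private final Map<String, String> mimeToParser = new LinkedHashMap<>();

    void register(String mimeType, String parserId) {
        mimeToParser.put(mimeType, parserId);
    }

    // An explicit association takes precedence; otherwise fall back to the
    // wildcard parser. Returns null if neither is registered.
    String select(String mimeType) {
        String explicit = mimeToParser.get(mimeType);
        return (explicit != null) ? explicit : mimeToParser.get(WILDCARD);
    }

    public static void main(String[] args) {
        ParserSelectionSketch factory = new ParserSelectionSketch();
        factory.register(WILDCARD, "parse-tika");         // catch-all
        factory.register("application/pdf", "parse-pdf"); // illustrative override
        System.out.println(factory.select("application/pdf"));
        System.out.println(factory.select("text/html"));
    }
}
```

With this rule, an entry in parse-plugins.xml is only needed for a mime type that should not go to Tika.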
> The general approach I chose was to convert the SAX events returned by the Tika parsers
into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link
detection and metatag handling. This also means that we can use the HTMLParseFilters in exactly
the same way. The main difference, though, is that HTMLParseFilters are not limited to HTML
documents anymore, as the XHTML tags returned by Tika can correspond to a different format
for the original document. There is a duplication of code with the html-plugin which will
be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar
and making the tika parser depend on it.
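
As an illustration of the SAX-to-DOM step, here is a minimal sketch using only JDK classes. It is not the code from the patch: a plain XML parser stands in for the Tika parser as the source of SAX events, and the class and method names are hypothetical. An identity TransformerHandler is used as the ContentHandler and builds a DOM tree, from which the links are then collected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical sketch; the SAX source here is an XML parser, not Tika.
class SaxToDomSketch {
    // Feeds SAX events into an identity TransformerHandler that fills a DOM Document.
    static Document toDom(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new DOMResult(doc));

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler); // a Tika parser would emit its events here instead
        reader.parse(new InputSource(new StringReader(xhtml)));
        return doc;
    }

    // Link detection on the DOM: collect the href of every <a> element.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>See <a href=\"http://example.org/a\">this</a>"
                + " and <a name=\"x\">that</a>.</p></body></html>";
        System.out.println(extractLinks(toDom(xhtml)));
    }
}
```

Because the downstream code only sees a DOM tree of XHTML elements, it does not need to know which format the original document was in.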
> The following libraries are required in the lib/ directory of the tika-parser : 
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have a look at
each individual format and check that it is covered by Tika and, if so, to the same extent;
the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter)
seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a first step.

> Julien
> http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

