nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "TikaPlugin" by JulienNioche
Date Mon, 11 Jan 2010 16:34:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "TikaPlugin" page has been changed by JulienNioche.


  = Tika Plugin =
  The Tika plugin is a first attempt at
delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch.
This page will list the differences in coverage or functionality between the Tika plugin and
the existing Nutch parsers. Tika also supports more formats not covered by Nutch, which are not
described here, and has a more generic capability of representing structured content, which
can be useful for HtmlParseFilters (which are currently limited to HTML content).
- '''html''': ?
+ '''html''': comparable
  '''js''': ?
@@ -21, +21 @@

  '''rss''': ?
- '''rtf''': comparable
+ '''rtf''': deactivated in Nutch for licensing reasons | works in Tika
  '''swf''' : not yet covered in Tika (see
