nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (NUTCH-562) Port mime type framework to use Tika mime detection framework
Date Tue, 09 Oct 2007 00:24:50 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-562.
-------------------------------------

    Resolution: Fixed

- Applied the patch, with minor changes to use the static version of the Tika MimeUtils
interface and to instantiate it only once per object family
- Tested on a small crawl of apache.org sites; MIME types were set appropriately

> Port mime type framework to use Tika mime detection framework
> -------------------------------------------------------------
>
>                 Key: NUTCH-562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-562
>             Project: Nutch
>          Issue Type: Improvement
>          Components: mime_type_detector
>    Affects Versions: 1.0.0
>         Environment: MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS X 10.4
>                      (although the improvement is independent of the environment)
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-562.Mattmann.patch.txt, tika-0.1-dev.jar
>
>
> With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release candidate,
> I think it would be a good time to patch Nutch to use Tika's mime detection system (an
> improvement over the existing Nutch one, written primarily by Jerome). Tika's mime system
> is based on the mime system from Freedesktop.org and includes several improvements over
> the existing Nutch mime system, such as:
> 1. reliable XML-based content detection (a clear issue plaguing Nutch for some time now),
>    with the ability to distinguish between RSS, generic XML, Atom, etc.
> 2. mime magic pattern matching, including support for multiple patterns
> 3. glob pattern matching, with support for more than one pattern
> I'll put together a patch and then attach it to the list once it's relatively stable.
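
The three detection strategies listed in the issue description (XML root-element
inspection, magic patterns, and glob patterns) can be sketched roughly as follows.
This is a minimal, self-contained Java illustration and not Tika's or Nutch's actual
code: the class MimeSketch, its detect method, the pattern tables, and the order in
which the strategies are tried are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch only; names and tables are illustrative, not Tika's API. */
final class MimeSketch {
    // Glob patterns on the file name; one type may own several globs.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Magic patterns: byte sequences expected at the start of the content.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // XML root element names that identify a more specific type.
    private static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();

    static {
        GLOBS.put("*.html", "text/html");
        GLOBS.put("*.htm", "text/html");
        GLOBS.put("*.pdf", "application/pdf");
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("<!DOCTYPE html", "text/html");
        MAGIC.put("<html", "text/html");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
    }

    // First start tag, ignoring an optional namespace prefix. Declarations such as
    // "<?xml" and "<!--" are skipped because they do not start with a name character.
    // Simplification: a tag-like string inside a leading comment would fool this.
    private static final Pattern ROOT =
            Pattern.compile("<(?:[\\w.-]+:)?([A-Za-z_][\\w.-]*)");

    private MimeSketch() {}

    /** Tries XML root detection, then magic patterns, then glob patterns. */
    static String detect(String name, byte[] content) {
        // Only the first bytes are inspected; ISO-8859-1 maps bytes 1:1 to chars.
        String head = new String(content, 0, Math.min(content.length, 512),
                StandardCharsets.ISO_8859_1);

        if (head.trim().startsWith("<?xml")) {
            Matcher m = ROOT.matcher(head);
            if (m.find()) {
                return XML_ROOTS.getOrDefault(m.group(1), "application/xml");
            }
            return "application/xml";
        }
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (globMatches(e.getKey(), name)) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }

    /** Translates a simple glob ('*' and '?') to a regex and matches the name. */
    static boolean globMatches(String glob, String name) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), name.toLowerCase(Locale.ROOT));
    }
}
```

In this sketch, a document that begins with an XML declaration is classified by its
root element, so an RSS feed and an Atom feed served under the same ".xml" name come
out as different types, while unknown roots fall back to application/xml. Content
without an XML declaration is checked against the magic table, and the file-name globs
are consulted only when the content gives no answer.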

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

