tika-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-466) Feed Parser
Date Fri, 16 Jul 2010 11:22:51 GMT
Feed Parser

                 Key: TIKA-466
                 URL: https://issues.apache.org/jira/browse/TIKA-466
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Julien Nioche
            Priority: Minor
         Attachments: TIKA-466.patch

We currently have no parser for feeds in Tika, and since we are progressively getting rid
of our legacy parsers in Nutch, I thought it would make sense to have one.

The attached patch is based on the ROME feed parser (https://rome.dev.java.net/), which is
under the Apache License. ROME provides a unified API for different feed formats and seems well

The implementation of the FeedParser is by no means complete but should serve as a basis for
further improvements. It currently extracts the title and description from the feed, stores
them in the metadata, and uses the following XHTML representation for the entries:


This is pretty basic but should at least allow us to retrieve the outlinks in Nutch as well
as some text. 
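
To illustrate the idea (feed title and description into metadata, entry links
as outlinks), here is a minimal sketch. It is not the attached patch: it uses
the JDK's built-in DOM parser instead of ROME so that it runs without extra
dependencies, it handles only RSS 2.0, and the names FeedSketch, Result,
and the "title"/"description" keys are hypothetical placeholders rather than
Tika's Metadata API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical sketch: plain JDK DOM parsing of an RSS 2.0 feed (not ROME, not Tika).
final class FeedSketch {

    /** Hypothetical holder for feed-level metadata and the links of the entries. */
    static final class Result {
        final Map<String, String> metadata = new LinkedHashMap<>();
        final List<String> outlinks = new ArrayList<>();
    }

    /** Parses an RSS 2.0 document given as a string; untrusted input would need XML hardening. */
    static Result parse(String rssXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }

        Result result = new Result();
        Element channel = (Element) doc.getElementsByTagName("channel").item(0);
        if (channel == null) {
            return result; // not an RSS 2.0 feed; nothing to extract
        }

        // Feed-level title and description go into the metadata map.
        result.metadata.put("title", childText(channel, "title"));
        result.metadata.put("description", childText(channel, "description"));

        // Each <item> contributes its <link> as an outlink.
        NodeList items = channel.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            String link = childText((Element) items.item(i), "link");
            if (!link.isEmpty()) {
                result.outlinks.add(link);
            }
        }
        return result;
    }

    /** Returns the trimmed text of the first direct child element with the given name, or "". */
    private static String childText(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }
}
```

Looking only at direct children in childText keeps the feed title from being
confused with the titles of individual entries. A ROME-based parser would get
the same fields through ROME's unified API and would also cover Atom and the
other RSS variants.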


This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
