tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-466) Feed Parser
Date Fri, 16 Jul 2010 11:22:51 GMT
Feed Parser
-----------

                 Key: TIKA-466
                 URL: https://issues.apache.org/jira/browse/TIKA-466
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Julien Nioche
            Priority: Minor
         Attachments: TIKA-466.patch

We currently have no parsers for feeds in Tika and since we are progressively getting rid
of our legacy parsers in Nutch I thought it could make sense to have one.

The patch attached is based on the ROME feed parser (https://rome.dev.java.net/) which is
under Apache License. Rome provides a unified API for different feed formats and seems well
maintained.

The implementation of the FeedParser is by no means complete but should serve as a basis for
further improvements. It currently stores the title and description from the feed and stores
them in the metadata and uses the following XHTML representation for the entries : 

<A href="ENTRY_URL">ENTRY_TITLE</A>
<P>
ENTRY_DESCRIPTION
</P> 

This is pretty basic but should at least allow us to retrieve the outlinks in Nutch as well
as some text. 

J. 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message