tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Assigned: (TIKA-466) Feed Parser
Date Fri, 16 Jul 2010 15:39:51 GMT

     [ https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-466:

    Assignee: Chris A. Mattmann

> Feed Parser
> -----------
>                 Key: TIKA-466
>                 URL: https://issues.apache.org/jira/browse/TIKA-466
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-466.patch
> We currently have no parsers for feeds in Tika, and since we are progressively getting
> rid of our legacy parsers in Nutch, I thought it would make sense to have one.
> The attached patch is based on the ROME feed parser (https://rome.dev.java.net/), which
> is under the Apache License. ROME provides a unified API for different feed formats and
> seems well maintained.
> The implementation of the FeedParser is by no means complete but should serve as a basis
> for further improvements. It currently extracts the title and description from the feed,
> stores them in the metadata, and uses the following XHTML representation for the entries:

> <P>
> </P> 
> This is pretty basic but should at least allow us to retrieve the outlinks in Nutch as
> well as some text.
> J. 
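
The approach described above (feed title and description go into metadata, and each entry becomes a small XHTML fragment with a link that can serve as an outlink) can be sketched roughly as follows. This is only an illustration and not the code in TIKA-466.patch: it uses Python's standard-library XML parser instead of ROME, handles RSS 2.0 only, and the function name parse_feed, the sample feed, and the exact markup emitted per entry are all assumptions, since the XHTML shown in the original message did not survive in this archive.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical sample feed, used only to exercise the sketch.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <description>A hypothetical feed</description>
    <item>
      <title>First entry</title>
      <link>http://example.com/1</link>
      <description>Summary of the first entry</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return (metadata, xhtml) for an RSS 2.0 document.

    metadata holds the feed-level title and description; xhtml holds one
    linked paragraph plus one summary paragraph per entry (assumed layout).
    """
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")

    # Feed-level fields go into the metadata, as the description states.
    metadata = {
        "title": channel.findtext("title", default=""),
        "description": channel.findtext("description", default=""),
    }

    # Each entry becomes a link (usable as an outlink) followed by its text.
    parts = []
    for item in channel.findall("item"):
        link = item.findtext("link", default="")
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        parts.append(
            "<p><a href=%s>%s</a></p>\n<p>%s</p>"
            % (quoteattr(link), escape(title), escape(summary))
        )
    return metadata, "\n".join(parts)
```

Calling parse_feed(SAMPLE_RSS) returns the metadata dictionary and the XHTML string. A crawler such as Nutch could then collect the href values as outlinks and index the paragraph text.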

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
