tika-dev mailing list archives

From "Rida Benjelloun" <rida.benjell...@doculibre.com>
Subject Re: Parser roadmap
Date Sat, 06 Oct 2007 18:00:31 GMT
Hi Jukka,
I totally agree with the parser roadmap. Thanks for the good work. I also
agree with replacing the Content class with a Metadata class. However, the
Metadata class should not be limited to a single metadata standard such as
Dublin Core. I think it should be extensible or generic, so that it can
support multiple metadata standards.
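
As a rough illustration of what "generic" could mean here, the sketch below
keys every value by a namespace prefix plus a property name, so Dublin Core
and any other standard can live side by side in one object. The class name
GenericMetadata, the prefixes "dc" and "xmp", and the add/get method names are
all assumptions made up for this example; this is not the Metadata class
proposed in the roadmap and not an existing Tika API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: not Tika's actual Metadata class.
class GenericMetadata {

    // Keys have the form "namespace:name" (e.g. "dc:title").
    // Each key may hold several values, in insertion order.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    private static String key(String namespace, String name) {
        return namespace + ":" + name;
    }

    // Appends a value for the given property of the given standard.
    void add(String namespace, String name, String value) {
        values.computeIfAbsent(key(namespace, name), k -> new ArrayList<>())
              .add(value);
    }

    // Returns all values for the property, or an empty list if none are set.
    List<String> get(String namespace, String name) {
        List<String> found = values.get(key(namespace, name));
        if (found == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    public static void main(String[] args) {
        GenericMetadata metadata = new GenericMetadata();
        // Dublin Core properties (prefix "dc" is an assumed convention).
        metadata.add("dc", "title", "Parser roadmap");
        metadata.add("dc", "creator", "Jukka");
        metadata.add("dc", "creator", "Chris");
        // A property from a second, unrelated standard.
        metadata.add("xmp", "CreatorTool", "ExampleWriter");

        System.out.println(metadata.get("dc", "title"));
        System.out.println(metadata.get("dc", "creator"));
        System.out.println(metadata.get("xmp", "CreatorTool"));
    }
}
```

With this shape, supporting another standard only means agreeing on a new
prefix; the class itself does not need to change.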

Regards.

On 10/5/07, Chris Mattmann <chris.mattmann@jpl.nasa.gov> wrote:
>
> Hi Jukka,
>
> > Once TIKA-43 is committed (I'm giving it a day or two for reviews and
> > comments) there are still two Parser related changes that I'd like to
> > do before I think we're ready to do the first 0.1 release.
>
> +1, agreed. At present, we've worked through 30 JIRA issues so far (great
> job guys!), and I think that the library is reaching stability and is
> primed for an official release.
>
> I'll put my name out there as someone available to be the release master
> when the time comes. I've done it on Nutch before and wouldn't mind doing
> it for Tika. Just let me know if you guys agree.
>
> >
> > First, I'd like to replace the current Iterable<Content> construct
> > with a Metadata object that allows metadata to be passed in and out of
> > the parser. Also, this Metadata object should be decoupled from parser
> > configuration.
>
> I completely agree. I'd like to help with this issue as the Metadata
> framework is very near and dear to my heart. What does the interface that
> you are proposing for it look like again? Something like:
>
> String parse(InputStream stream, Metadata metadata)
>              throws IOException, TikaException;
>
>
> >
> > Second, instead of returning the text content of a document as a
> > String, I'd like the parsers to generate SAX events with the text
> > content passed as characters() events.
>
> Then, the next evolutionary step would be:
>
> SAXEvent parse(InputStream stream, Metadata metadata)
>             throws IOException, TikaException;
>
> ?
>
> >
> > Unless anyone objects (feel free to do so if you have better design
> > ideas!), I'll follow up with new patches for these two issues in the
> > next week or two. Once these changes are done, I think we're good to
> > go for the first Tika release. Such a timing would also be perfect for
> > the upcoming ApacheCon US conference. :-)
>
> Totally agree! Great job so far: I am really starting to like this new
> Parsing interface...
>
> Cheers,
>   Chris
>
> >
> > BR,
> >
> > Jukka Zitting
>
> ______________________________________________
> Chris Mattmann, Ph.D.
> Chris.Mattmann@jpl.nasa.gov
> Cognizant Development Engineer
> Early Detection Research Network Project
>
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop:  171-246
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
>
>


-- 
---------------------------------------------------------
Rida Benjelloun
Doculibre inc.
ridabenjelloun@apache.org
rida.benjelloun@doculibre.com
Cel: 418-262-3222
Tel: 418-353-3390
Site Web : http://www.doculibre.com
---------------------------------------------------------
