tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Tika discussions in Amsterdam
Date Tue, 08 May 2007 09:17:58 GMT
Hi,

On 5/3/07, Rida Benjelloun <rida.benjelloun@doculibre.com> wrote:
> Lius is currently under apache licence.  If people are interested on it we
> can use it as starting point for the development of tika.

I think that would be great. At ApacheCon we discussed that
selecting a single existing codebase as the starting point would be
the quickest way to bootstrap our efforts, and Lius and the Nutch
parsers are probably the best candidates for this.

The only downside is that it might cause trouble later on when we
want to refactor things to be more general. For Lius the main
problem is tight integration with Lucene. For example, the
lius.index.Indexer class imports the following:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

Ideally the Tika toolkit should have no compile-time dependencies on
Lucene. Do you think it would be feasible to refactor the Lius classes
to avoid the Lucene dependencies?
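
One possible shape for such a refactoring, as a minimal sketch only:
the extraction code returns plain Java collections, and a separate
indexing module (not shown) would map them onto Lucene Document and
Field objects. The names ExtractionSketch, ContentExtractor,
PlainTextExtractor and extractPlainText are hypothetical and are not
existing Lius or Tika classes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractionSketch {

    /** Hypothetical Lucene-free contract: field names mapped to extracted text. */
    interface ContentExtractor {
        Map<String, String> extract(Reader input) throws IOException;
    }

    /** Trivial extractor: the first line becomes the title, all lines the content. */
    static class PlainTextExtractor implements ContentExtractor {
        @Override
        public Map<String, String> extract(Reader input) throws IOException {
            BufferedReader reader = new BufferedReader(input);
            StringBuilder content = new StringBuilder();
            String title = null;
            String line;
            while ((line = reader.readLine()) != null) {
                if (title == null) {
                    title = line;          // remember the first line
                } else {
                    content.append('\n');  // separate subsequent lines
                }
                content.append(line);
            }
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("title", title == null ? "" : title);
            fields.put("content", content.toString());
            return fields;
        }
    }

    /** Convenience wrapper for in-memory strings; no Lucene types anywhere. */
    static Map<String, String> extractPlainText(String text) {
        try {
            return new PlainTextExtractor().extract(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractPlainText("Tika notes\nFirst paragraph."));
    }
}
```

With a boundary like this, only the code that builds the index would
need Lucene on its compile-time classpath.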

> Structured text: lius use JDOM, XPATH and namespaces for the extraction of
> structured contents.

Could you describe this in more detail? What does the XML content
model look like? I could just look at the source, but it's more
productive if we discuss the design on the mailing list.

> Sax could be more powerful but does not offer  XPATH for the extraction
> of contents.

It's possible to transform a SAX stream into a DOM tree for easy XPath
access, so I don't think we lose any functionality by choosing SAX over
a DOM model. In fact it is even possible to evaluate XPath expressions
against a live SAX stream; you just won't get full DOM nodes as the
results.
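
The first approach (SAX events collected into a DOM tree, then queried
with XPath) can be sketched with the standard JAXP classes alone. This
is an illustration under assumptions, not Lius or Tika code: the class
name SaxToDomXPath, the sample document and the /doc/title expression
are made up, and the cast to SAXTransformerFactory assumes the JAXP
implementation in use supports the SAX feature, as the JDK default
does.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDomXPath {

    /** Feeds SAX events into an identity transformer that builds a DOM tree. */
    static Node saxToDom(String xml) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();

        // The identity TransformerHandler is itself a SAX ContentHandler.
        SAXTransformerFactory transformerFactory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = transformerFactory.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);

        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return result.getNode();
    }

    /** Evaluates an XPath expression against the DOM built from the SAX stream. */
    static String evaluate(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, saxToDom(xml));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Hello</title><body>World</body></doc>";
        System.out.println(evaluate(xml, "/doc/title"));
    }
}
```

Any ContentHandler-based parser could feed the same handler, so the
XPath step does not care where the SAX events come from.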

> If you have questions about Lius do not hesitate to communicate with
> me. The source code is available: http://sourceforge.net/projects/lius/

Why does the Lius svn contain the class files instead of the java files?

BR,

Jukka Zitting
