tika-dev mailing list archives

From "Rida Benjelloun" <rida.benjell...@doculibre.com>
Subject Re: Tika discussions in Amsterdam
Date Sun, 13 May 2007 23:39:44 GMT
Hi,
My answers are inline below.

On 5/8/07, Jukka Zitting <jukka.zitting@gmail.com> wrote:
>
> Hi,
>
> On 5/3/07, Rida Benjelloun <rida.benjelloun@doculibre.com> wrote:
> > Lius is currently under the Apache license. If people are interested in it, we
> > can use it as a starting point for the development of Tika.
>
> I think that would be great. We discussed in the ApacheCon that
> selecting a single existing codebase as the starting point would be
> the quickest way to bootstrap our efforts, and Lius and the Nutch
> parsers are probably the best candidates for this.
>
> The only downside in doing that is that it might cause trouble later
> on when we want to refactor things to be more general. For Lius the
> main problem is tight integration with Lucene. For example the
> lius.index.Indexer class imports the following:
>
>     import org.apache.lucene.document.Document;
>     import org.apache.lucene.document.Field;
>     import org.apache.lucene.store.Directory;
>     import org.apache.lucene.store.RAMDirectory;
>
> Optimally the Tika toolkit should have no compile-time dependencies to
> Lucene. Do you think it would be feasible to refactor the Lius classes
> to avoid the Lucene dependencies?


These dependencies can easily be removed. The Indexer class will be renamed
to a Parser class and will return the document content, metadata, etc. Please
see the interface suggested at the end of this email.

> > Structured text: Lius uses JDOM, XPath and namespaces for the extraction of
> > structured contents.
>
> Could you describe this in more detail. What does the XML content
> model look like? I could just look at the source, but it's more
> productive if we discuss the design on the mailing list.


The XML parser uses namespaces and XPath to extract the content. If your document
is a Dublin Core (dc) document, the dc properties defined in the liusconfig
XML file will be used to extract the content. There are two types of extraction:
document extraction and XML node extraction.
XML document extraction allows you to store one XML document as one Lucene
document. Node indexing allows you to store a section of a document as a Lucene
document. This can be useful, for example, if you want to index an RSS
document and store each news item as a separate entry in the index.
The namespace determines which set of properties is applied to an XML document.
Example of liusconfig (luceneField can be renamed to field, and all
Lucene-specific properties can be removed):
        <!-- Dublin Core extraction properties -->
        <xmlFile ns="http://purl.org/dc/elements/1.1/" setBoost="2.0">
            <indexer class="lius.index.xml.XmlFileIndexer">
                <mime>text/xml</mime>
            </indexer>
            <fields>
                <luceneField name="title" xpathSelect="//dc:title" type="Text" setBoost="0.1"/>
                <luceneField name="subject" xpathSelect="//dc:subject" type="Keyword" setBoost="2.0"/>
                <luceneField name="creator" xpathSelect="//dc:creator" type="Text"/>
            </fields>
        </xmlFile>
        <!-- ETDMS extraction properties -->
        <xmlFile ns="http://www.ndltd.org/standards/metadata/etdms/1.0/">
            <indexer class="lius.index.xml.XmlFileIndexer">
                <mime>text/xml</mime>
            </indexer>
            <fields>
                <luceneField name="title" xpathSelect="//etdms:title" type="Text"/>
                <luceneField name="subject" xpathSelect="//etdms:subject" type="Keyword"/>
                <luceneField name="creator" xpathSelect="//etdms:creator" type="Text"/>
                <luceneField name="description" xpathSelect="//etdms:description" type="Text"/>
            </fields>
        </xmlFile>
        <!-- no namespace in the document -->
        <xmlFile ns="default" setBoost="0.5">
            <indexer class="lius.index.xml.XmlFileIndexer">
                <mime>text/xml</mime>
            </indexer>
            <fields>
                <luceneField name="fullText" xpathSelect="//*" type="Text" ocurSep=" "/>
            </fields>
        </xmlFile>
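
To illustrate how an xpathSelect entry such as //dc:title from the config above can be evaluated without any Lucene classes, here is a minimal sketch. It uses the JDK's built-in javax.xml.xpath API rather than JDOM, so it is not how Lius itself does it. The class name DcXPathSketch and the method extract are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of Lius): evaluates one configured XPath against a document. */
final class DcXPathSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of every node matched by xpathSelect; the "dc" prefix is bound to DC_NS. */
    static List<String> extract(String xml, String xpathSelect) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, dc:title would never match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    List<String> prefixes = new ArrayList<String>();
                    if (DC_NS.equals(uri)) {
                        prefixes.add("dc");
                    }
                    return prefixes.iterator();
                }
            });

            // Collect the text content of each matched node (multiple values are possible).
            NodeList nodes = (NodeList) xpath.evaluate(xpathSelect, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + xpathSelect, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<record xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:title>Tika</dc:title>"
                + "<dc:subject>parsing</dc:subject><dc:subject>metadata</dc:subject>"
                + "</record>";
        System.out.println(extract(xml, "//dc:title"));   // prints [Tika]
        System.out.println(extract(xml, "//dc:subject")); // prints [parsing, metadata]
    }
}
```

The same method would work for node extraction: select the repeated element first (for example each RSS item), then run the field XPaths relative to each selected node.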


Lius uses configuration to define the content extraction model. Each indexer
has a method getPopulatedLiusFields which returns a collection of
LiusField objects. Each LiusField object contains the following information:
- Name
- Value (extracted from the XML document)
- Values (multiple values)
- XPath used to extract the information
- Some Lucene properties, like field type, boost, etc. These properties can be
removed.
The config file must also be adapted to remove all Lucene properties.
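
As a rough idea of what such a field object could look like once the Lucene properties are removed, here is a small sketch. The class name ExtractedField and its accessors are hypothetical; this is not the existing Lius class, only the four Lucene-free items from the list above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical Lucene-free field: name, the XPath that produced it, and its values. */
final class ExtractedField {
    private final String name;
    private final String xpathSelect;
    private final List<String> values;

    ExtractedField(String name, String xpathSelect, List<String> values) {
        this.name = name;
        this.xpathSelect = xpathSelect;
        // Defensive copy so later changes to the caller's list do not leak in.
        this.values = Collections.unmodifiableList(new ArrayList<String>(values));
    }

    String getName() {
        return name;
    }

    String getXpathSelect() {
        return xpathSelect;
    }

    /** First extracted value, or null when the XPath matched nothing. */
    String getValue() {
        return values.isEmpty() ? null : values.get(0);
    }

    /** All extracted values, in document order. */
    List<String> getValues() {
        return values;
    }

    public static void main(String[] args) {
        List<String> subjects = new ArrayList<String>();
        subjects.add("parsing");
        subjects.add("metadata");
        ExtractedField field = new ExtractedField("subject", "//dc:subject", subjects);
        System.out.println(field.getName() + " = " + field.getValues());
    }
}
```

Field type and boost are left out on purpose; an indexing layer on top could map field names to those settings in its own configuration.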

You can look at the JUnit package in the source code for some code examples.
There is also a small tutorial:
http://www.doculibre.com/lius/doc-1.0_en.html




> > SAX could be more powerful but does not offer XPath for the extraction
> > of contents.
>
> It's possible to transform a SAX stream into a DOM tree for easy XPath
> access so I don't think we lose any functionality by choosing SAX over
> a DOM model. In fact it is even possible to evaluate XPath expressions
> against a live SAX stream, you just won't get full DOM nodes as the
> results.
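
The SAX-to-DOM route described above can be sketched with JDK classes only, assuming an identity transform from a SAXSource into a DOMResult is an acceptable way to build the tree. The class name SaxToDomSketch and the sample RSS-like input are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical example: feed SAX events into a DOM tree, then query the tree with XPath. */
final class SaxToDomSketch {

    /** Returns the string value of the first node matched by the expression. */
    static String evaluate(String xml, String expression) {
        try {
            SAXSource source = new SAXSource(new InputSource(new StringReader(xml)));
            DOMResult result = new DOMResult();
            // A transformer created without a stylesheet copies its input unchanged,
            // so the SAX events end up as a DOM tree inside the DOMResult.
            TransformerFactory.newInstance().newTransformer().transform(source, result);
            Node root = result.getNode();
            return XPathFactory.newInstance().newXPath().evaluate(expression, root);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss><channel>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(evaluate(rss, "//item[2]/title")); // prints Second
    }
}
```
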
>
> > If you have questions about Lius do not hesitate to contact
> > me. The source code is available: http://sourceforge.net/projects/lius/
>
> Why do you have the class files instead of the java files in Lius svn?


I don't know what happened; this is an error and I will fix it. Thanks. You can
download the latest release from SourceForge.

> BR,
>
> Jukka Zitting



I suggest using a parser interface that gives us all the information we
need about a document. The interface could look like this:

public interface Parser {
   public String getContent(...);
   // This can be metadata or other content that the user wants to extract
   // (using XPath, regex, etc.)
   public Collection getStructuredContent();
   public getLangage(...);
   public getMimeType(...);
   public String[] getOutLinks(...);
   public int getSize(...);
   public int getNbWords(...);
   // etc.
}
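
To make the idea a bit more concrete, here is a compilable sketch of a subset of that interface together with a trivial plain-text implementation. The byte[] parameter, the String return types and the PlainTextParser class are assumptions chosen only for this example; the real signatures are still open for discussion. Note that nothing in it depends on Lucene.

```java
import java.nio.charset.StandardCharsets;

/** Subset of the proposed interface; parameter and return types are assumptions. */
interface Parser {
    String getContent(byte[] data);
    String getMimeType(byte[] data);
    int getSize(byte[] data);
    int getNbWords(byte[] data);
}

/** Hypothetical implementation for UTF-8 plain text. */
final class PlainTextParser implements Parser {

    public String getContent(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public String getMimeType(byte[] data) {
        // A real parser would detect the type; this one only handles plain text.
        return "text/plain";
    }

    public int getSize(byte[] data) {
        return data.length; // size in bytes
    }

    public int getNbWords(byte[] data) {
        // Words are taken to be runs of non-whitespace characters.
        String content = getContent(data).trim();
        return content.isEmpty() ? 0 : content.split("\\s+").length;
    }

    public static void main(String[] args) {
        byte[] data = "Tika extracts text and metadata".getBytes(StandardCharsets.UTF_8);
        Parser parser = new PlainTextParser();
        System.out.println(parser.getMimeType(data) + ", "
                + parser.getSize(data) + " bytes, "
                + parser.getNbWords(data) + " words");
    }
}
```

An indexer built on Lucene could then consume these values in a separate module, which keeps the compile-time dependency out of the parsers.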
I'm sorry for my bad English. If anything is unclear, please let me know.
BR,
Rida Benjelloun
