tika-dev mailing list archives

From Stephane Bastian <stephane_bast...@hotmail.com>
Subject Re: RFE: adding a ParserFactory class
Date Fri, 24 Oct 2008 10:54:58 GMT

Jukka Zitting wrote:
> Hi,
> On Thu, Oct 23, 2008 at 5:32 PM, Stephane Bastian
> <stephane_bastian@hotmail.com> wrote:
> > However, a ParserFactory class (which doesn't exist yet) would really help
> > us here and could provide public method(s) to do what's currently done
> > internally by the class AutoDetectParser
> You should be able to achieve this functionality by overriding the
> getParser(Metadata) method in CompositeParser (that AutoDetectParser
> inherits).
> Alteratively you could simply modify the Tika configuration and pass
> the modified configuration to the AutoDetectParser instance.
I could certainly subclass CompositeParser and override 
getParser(Metadata), but it seems odd that Tika doesn't provide an easy 
way to get a Parser based upon a stream, document name, and content type. 
In fact, looking closer at Tika's internals, we would simply need to add 
the following new method to MimeTypes:

MimeType mimeType = MimeTypes.getMimeType(inputStream, metadata);

We can then easily get the parser via the TikaConfig class (even though 
it's not optimal, since TikaConfig.getDefaultConfig() creates a new 
instance each time it's called). By the way, I can help here as well in 
case you want to make TikaConfig immutable; just let me know what you had 
in mind and I can work on it.

Going back to the original question, don't you feel that it is a common 
use case to be able to get a Parser from a Stream and metadata?
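To make the use case concrete, here is a minimal sketch of the kind of factory lookup I have in mind. All the names (ParserFactory, register, the stand-in Parser interface) are hypothetical, for discussion only; Tika's real Parser interface and detection pipeline differ, and a real factory would first detect the content type from the stream and document name, as AutoDetectParser does internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: ParserFactory and its methods are not Tika API.
public class ParserFactorySketch {

    // Stand-in for org.apache.tika.parser.Parser
    interface Parser {
        String name();
    }

    static class ParserFactory {
        private final Map<String, Parser> parsersByType = new HashMap<>();

        void register(String contentType, Parser parser) {
            parsersByType.put(contentType, parser);
        }

        // Resolve a parser from an already-detected content type.
        Parser getParser(String contentType) {
            return parsersByType.get(contentType);
        }
    }

    static String resolve(String contentType) {
        ParserFactory factory = new ParserFactory();
        factory.register("text/html", () -> "html-parser");
        factory.register("application/pdf", () -> "pdf-parser");
        Parser p = factory.getParser(contentType);
        return p == null ? "none" : p.name();
    }

    public static void main(String[] args) {
        System.out.println(resolve("text/html"));
    }
}
```

The point is simply that callers should be able to go from (stream, name, type) to a Parser in one public call, instead of reimplementing what AutoDetectParser already does privately.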

> More generally, is there a specific reason why you need custom
> processing for HTML?
We are using Tika to get metadata, as you may have guessed :), and to 
extract other data as well. For instance, in the case of HTML, we were 
planning on using the content handler to do screen scraping, based upon 
the known structure of the HTML document.
We were also planning on using the content handler to extract links that 
have specific names, or links that come from specific tags (such as 
href, script, img...). In our case we don't want all the links, only 
some of them, based on internal logic that we'll put in the content 
handler. And we can't rely on the full text, because the names of links 
(and other information we need to filter on) are missing from it.
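Since the content handler receives SAX events, the filtering logic we have in mind could be sketched with plain JDK SAX as follows. This is a standalone illustration (it parses an XHTML string directly rather than plugging into Tika's HtmlParser), and the tag choices are just examples of our internal filtering:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects only the links we care about, based on which tag they come from.
public class LinkFilterHandler extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Keep <a href> links, plus src attributes from <img> and <script>
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) links.add(href);
        } else if ("img".equalsIgnoreCase(qName)
                || "script".equalsIgnoreCase(qName)) {
            String src = atts.getValue("src");
            if (src != null) links.add(src);
        }
    }

    public List<String> getLinks() { return links; }

    // Parse an XHTML string and return the filtered links.
    public static List<String> extract(String xhtml) throws Exception {
        LinkFilterHandler handler = new LinkFilterHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><a href=\"/home\">home</a>"
            + "<img src=\"logo.png\"/><div>text</div></body></html>";
        System.out.println(extract(doc));
    }
}
```

For this to work against Tika, of course, the handler needs to see the original tags, which is exactly the problem described below.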

However, this morning I realized that the content handler of the HTML 
parser filters out tags such as divs, spans and the like, and doesn't 
return the original body of the document. This is a bummer... Therefore 
we are out of luck and can't do screen scraping, because the structure 
of the document we get back has been altered by Tika.

Since the HTML parser uses CyberNeko, the content handler is already 
receiving proper XHTML, right?
Can't the content handler just return the original document structure?

All the best,

Stephane Bastian
> BR,
> Jukka Zitting
