lucene-java-user mailing list archives

From "Karl Koch" <>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 10:05:53 GMT
I apologise in advance if some of what I write here has been said before.
The last three answers to my question suggested pattern-matching
solutions and Swing. Pattern matching (java.util.regex) was only introduced
in Java 1.4, and I cannot use Swing either, since I work with Java 1.1 on a PDA.

I am wondering if somebody knows of a simple piece of source code with low
requirements that runs under this tight specification.
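
To make the constraint concrete, the kind of code that would fit is
something like the sketch below: a hand-written character scanner that drops
everything between '<' and '>' and uses only java.lang classes that exist in
Java 1.1 (String and StringBuffer; no regex, no Swing). The class and method
names are invented for illustration, and it is deliberately naive: it does
not decode entities such as &amp; and it assumes no '<' or '>' appears inside
attribute values, comments or script blocks.

```java
// Hypothetical helper for illustration; uses only Java 1.1-era java.lang classes.
class TagStripper {

    // Returns the input with every "<...>" span removed.
    // Assumption: '<' always opens a tag and the next '>' closes it.
    public static String stripTags(String html) {
        StringBuffer out = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;           // tag closed, resume copying
            } else if (!inTag) {
                out.append(c);           // ordinary text outside any tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Something this small would probably not cope well with the markup that MS Word
produces, but it shows the level of dependencies I can afford.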

Thank you all,

> No one has yet mentioned using ParserDelegator and ParserCallback that 
> are part of HTMLEditorKit in Swing.  I have been successfully using 
> these classes to parse out the text of an HTML file.  You just need to 
> extend HTMLEditorKit.ParserCallback and override the various methods 
> that are called when different tags are encountered.
> On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
> > Three HTML parsers (Lucene web application
> > demo, CyberNeko HTML Parser, JTidy) are mentioned in
> > Lucene FAQ
> > 1.3.27. Which is the best? Can it filter tags that are
> > auto-created by the MS Word 'Save As HTML files' function?
> -- 
> Bill Tschumy
> Otherwise -- Austin, TX
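
For readers who can use Swing, the ParserCallback approach quoted above might
look roughly like the following. This is only a sketch: the class name
HtmlTextExtractor and the extract method are invented for illustration, and
it collects text only, ignoring all tag callbacks. It depends on
javax.swing, so it does not help on Java 1.1.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that gathers the character data the parser reports.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuffer text = new StringBuffer();

    // Invoked by the parser for each run of text found between tags.
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' ');   // separate runs so words do not fuse together
        }
        text.append(data);
    }

    // Parses the given HTML and returns the collected text.
    public static String extract(String html) throws IOException {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        Reader reader = new StringReader(html);
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(reader, callback, true);
        return callback.text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Other methods such as handleStartTag and handleEndTag could be overridden in
the same way if particular tags need special treatment.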
