lucene-java-user mailing list archives

From sergiu gordea <>
Subject Re: which HTML parser is better?
Date Tue, 01 Feb 2005 09:54:33 GMT
Jingkang Zhang wrote:

>Three HTML parsers (Lucene web application
>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>Lucene FAQ
>1.3.27. Which is the best? Can it filter tags that are
>auto-created by MS-word 'Save As HTML files' function?

maybe you can try this library...

I use the following code to get the text from HTML files.
It has not been tested intensively, but it works.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

// source is presumably a java.io.File pointing at the HTML document
Parser parser = new Parser(source.getAbsolutePath());
NodeIterator iter = parser.elements();
while (iter.hasMoreNodes()) {
    Node element = iter.nextNode();
    //System.out.println("1:" + element.getText());
    // strip the markup, then decode HTML entities such as &amp;
    String text = Translate.decode(element.toPlainTextString());
    if (Utils.notEmptyString(text))
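
The Utils class used above is not shown in the message, so its behaviour can only be guessed. As a minimal sketch, assuming notEmptyString() means "not null and not whitespace-only", it could look like the block below. The joinNonEmpty() helper is hypothetical as well; it only illustrates one way the decoded fragments might be collected into a single string for indexing, and is not taken from the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Utils class referenced in the message.
final class Utils {
    private Utils() {}

    // Assumed behaviour: true when the string is non-null and contains
    // at least one non-whitespace character.
    static boolean notEmptyString(String s) {
        return s != null && s.trim().length() > 0;
    }

    // Hypothetical helper: trims each non-empty fragment and joins the
    // results with single spaces.
    static String joinNonEmpty(List<String> fragments) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : fragments) {
            if (notEmptyString(fragment)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(fragment.trim());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> fragments = new ArrayList<String>();
        fragments.add("Hello");
        fragments.add("   ");     // whitespace only, skipped
        fragments.add(null);      // null, skipped
        fragments.add(" world ");
        System.out.println(joinNonEmpty(fragments)); // prints "Hello world"
    }
}
```

Skipping whitespace-only fragments matters here because markup-heavy pages (for example MS Word exports) tend to produce many nodes whose plain text is empty.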


