lucene-java-user mailing list archives

From sergiu gordea
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 10:17:46 GMT
Karl Koch wrote:

>I apologise in advance if some of what I write here has been said before.
>The last three answers to my question have been suggesting pattern matching
>solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
>is something I cannot use since I work with Java 1.1 on a PDA.
I see.

In that case you can read your HTML file line by line and write
something like this:

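String line;
int startPos, endPos;
StringBuffer text = new StringBuffer();
// reader is assumed to be a java.io.BufferedReader opened on the HTML file
while ((line = reader.readLine()) != null) {
    startPos = line.indexOf(">");               // end of the opening tag
    endPos = line.indexOf("<", startPos + 1);   // start of the next tag
    // keep only the text strictly between '>' and '<'
    if (startPos >= 0 && endPos > startPos + 1)
        text.append(line.substring(startPos + 1, endPos));
}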

This is just sample code. It should work if each line of the HTML file
holds a single tagged element, for example <p>some text</p>.
It can be a starting point for you.
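
If a line can contain more than one tag, a character-by-character loop is
one possible variation. The following is only a minimal sketch under some
assumptions: the class name TagStripper and the method stripTags are
hypothetical names made up for illustration, and the code relies on nothing
beyond String and StringBuffer, so it should not need pattern matching or
Swing. It does not decode entities such as &amp; and it does not treat
comments or script blocks specially.

```java
// Sketch only: TagStripper and stripTags are hypothetical names.
// Only String and StringBuffer are used, so no regex or Swing is required.
class TagStripper {

    // Returns the input with everything between '<' and '>' dropped.
    static String stripTags(String html) {
        StringBuffer text = new StringBuffer();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;       // a tag starts; skip until its closing '>'
            } else if (c == '>' && inTag) {
                inTag = false;      // tag finished; what follows is text again
            } else if (!inTag) {
                text.append(c);     // ordinary text outside any tag
            }
        }
        return text.toString();
    }
}
```

Calling TagStripper.stripTags on one line at a time loses track of a tag
that continues on the next line, so it is probably safer to pass the whole
document as one string, or to keep the inTag flag between calls.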

  Hope it helps,



>I am wondering if somebody knows a piece of simple source code with low
>requirements that runs under this tight specification.
>Thank you all,
>>No one has yet mentioned using ParserDelegator and ParserCallback that 
>>are part of HTMLEditorKit in Swing.  I have been successfully using 
>>these classes to parse out the text of an HTML file.  You just need to 
>>extend HTMLEditorKit.ParserCallback and override the various methods 
>>that are called when different tags are encountered.
>>On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
>>>Three HTML parsers (Lucene web application
>>>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>>>Lucene FAQ
>>>1.3.27. Which is the best? Can it filter tags that are
>>>auto-created by the MS Word 'Save As HTML files' function?
>>Bill Tschumy
>>Otherwise -- Austin, TX