lucene-java-user mailing list archives

From Doug Cutting <>
Subject Re: demo HTML parser question
Date Thu, 23 Sep 2004 17:53:26 GMT

wrote:
> We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
> you know, it's for a demo.  I think it's threaded to optimize on time, to allow
> the calling thread to grab the title or top message even though it's not done
> parsing the entire HTML document.

That's almost right.  I originally wrote it that way to avoid having to 
ever buffer the entire text of the document.  The document is indexed 
while it is parsed.  But, as observed, this has lots of problems and was 
probably a bad idea.

Could someone provide a patch that removes the multi-threading?  We'd 
simply use a StringBuffer in HTMLParser.jj to collect the text.  Calls 
to pipeOut.write() would be replaced with text.append().  Then have the 
HTMLParser's constructor parse the page before returning, rather than 
spawn a thread, and getReader() would return a StringReader.  The public 
API of HTMLParser need not change at all and lots of complex threading 
code would be thrown away.  Anyone interested in coding this?
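
For anyone picking this up, here is a rough sketch of the shape described above. It is not the real HTMLParser.jj: the class name BufferingHtmlParser and its crude tag-stripping loop are hypothetical stand-ins for the JavaCC-generated parser and its grammar actions. It only illustrates the three points in the proposal: text is appended to a StringBuffer, the constructor parses the whole page before returning, and getReader() hands back a StringReader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical stand-in for the demo HTMLParser (assumption: the real
 * class is generated from HTMLParser.jj and is far more complete).
 */
class BufferingHtmlParser {
    // Collects the document text; takes the place of the pipe that the
    // parser thread used to write to.
    private final StringBuffer text = new StringBuffer();

    // Parses the entire page before returning; no thread is spawned.
    BufferingHtmlParser(Reader html) throws IOException {
        parse(html);
    }

    // Crude tag stripper standing in for the grammar actions. Where the
    // old code called pipeOut.write(...), this calls text.append(...).
    private void parse(Reader html) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = html.read()) != -1) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
                text.append(' '); // keep words on either side of a tag apart
            } else if (!inTag) {
                text.append((char) c);
            }
        }
    }

    // Same method name as before, now backed by the buffered text.
    Reader getReader() {
        return new StringReader(text.toString());
    }
}
```

A caller would still construct the parser and read from getReader() as before; the difference is that the whole document's text sits in memory by the time the constructor returns, which is the trade-off being accepted in exchange for dropping the threading code.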

