tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject TIKA-420 patch for boilerplate removal
Date Sat, 10 Jul 2010 00:23:27 GMT
I've submitted a revised patch (https://issues.apache.org/jira/browse/TIKA-420),
and have one key question.

Currently the BoilerpipeContentHandler calls a delegate  
ContentHandler, but it only makes the following calls to the delegate:

startDocument();

then for each text block...

	startElement("p");
	characters(...);
	endElement("p");

endDocument();

This means that you don't get valid XHTML from the handler, which I  
think is OK (versus parsers, which must generate valid XHTML).

But I could easily add dummy tags for html and body - would that be  
better?
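
To make the two options concrete, here is a minimal sketch of the call sequence using plain SAX, not the code from the patch. The TextBlockEmitter class, its emit() and trace() methods, and the wrap flag are hypothetical names for illustration only. Note that html and body alone may still not be fully valid XHTML (there is no head or title), so this only shows the "dummy tags" idea.

```java
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not the actual BoilerpipeContentHandler.
public class TextBlockEmitter {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Sends one <p> per text block to the delegate. When wrap is true,
    // the paragraphs are enclosed in dummy <html> and <body> elements.
    public static void emit(List<String> blocks, ContentHandler delegate,
                            boolean wrap) throws SAXException {
        Attributes noAttrs = new AttributesImpl();
        delegate.startDocument();
        if (wrap) {
            delegate.startElement(XHTML, "html", "html", noAttrs);
            delegate.startElement(XHTML, "body", "body", noAttrs);
        }
        for (String block : blocks) {
            char[] chars = block.toCharArray();
            delegate.startElement(XHTML, "p", "p", noAttrs);
            delegate.characters(chars, 0, chars.length);
            delegate.endElement(XHTML, "p", "p");
        }
        if (wrap) {
            delegate.endElement(XHTML, "body", "body");
            delegate.endElement(XHTML, "html", "html");
        }
        delegate.endDocument();
    }

    // Records the events a delegate would see, as a simple tag string.
    public static String trace(List<String> blocks, boolean wrap) {
        final StringBuilder sb = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes attrs) {
                sb.append('<').append(local).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                sb.append("</").append(local).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
        };
        try {
            emit(blocks, recorder, wrap);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }
}
```

With wrap set to false the delegate sees only the p elements, as in the current patch; with wrap set to true the same paragraphs arrive inside html and body.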

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




