nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: both html parser have bug with javascript
Date Mon, 04 Jul 2005 10:04:58 GMT
Chirag Chaman wrote:
> Actually, I think the JavaScript is there as it's part of the HTML page --
> but it should not be part of the summaries.  Has anyone found a solution to
> not showing the "JavaScript" or "text/css" -- that shows up from time to
> time?

The summary is generated from the parse_text data, so the problem already
occurs during parsing.

Actually, I think the problem is caused by my patch to DOMContentUtils 
;-), which adds script language, stylesheet type and so on to the output 
text.

From your comments I gather that you'd rather not have it there - I'll
fix it.
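
To illustrate the intended behaviour, here is a minimal sketch. It is not the
actual DOMContentUtils code: the class and method names are hypothetical, and
it assumes well-formed XHTML parsed with the JDK's DOM parser rather than the
Nutch HTML parsers. It collects only text nodes, so attribute values such as
language="JavaScript" or type="text/css" never reach the output, and it skips
the contents of script and style elements entirely.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not part of Nutch.
class VisibleTextSketch {

    // True for elements whose content should never appear in parse text.
    static boolean isSkipped(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        String name = node.getNodeName();
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    // Depth-first walk that appends only text nodes, separated by one space.
    // Attributes are never visited, so script language and stylesheet type
    // cannot leak into the output.
    static void collectText(Node node, StringBuilder out) {
        if (isSkipped(node)) {
            return; // drop the whole subtree
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Parses a well-formed XHTML string and returns its visible text.
    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

With text extracted this way, summaries built from it would contain only the
page's visible text.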

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

