nutch-dev mailing list archives

From "Chirag Chaman" <...@filangy.com>
Subject RE: both html parser have bug with javascript
Date Mon, 04 Jul 2005 13:14:21 GMT
Andrzej,

Thank you -- and here we were going nuts thinking the problem might have
been with the plugin!
Would it be possible to post a patch file of the changes once you have
made them? Our version of Nutch differs from SVN.

Thanks again.

CC-
 

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Monday, July 04, 2005 6:05 AM
To: nutch-dev@lucene.apache.org
Subject: Re: both html parser have bug with javascript

Chirag Chaman wrote:
> Actually, I think the JavaScript is there as it's part of the HTML 
> page -- but it should not be part of the summaries.  Has anyone found 
> a solution to not showing the "JavaScript" or "text/css" -- that shows 
> up from time to time?

The summary is generated from the parse_text data, so the problem already
occurs during parsing.

Actually, I think the problem is caused by my patch to DOMContentUtils ;-),
which adds script language, stylesheet type and so on to the output text.

From your comments I gather that you'd rather not have it there - I'll fix
it.
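
To illustrate the idea, here is a minimal sketch of text extraction that leaves script and stylesheet elements (and their attributes, such as the language or type) out of the output text. This is not the actual DOMContentUtils code or the eventual patch; the class and method names (TextExtractSketch, isSkipped, collectText, extractText) are hypothetical, and it assumes well-formed XHTML input parsed with the JDK's own XML parser rather than one of the Nutch HTML parsers.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not Nutch code: walk a DOM tree and collect text,
// ignoring <script> and <style> elements entirely.
public class TextExtractSketch {

    // True for elements whose content and attributes should stay out of the text.
    static boolean isSkipped(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        String name = node.getNodeName();
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    // Depth-first walk that appends non-empty text nodes, separated by one space.
    static void collectText(Node node, StringBuilder out) {
        if (isSkipped(node)) {
            return; // skip the whole subtree, including type/language attributes
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Assumes the input is well-formed XHTML; real-world HTML needs a tolerant parser.
    static String extractText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><style type=\"text/css\">p { color: red; }</style></head>"
                + "<body><script language=\"JavaScript\">var x = 1;</script>"
                + "<p>Hello</p><p>world</p></body></html>";
        System.out.println(extractText(page)); // prints "Hello world"
    }
}
```

With this approach neither "JavaScript" nor "text/css" reaches the extracted text, so they cannot show up in a summary built from it.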

--
Best regards,
Andrzej Bialecki     <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



