nutch-dev mailing list archives

From "Chirag Chaman" <...@filangy.com>
Subject RE: both html parser have bug with javascript
Date Tue, 05 Jul 2005 20:38:30 GMT
Andrzej,

Thanks -- this works!


-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Monday, July 04, 2005 11:55 AM
To: nutch-dev@lucene.apache.org
Subject: Re: both html parser have bug with javascript

Chirag Chaman wrote:
> Andrzej,
> 
> Thank you -- and here we were going nuts thinking the problem might 
> have been with the plugin!
> Would it be possible to post the patch file of the changes once you 
> have made them as our version of Nutch is different from SVN.

I suggest keeping a vanilla version around and porting diffs to your tree;
otherwise you will end up with a version that drifts further and further
out of sync...

The change itself is trivial (available as 'svn diff -r 179640
DOMContentUtils.java'):

Index: DOMContentUtils.java
===================================================================
--- DOMContentUtils.java        (revision 179640)
+++ DOMContentUtils.java        (working copy)
@@ -102,25 +102,9 @@
                                               boolean abortOnNestedAnchors,
                                               int anchorDepth) {
      if ("script".equalsIgnoreCase(node.getNodeName())) {
-      Node n = node.getAttributes().getNamedItem("language");
-      if (n != null) {
-        String text = n.getNodeValue();
-        sb.append(text);
-      }
        return false;
      }
      if ("style".equalsIgnoreCase(node.getNodeName())) {
-      Node n = node.getAttributes().getNamedItem("rel");
-      if (n != null) {
-        String text = n.getNodeValue();
-        sb.append(text);
-      }
-      n = node.getAttributes().getNamedItem("type");
-      if (n != null) {
-        String text = n.getNodeValue();
-        if (sb.length() > 0) sb.append(", ");
-        sb.append(text);
-      }
        return false;
      }
      if (abortOnNestedAnchors && "a".equalsIgnoreCase(node.getNodeName())) {
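
To illustrate the effect of the patch, here is a minimal, self-contained
sketch of a DOM text walker that returns early on <script> and <style>
elements without appending any of their attribute values. It is not the
actual Nutch code: the class SkipScriptStyleSketch and the helpers
collectText and extract are hypothetical names, and the sketch assumes
well-formed XHTML input parsed with the JDK's own XML parser rather than
one of the Nutch HTML parsers.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class SkipScriptStyleSketch {

  // Hypothetical helper (not the Nutch method): appends the text found
  // under 'node' to 'sb', skipping <script> and <style> subtrees entirely.
  static void collectText(Node node, StringBuilder sb) {
    String name = node.getNodeName();
    if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
      // Mirrors the patched behavior: contribute nothing, not even the
      // "language", "rel" or "type" attribute values.
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), sb);
    }
  }

  // Hypothetical convenience wrapper: parses well-formed XHTML and
  // returns the extracted text.
  static String extract(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    StringBuilder sb = new StringBuilder();
    collectText(doc.getDocumentElement(), sb);
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String page =
        "<html><head><style type=\"text/css\">p { color: red }</style></head>"
        + "<body><p>Hello</p>"
        + "<script language=\"JavaScript\">var x = 1;</script>"
        + "<p>world</p></body></html>";
    System.out.println(extract(page)); // prints "Hello world"
  }
}
```

With the pre-patch logic, strings such as "JavaScript" or "text/css" would
have been appended to the extracted text; in this sketch they are not.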


> Thanks again.

You're welcome.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



