nutch-dev mailing list archives

From Jun Yang <jun...@gmail.com>
Subject Re: parsing a simple text node
Date Wed, 09 Feb 2011 09:07:33 GMT
Hi Evert,

Thanks for the reply. Actually, I left out some details in my original email.

When I look at it through Firebug, the element looks like:
<span class="product-price-amount" style="visibility: visible; opacity: 1;">
    <cufon class="cufon cufon-canvas" alt="$27.00" style="width: 101px; height: 20px;">
        <canvas width="126" height="25" style="width: 126px; height: 25px; top: -3px; left: -2px;">
        </canvas>
        <cufontext>$27.00</cufontext>
    </cufon>
</span>

But when I look at it through "View Source", it becomes:

<span class="product-price-amount">

             $27.00</span>

When I parse it, it looks like I am parsing the second one (I cannot get
the <cufon> node at all).

Does this mean it's dynamically generated by JS?
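
For reference, here is a minimal sketch of pulling the price out of the "View Source" form of the markup. It is not Nutch plugin code: it assumes the fragment is well-formed enough for the JDK's built-in XML parser, and the class name PriceExtractor and the method extractPrice are made up for illustration. It selects the span by its class attribute with XPath and trims the whitespace that surrounds the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch.
class PriceExtractor {

    // Returns the trimmed text of the first span whose class attribute is
    // exactly "product-price-amount", or null if no such span exists.
    static String extractPrice(String markup) throws Exception {
        // Assumption: the markup is well-formed XML, so the JDK parser accepts it.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));

        Node span = (Node) XPathFactory.newInstance().newXPath().evaluate(
                "//span[@class='product-price-amount']", doc, XPathConstants.NODE);
        if (span == null) {
            return null;
        }
        // getTextContent() concatenates all descendant text; trim() drops
        // the leading newlines and spaces before the price.
        return span.getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<span class=\"product-price-amount\">\n\n             $27.00</span>";
        System.out.println(extractPrice(raw));
    }
}
```

If a sketch like this returns the price for the raw page source but the plugin still sees an empty text node, the difference is more likely in how the fetched HTML is turned into a DOM than in the DOM calls themselves.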

Jun


On Tue, Feb 8, 2011 at 3:57 AM, Evert Wagenaar <evert.wagenaar@yahoo.com> wrote:

> Hi Jun,
>
> Could it be that the price is set by JavaScript at the moment of display in
> your browser? In that case the price is actually in some data source (XML) or
> a separate .js file. This is sometimes done when pages need to be displayed
> in several browsers, such as the iPhone's and regular desktop browsers.
>
> Did you try using an XPath expression? In your case it would be
> //span[@class='product-price-amount']. There are some good Firefox add-ons
> to test XPaths on HTML. I use XPather.
>
> Regards,
>
> Evert
>
>
>
> ------------------------------
> *From: *"Jun Yang" <juny78@gmail.com>
> *To: *dev@nutch.apache.org
> *Sent: *Tuesday 8 February 2011 09:16:50
> *Subject: *parsing a simple text node
>
>
> Hi there,
>
> I am working on a plugin to fetch some structured information (e.g.,
> product price) from web pages, and I have a problem parsing the following
> simple node:
>
> <span class="product-price-amount">
>
>              $27.00</span>
>
> The parser first got the Node for "span", which has only one child node, a
> text Node. I would assume this text Node has the value "$27.00", but when I
> called getNodeValue() the return value was empty. I cast this child node to
> Text and called getWholeText(), but still got an empty return value.
>
> Does anyone know what's going on? It seems that the text "$27.00" is
> missing from the whole hierarchy.
>
> Jun
>
>
>
>
