nutch-dev mailing list archives

From Evert Wagenaar <evert.wagen...@yahoo.com>
Subject Re: parsing a simple text node
Date Tue, 08 Feb 2011 11:57:45 GMT
Hi Jun, 

Could it be that the price is set by JavaScript at the moment the page is displayed in your browser?
In that case the price actually lives in some data source (XML) or a separate .js file. This is
sometimes done when pages need to be displayed in several browsers, such as the iPhone's and regular
desktop browsers.

Did you try using an XPath expression? In your case it would be //span[@class='product-price-amount'].
There are some good Firefox add-ons for testing XPaths on HTML. I use XPather.
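
As a rough illustration of the XPath suggestion, here is a minimal sketch using the standard javax.xml.xpath API. The class name PriceXPath and the method extractPrice are hypothetical, and the sketch assumes the markup is well-formed enough for a plain XML parser. Inside a Nutch plugin the DOM would come from the HTML parser instead, and some HTML parsers (NekoHTML, for example) may report element names in upper case, so the expression may need adjusting.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class PriceXPath {
    // Hypothetical helper (not part of Nutch): parses well-formed markup and
    // returns the trimmed text of the first span whose class attribute is
    // exactly "product-price-amount".
    static String extractPrice(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // With XPathConstants.STRING the result is the string value of the
            // first matching node, or "" when nothing matches.
            String value = (String) xpath.evaluate(
                    "//span[@class='product-price-amount']", doc, XPathConstants.STRING);
            return value.trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```

Calling PriceXPath.extractPrice on a document containing the span from the original message should yield the price text. An empty result would suggest that the value is not present in the fetched markup at all.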

Regards, 

Evert 




From: "Jun Yang" <juny78@gmail.com>
To: dev@nutch.apache.org
Sent: Tuesday, 8 February 2011 09:16:50
Subject: parsing a simple text node

Hi there, 

I am working on a plugin to fetch some structured information (e.g., product price) from web
pages, and I had a problem parsing the following simple node:

<span class="product-price-amount">
$27.00</span>
The parser first returned the Node for "span", which has only one child node, a text Node. I
would expect this text Node to have the value "$27.00", but when I called getNodeValue() the return
value was empty. I cast this child node to Text and called getWholeText() but still
got an empty return value.
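
One way to narrow this down, sketched here under the assumption of a standard org.w3c.dom tree, is to print every child of the span with its type and raw value. The class SpanText and its methods are hypothetical names, and the sketch uses the JDK XML parser rather than the HTML parser Nutch uses, so it only shows what the DOM API returns for this markup, not what the plugin actually receives. Note that getNodeValue() is defined to return null for Element nodes. The text lives in the child text nodes, and getTextContent() concatenates all descendant text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SpanText {
    // Hypothetical helper: parse a well-formed fragment and return its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    // One entry per direct child: numeric node type, node name, and the raw
    // value in brackets, so empty or whitespace-only text nodes are visible.
    static List<String> describeChildren(Node parent) {
        List<String> lines = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            lines.add(child.getNodeType() + " " + child.getNodeName()
                    + " [" + child.getNodeValue() + "]");
        }
        return lines;
    }
}
```

Running describeChildren on the span Node inside the plugin and comparing the output with what this sketch produces for the same markup may show whether the text node is really empty or whether the text sits in a different node than expected.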

Does anyone know what's going on? The text "$27.00" seems to be missing from
the whole hierarchy.

Jun 





      