gora-dev mailing list archives

From Alparslan Avcı <alparslan.a...@agmlab.com>
Subject Getting statistics about crawled pages
Date Wed, 19 Feb 2014 12:03:00 GMT
Hi all,

To learn more about the structure of the pages we crawl, I think we
need to save the HTML tags, their attributes, and the attribute values.
Once Nutch provides this info, a data analysis process (with the help of
Pig, for example) can be run over the collected data. (Google also saves
this kind of info. You can see the stats at this link:
https://developers.google.com/webmasters/state-of-the-web/) We can
develop an HTML parser plug-in to provide this improvement.

In the plug-in, we can traverse the DOM starting from the root element
and save the tags, attributes, and values into the WebPage object. We
could create a new field for this, but that would change the data model.
Instead, we can add the tag info to the metadata map. (We can also add a
prefix to the map keys to distinguish the tag data from other metadata.)
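
To make the idea concrete, here is a rough sketch of the traversal, not a
finished plug-in. It uses only the JDK's org.w3c.dom API. The class name
TagStatsSketch, the "tagstat:" prefix, and the key layout are all
hypothetical, and a plain Map stands in for the WebPage metadata map, so
the Nutch plug-in wiring and value serialization are left out on purpose.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class TagStatsSketch {
    // Hypothetical key prefix that separates tag statistics from other metadata.
    static final String PREFIX = "tagstat:";

    // Walk the DOM depth-first and count every tag and every attribute/value pair.
    static void collect(Node node, Map<String, Integer> stats) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String tag = node.getNodeName().toLowerCase(Locale.ROOT);
            stats.merge(PREFIX + tag, 1, Integer::sum);
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String key = PREFIX + tag + "@"
                        + attr.getNodeName().toLowerCase(Locale.ROOT)
                        + "=" + attr.getNodeValue();
                stats.merge(key, 1, Integer::sum);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, stats);
        }
    }

    // Demo helper: parses well-formed XHTML only. A real parser plug-in would
    // be handed an already-built DOM instead of parsing the page itself.
    static Map<String, Integer> collectFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Integer> stats = new TreeMap<>();
            collect(doc.getDocumentElement(), stats);
            return stats;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"x\">1</a><a href=\"x\">2</a></body></html>";
        collectFromXhtml(page).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Storing counts rather than one entry per occurrence keeps the map small;
whether that is enough detail for the later Pig analysis is an open
question.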

What do you think about this? Any comments or suggestions?

