nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "GoogleSummerOfCode/PrecisionDataExtractor" by AmmarShadiq
Date Thu, 28 Jan 2016 03:01:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GoogleSummerOfCode/PrecisionDataExtractor" page has been changed by AmmarShadiq:

New page:

||'''Title :'''||||GSOC 2016 Proposal||
||'''Issue (Formerly):'''|||| [[|NUTCH-987 - A Plugin for extracting certain element of a web page on html page parsing]]||
||'''Student :'''||||Ammar Shadiq -||
||'''Mentors :'''||||||

=== Abstract ===

Nutch uses the parse-html plugin to parse web pages. The plugin processes the content of a web
page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted
text to be stored in the index. By default, Nutch cannot select individual elements of an HTML
page, such as particular tags, particular content, or a specific part of the page.
An HTML page has a tree-like, XML-style structure, with HTML tags as its branches and text as
its leaf nodes. These branches and nodes can be extracted using XPath. XPath allows us to select
a particular branch or node of an XML document, and can therefore be used to extract specific
information and treat it differently based on its content and the user's requirements.
Furthermore, a web domain such as a news website usually uses the same HTML structure to store
information across its pages. Pages that share this structure can be parsed with the same XPath
queries to retrieve the same content elements. All of the XPath queries for selecting the
various content elements can be stored in an XPath configuration file.
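
As a rough illustration of the idea above (not the proposed Nutch plugin itself), the sketch
below applies a small set of XPath queries to a page and collects one text value per field.
It is written in Python with the standard library only. The sample page, the field names and
the `XPATH_CONFIG` mapping are all hypothetical. Note that `xml.etree.ElementTree` supports
only a subset of XPath and requires well-formed markup, so real-world HTML would first need
a tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real HTML is often not well-formed XML).
PAGE = """<html>
  <head><title>Example News</title><script>var x = 1;</script></head>
  <body>
    <div class="nav">Home | World | Sports</div>
    <div class="article">
      <h1>Sample headline</h1>
      <span class="author">Jane Doe</span>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>"""

# Hypothetical "XPath configuration": field name -> XPath query.
XPATH_CONFIG = {
    "title": ".//div[@class='article']/h1",
    "author": ".//div[@class='article']/span[@class='author']",
    "body": ".//div[@class='article']/p",
}


def extract_fields(page, config):
    """Return {field name: text} for every XPath query in config."""
    root = ET.fromstring(page)
    fields = {}
    for name, query in config.items():
        # Collect the text of every matching element, including nested text.
        texts = ["".join(node.itertext()).strip() for node in root.findall(query)]
        fields[name] = " ".join(texts)
    return fields


if __name__ == "__main__":
    for name, value in extract_fields(PAGE, XPATH_CONFIG).items():
        print(name, "=", value)
```

Only the elements named by the queries end up in the result. The navigation bar and the
script content are never selected, which is the behaviour the abstract asks for.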
Nutch is meant to crawl a variety of web sources, and not all of the pages retrieved from those
sources share the same HTML structure, so they have to be treated differently using the
correct XPath configuration. The correct XPath configuration can be selected automatically
by matching the URL of the web page against a regular expression that describes the valid URL
pattern for that configuration. This automatic mechanism allows Nutch users to process a
variety of web pages and keep only the information they want, making the index more accurate
and its content more flexible.
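
The URL-based selection described above could look roughly like the following Python sketch.
The URL patterns, the host names and the `select_config` helper are hypothetical. Returning
`None` when no pattern matches is also an assumption, standing in for "fall back to the
default parsing".

```python
import re

# Hypothetical list of (URL pattern, XPath configuration) pairs, checked in order.
CONFIGS = [
    (
        re.compile(r"^https?://news\.example\.com/articles/\d+"),
        {"title": ".//div[@class='article']/h1", "body": ".//div[@class='article']/p"},
    ),
    (
        re.compile(r"^https?://blog\.example\.org/"),
        {"title": ".//h2[@class='post-title']", "body": ".//div[@class='post']/p"},
    ),
]


def select_config(url, configs=CONFIGS):
    """Return the XPath configuration of the first pattern matching url, else None."""
    for pattern, xpaths in configs:
        if pattern.match(url):
            return xpaths
    return None  # assumption: the caller falls back to default parsing


if __name__ == "__main__":
    print(select_config("http://news.example.com/articles/42"))
    print(select_config("http://unknown.example.net/page.html"))
```

Checking the patterns in order keeps the behaviour predictable when two patterns overlap:
the first matching configuration wins.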

=== Introduction ===
To be added
=== Timeline: ===
To be added
=== Reference: ===


=== Reports ===
To be added
=== Documentation ===
To be added
=== Source Code ===
To be added
=== Jira Issues ===
To be added
