nutch-dev mailing list archives

From "Emmanuel Colin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.
Date Thu, 10 Jan 2013 10:12:12 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549176#comment-13549176 ]

Emmanuel Colin edited comment on NUTCH-978 at 1/10/13 10:10 AM:
----------------------------------------------------------------

It looks like there is a small misunderstanding; I must not have expressed myself very clearly:
I am not the author of this plugin.
The author of the code is the writer of the blog post I pointed to, so I would not presume
to submit his code ;)
What I can do is leave a comment on his blog suggesting that he submit his code, since from
our exchange it looks like such a submission would be welcome. (EDIT: done)
                
> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2012, mentor
>             Fix For: 2.2
>
>         Attachments: app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result_anchor.png,
app_screenshoot_configuration_result.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png,
for_GSoc.zip, [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the content of a web
> page by removing HTML tags and components such as JavaScript and CSS, leaving the extracted
> text to be stored in the index. By default, Nutch cannot select particular elements of an
> HTML page, such as certain tags, certain content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches and text as its
> nodes. These branches and nodes can be extracted using XPath. XPath lets us select a
> particular branch or node of an XML document, so it can be used to extract specific
> information and treat it differently based on its content and the user's requirements.
> Furthermore, a web domain such as a news website usually uses the same HTML structure for
> the information on its pages. Pages with the same structure can be parsed with the same
> XPath queries to retrieve the same content elements. All of the XPath queries for selecting
> the various content elements can be stored in an XPath configuration file.
> Nutch is meant to handle many different web sources, and the pages retrieved from those
> sources do not all share the same HTML structure, so each page has to be treated with the
> correct XPath configuration. The correct XPath configuration can be selected automatically
> by matching the URL of the web page against the regular expression of valid URLs for that
> configuration.
> This automatic mechanism allows Nutch users to process various web pages and get only the
> information they want, making the index more accurate and its content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting certain elements
> on various news websites for the purpose of document clustering. They include a
> Configuration Editor application built using the NetBeans 6.9 Application Framework, though
> it still needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
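
For readers skimming the thread, the mechanism described in the quoted issue (pick an XPath
configuration by matching the page URL against a regex, then run that configuration's XPath
queries on the page) could be sketched roughly as below. This is only an illustration using
the JDK's built-in XML and regex classes, not the code attached to the issue or linked from
the blog post. The names XPathConfigDemo, SiteConfig, select and extract, the field names,
and the news.example.com URL are all hypothetical, and the sample page is assumed to be
well-formed XHTML so that the stock JDK parser can read it; a real plugin would work on the
DOM that the HTML parser already produces.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathConfigDemo {

    /** Hypothetical configuration entry: a URL regex plus named XPath queries. */
    static final class SiteConfig {
        final Pattern urlPattern;
        final Map<String, String> fieldQueries = new LinkedHashMap<>();

        SiteConfig(String urlRegex) {
            this.urlPattern = Pattern.compile(urlRegex);
        }

        /** Registers one field name and the XPath query that extracts it. */
        SiteConfig field(String name, String xpathQuery) {
            fieldQueries.put(name, xpathQuery);
            return this;
        }
    }

    /** Returns the first configuration whose regex matches the whole URL, or null. */
    static SiteConfig select(Iterable<SiteConfig> configs, String url) {
        for (SiteConfig config : configs) {
            if (config.urlPattern.matcher(url).matches()) {
                return config;
            }
        }
        return null;
    }

    /** Evaluates every XPath query of the configuration against the page. */
    static Map<String, String> extract(SiteConfig config, String xhtml) {
        Map<String, String> fields = new LinkedHashMap<>();
        try {
            // Assumption: the input is well-formed XHTML; real-world HTML is not.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> query : config.fieldQueries.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match.
                fields.put(query.getKey(), xpath.evaluate(query.getValue(), doc).trim());
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the page", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<SiteConfig> configs = Arrays.asList(
                new SiteConfig("https?://news\\.example\\.com/.*")
                        .field("title", "//h1")
                        .field("lead", "//div[@id='story']/p[1]"));

        String page = "<html><body><h1>Title</h1><div id=\"nav\">Menu</div>"
                + "<div id=\"story\"><p>First.</p><p>Second.</p></div></body></html>";

        SiteConfig config = select(configs, "http://news.example.com/a/1.html");
        if (config != null) {
            System.out.println(extract(config, page));
        }
    }
}
```

Only the fields named in the matching configuration end up in the result, so navigation
text such as the "Menu" block in the sample page never reaches the index. A URL that matches
no pattern yields null here; falling back to the normal parse-html behaviour in that case
would be one reasonable choice, but that is a guess on my part rather than something the
issue specifies.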
