nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter search terms).selenium"
Date Tue, 22 Sep 2015 22:02:04 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903524#comment-14903524
] 

Sebastian Nagel commented on NUTCH-2110:
----------------------------------------

Ok, understood. One point to consider: should all pages of a paginated document be kept under
the same URL? As a batch crawler, Nutch uses the URL in many places to uniquely identify content,
metadata, status information, indexed documents, etc. Of course, the outlinks generated for page 1
could be modified by adding a suffix that makes each URL unique. The suffix would then be removed
only inside protocol-selenium, so that the right page is fetched.
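
A minimal sketch of the suffix idea described above, in Java. Everything here is an assumption for illustration: the class name `PaginatedUrl`, its methods, and the marker string `;selenium-page=` are hypothetical and not part of Nutch or protocol-selenium. A real marker would also have to survive whatever URL normalizers and filters are configured.

```java
/**
 * Hypothetical helper (not part of Nutch): gives each page of a paginated
 * document its own unique key URL, and recovers the real URL for fetching.
 */
public final class PaginatedUrl {

    // Assumed marker; chosen only for this sketch.
    private static final String MARKER = ";selenium-page=";

    private PaginatedUrl() {}

    /** Builds the unique key URL that would be emitted as an outlink. */
    public static String withPage(String baseUrl, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return baseUrl + MARKER + page;
    }

    /** Strips the suffix; this is the URL the protocol plugin would open. */
    public static String fetchUrl(String keyUrl) {
        int i = keyUrl.lastIndexOf(MARKER);
        return i < 0 ? keyUrl : keyUrl.substring(0, i);
    }

    /**
     * Reads the page number from the suffix, or returns 1 when there is none.
     * Throws NumberFormatException if the text after the marker is not a number.
     */
    public static int pageOf(String keyUrl) {
        int i = keyUrl.lastIndexOf(MARKER);
        if (i < 0) {
            return 1;
        }
        return Integer.parseInt(keyUrl.substring(i + MARKER.length()));
    }
}
```

With this scheme the rest of the crawler would only ever see the suffixed key URL, while the protocol plugin would call `fetchUrl` and `pageOf` to decide which page to load and how far to paginate.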

> Create the capability to provide seeds in the form of "url+xpath(including option to
enter search terms).selenium"
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2110
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2110
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including option to
enter search terms).selenium" to be used by selenium protocols/plugins as URLs/flow to reach
a specific AJAX-based page, or to save the state of a selenium operation for the next fetching
round.
> At least, this should make Nutch capable of distinguishing whether a URL should be opened using
the basic http, httpclient, or selenium protocols, and provide the selenium protocol with basic
authentication capabilities based on the above ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
