nutch-dev mailing list archives

From "Asitang Mishra (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2091) Increase robustness and crawling versatility of Nutch for the Deep Web
Date Mon, 28 Sep 2015 22:24:06 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asitang Mishra updated NUTCH-2091:
----------------------------------
    Priority: Major  (was: Minor)

> Increase robustness and crawling versatility of Nutch for the Deep Web
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2091
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2091
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex, nutch
>
> In certain cases Nutch fails to fetch a page, or crawls less productively than it could. This issue is meant to discuss those specific cases and generalize the fixes into Nutch to make it more robust and productive.
> I ran into many problems while crawling three websites and have boiled them down to the following points.
> 1. Some websites (e.g. marketwired) detect that the crawler is not a browser, for example through cookie validation, and keep sending it back to the first page.
> 2. Some data sits behind a click on an 'a' tag that is not exactly a link (javascript void), so the crawler has to detect which clicks matter. This would be an improvement for the selenium plugin.
> 3. When a click changes the page, how do we get back to the previous page before clicking further? We obviously can't rely on finding a back button or a close button. One option is to save the old state, compare it with the new page, and keep only the extra information.
> 4. Differentiate between a navigation link and an ordinary link on a forum page so that the two kinds can be used differently to steer the crawler (nav links decide the rounds; other links are followed for one round only).
> 5. Add the ability to change # to ? in URLs (pataxia.com). Right now URL normalization removes everything after the #, assuming it is a simple anchor.
> 6. An easy route-decision setting in the property file to decide how the fetcher behaves. Instead of going all BFS or DFS, there should be a way to make it do a DEPTH-LIMITED search, which is especially good for forums and similar sites. Users could also supply known inputs, such as depth, to direct the crawler if they know something specific about the site.
> 7. A forum can be roughly generalized as: a meta topic page (no nav links) -> post list (with nav links) -> post page (with nav links). How can we make Nutch aware of this structure/hierarchy, possibly with simple clues supplied manually? This can be seen as an extension of the previous point.
> 8. Sometimes even nav links are not actual links but AJAX requests.
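To make point 5 concrete, here is a minimal sketch of the "# to ?" rewrite as a standalone Java helper. It is not Nutch code and is not wired into any normalizer plugin; the class name FragmentToQuery, the method rewrite, and the example.com URLs are hypothetical. It assumes the site treats the fragment as if it were a query string, which would have to be confirmed per site (hence the per-site switch suggested above).

```java
public final class FragmentToQuery {
    private FragmentToQuery() {}

    /**
     * Moves the fragment (the part after '#') into the query string.
     * Hypothetical helper for illustration; not part of Nutch.
     */
    public static String rewrite(String url) {
        int hash = url.indexOf('#');
        if (hash < 0) {
            return url; // no fragment, nothing to do
        }
        String base = url.substring(0, hash);
        String fragment = url.substring(hash + 1);
        if (fragment.isEmpty()) {
            return base; // bare '#': drop it, as a plain anchor strip would
        }
        // Append to an existing query with '&', otherwise start one with '?'.
        char separator = base.indexOf('?') >= 0 ? '&' : '?';
        return base + separator + fragment;
    }
}
```

For example, rewrite("http://example.com/forum#page=2") yields "http://example.com/forum?page=2", so the two pages no longer collapse into one URL after normalization.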
> NOTE: Nav links (as defined here) are the structure on a web page (such as a forum) that lets us go to other pages by number or via next, previous, first, or last links.
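The depth cap asked for in point 6 can be sketched as follows. This is only an illustration over an in-memory link graph, not the Nutch fetcher: the class DepthLimitedCrawl, the method crawl, and the maxDepth parameter are hypothetical names, and the traversal is a level-by-level (breadth-first) walk with a depth cap, chosen because it resembles crawl rounds, rather than a classical depth-limited DFS.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class DepthLimitedCrawl {
    private DepthLimitedCrawl() {}

    /**
     * Returns the pages reachable from seed within maxDepth link hops,
     * in visit order. links maps each page to its outgoing links.
     */
    public static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        // One loop iteration per depth level, similar to one crawl round.
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                visited.add(url);
                for (String out : links.getOrDefault(url, Collections.<String>emptyList())) {
                    if (seen.add(out)) { // true only the first time a URL is seen
                        next.add(out);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}
```

With a forum laid out as in point 7, a user who knows that posts sit two hops below the topic page could pass maxDepth = 2 and keep the crawler from wandering further.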



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
