nutch-dev mailing list archives

From "Hudson (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks
Date Tue, 20 Dec 2011 11:07:31 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173077#comment-13173077
] 

Hudson commented on NUTCH-1184:
-------------------------------

Integrated in nutch-trunk-maven #69 (See [https://builds.apache.org/job/nutch-trunk-maven/69/])
    Renamed FetcherStatus to FetcherOutlinks for the new outlinks section of NUTCH-1184
NUTCH-1184 Fetcher to parse and follow Nth degree outlinks

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221194
Files : 
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

markus : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1221181
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

                
> Fetcher to parse and follow Nth degree outlinks
> -----------------------------------------------
>
>                 Key: NUTCH-1184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1184
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch, NUTCH-1184-1.5-3.patch,
NUTCH-1184-1.5-4.patch, NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch, NUTCH-1184-1.5-9-ParseOutputFormat.patch,
NUTCH-1185-1.5-6.patch, NUTCH-1185-1.5-7.patch, NUTCH-1185-1.5-8.patch, NUTCH-1185-1.5-9.patch
>
>
> Fetcher improvements to parse and follow outlinks up to a specified depth. The number
of outlinks to follow can be decreased by depth using a divisor. This patch introduces three
new configuration directives:
> {code}
> <property>
>   <name>fetcher.follow.outlinks.depth</name>
>   <value>-1</value>
>   <description>(EXPERT)When fetcher.parse is true and this value is greater than
0 the fetcher will extract outlinks
>   and follow them until the desired depth is reached. A value of 1 means all generated pages
are fetched and their first degree
>   outlinks are fetched and parsed too. Be careful, this feature is in itself agnostic
of the state of the CrawlDB and does not
>   know about already fetched pages. A setting larger than 2 will most likely fetch home
pages twice in the same fetch cycle.
>   It is highly recommended to set db.ignore.external.links to true to restrict the outlink
follower to URLs within the same
>   domain. When disabled (false) the feature is likely to follow duplicates even when
depth=1.
>   A value of -1 or 0 disables this feature.
>   </description>
> </property>
> <property>
>   <name>fetcher.follow.outlinks.num.links</name>
>   <value>4</value>
>   <description>(EXPERT)The number of outlinks to follow when fetcher.follow.outlinks.depth
is enabled. Be careful, this can multiply
>   the total number of pages to fetch. This works with fetcher.follow.outlinks.depth.divisor:
with the default settings, the number of outlinks followed
>   at depth 1 is 8, not 4.
>   </description>
> </property>
> <property>
>   <name>fetcher.follow.outlinks.depth.divisor</name>
>   <value>2</value>
>   <description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links per fetcher.follow.outlinks.depth.
This decreases the number
>   of outlinks to follow as depth increases. The formula used is: outlinks = floor(divisor
/ depth * num.links). This prevents
>   exponential growth of the fetch list.
>   </description>
> </property>
> {code}
> Please, do not use this unless you know what you're doing. This feature does not consider
the state of the CrawlDB nor does it consider generator settings such as limiting the number
of pages per (domain|host|ip) queue. It is not polite to use this feature with high settings
as it can fetch many pages from the same domain including duplicates.
> Also, this feature will _not_ work if fetcher.parse is disabled. With parsing enabled
you might want to consider not storing the downloaded content.
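
To make the divisor formula in the description concrete, here is a minimal Python sketch that evaluates outlinks = floor(divisor / depth * num.links) for each depth. It is only an illustration of the formula as quoted, not the code in Fetcher.java; the function names outlinks_to_follow and limits_per_depth are made up for this example, and the defaults (num.links=4, divisor=2) are the ones shown in the properties above.

```python
def outlinks_to_follow(depth: int, num_links: int = 4, divisor: int = 2) -> int:
    """Evaluate floor(divisor / depth * num_links) for a 1-based depth.

    Hypothetical helper: integer floor division is used instead of float
    math; for positive integers the result is the same as the formula.
    """
    if depth < 1:
        raise ValueError("depth must be 1 or greater")
    return (divisor * num_links) // depth


def limits_per_depth(max_depth: int, num_links: int = 4, divisor: int = 2) -> list[int]:
    """Per-depth outlink limits for depths 1..max_depth (illustrative only)."""
    return [outlinks_to_follow(d, num_links, divisor) for d in range(1, max_depth + 1)]


if __name__ == "__main__":
    # With the defaults, depth 1 follows 8 outlinks, not 4, as the
    # description of fetcher.follow.outlinks.num.links points out.
    for depth, limit in enumerate(limits_per_depth(3), start=1):
        print(f"depth {depth}: follow up to {limit} outlinks per page")
```

Under these assumptions the limit shrinks from 8 at depth 1 to 4 at depth 2 and 2 at depth 3, which is how the divisor dampens the growth of the fetch list.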

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
