nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2300) Fetcher to optionally save robots.txt
Date Fri, 19 Aug 2016 13:48:21 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428210#comment-15428210 ]

ASF GitHub Bot commented on NUTCH-2300:
---------------------------------------

GitHub user sebastian-nagel opened a pull request:

    https://github.com/apache/nutch/pull/141

    NUTCH-2300 Fetcher to optionally save robots.txt

    If the property fetcher.store.robotstxt is set to true, Fetcher saves the robots.txt
    response (URL and content, including HTTP protocol status and metadata) in the
    segment (subfolder content/). It does not add a fetch datum, so that the robots.txt
    URL neither slips into the CrawlDb nor gets indexed. The property only takes effect
    in combination with fetcher.store.content. The robots.txt can then be retrieved
    from the segment, e.g., by
    ```
    # inject http://nutch.apache.org/
    # generate
    # and fetch with
    bin/nutch fetch -Dfetcher.store.robotstxt=true -Dfetcher.store.content=true ...path_to_segment
    
    # dump segment (without -nocontent)
    bin/nutch readseg -dump ...path_to_segment ...path_to_dump
    cat ...path_to_dump/dump
    ...
    URL:: http://nutch.apache.org/robots.txt
    
    Content::
    Version: -1
    url: http://nutch.apache.org/robots.txt
    base: http://nutch.apache.org/robots.txt
    contentType: text/html
    metadata: nutch.fetch.time=1471612087645 Server=Apache/2.4.7 (Ubuntu) Connection=close Content-Length=208 Date=Fri, 19 Aug 2016 13:08:07 GMT Content-Type=text/html; charset=iso-8859-1

    Content:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>404 Not Found</title>
    </head><body>
    <h1>Not Found</h1>
    <p>The requested URL /robots.txt was not found on this server.</p>
    </body></html>
    ...
    ```
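
To pick the stored robots.txt records out of a larger dump, a small shell filter may help. This is a minimal sketch and not part of the pull request: the function name `list_robots_urls` is hypothetical, and it assumes that record headers have the form `URL:: <url>`, as in the dump excerpt above.

```shell
# Hypothetical helper (not part of Nutch): print the URL of every
# robots.txt record found in a readseg dump file.
# Assumption: each record starts with a header line "URL:: <url>".
list_robots_urls() {
  # $1 is the path to the dump file written by "bin/nutch readseg -dump".
  # Only header lines ("URL::") are matched; the lowercase "url:" lines
  # inside the Content section are ignored.
  awk '$1 == "URL::" && $2 ~ /\/robots\.txt$/ { print $2 }' "$1"
}
```

Called as `list_robots_urls ...path_to_dump/dump`, it prints one matching URL per line.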

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sebastian-nagel/nutch SaveRobotsTxt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #141
    
----
commit 6c9cca5e55e43458cbc5e59b8591e4d27ac425a2
Author: Sebastian Nagel <snagel@apache.org>
Date:   2016-05-25T12:24:11Z

    Allow Fetcher to optionally store robots.txt content (if property fetcher.store.robotstxt == true).
    Improved RobotRulesParser command-line tool.

commit 264eea01a4d868578dcf641d6ce405444d276929
Author: Sebastian Nagel <snagel@apache.org>
Date:   2016-08-19T13:06:14Z

    Ignore robots.txt when parsing segment, refactored storing of robots.txt in FetcherThread

commit 33cdca76ac91a63445d4e761081e8124a23413af
Author: Sebastian Nagel <snagel@apache.org>
Date:   2016-08-19T13:32:34Z

    add hint and log warning that fetcher.store.robotstxt works only in combination with fetcher.store.content

----


> Fetcher to optionally save robots.txt
> -------------------------------------
>
>                 Key: NUTCH-2300
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2300
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, protocol, segment
>            Reporter: Sebastian Nagel
>             Fix For: 1.13
>
>
> For debugging or archival purposes it may be useful to let Fetcher store the robots.txt
> response (content and HTTP status). Of course, this should be optional and not enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
