nutch-dev mailing list archives

From "Giuseppe Totaro (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1975) New configuration for CommonCrawlDataDumper tool
Date Thu, 02 Apr 2015 19:13:53 GMT


Giuseppe Totaro updated NUTCH-1975:
    Attachment: NUTCH-1975.v03.patch

Patch v03 adds handling for filenames that are too long. More specifically, the file extension
is truncated if it is longer than {{MAX_LENGTH_OF_EXTENSION}}, as is done in other methods of {{DumpFileUtil}}.

Note that the file extension refers to the text after the last dot in the URL string. This part
can be either the actual extension of the file or other text (e.g., text after the last dot
in the query part, if any). However, the SHA1 digest is calculated against the original (not truncated)
Thanks [~chrismattmann] for testing this new configuration.
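
As a rough illustration of the truncation described above, the following Java sketch takes the text after the last dot in a URL and cuts it to a maximum length. The class name, the method name, and the limit of 5 are assumptions made for this example only; the actual value of {{MAX_LENGTH_OF_EXTENSION}} and the code in {{DumpFileUtil}} may differ.

```java
// Illustrative sketch only; not the code from the patch.
class ExtensionTruncationSketch {

    // Hypothetical limit: the real MAX_LENGTH_OF_EXTENSION value is not stated in this message.
    static final int MAX_LENGTH_OF_EXTENSION = 5;

    /**
     * Returns the text after the last dot in the URL string,
     * truncated to MAX_LENGTH_OF_EXTENSION characters.
     * Returns an empty string if there is no dot or nothing follows it.
     */
    static String truncatedExtension(String url) {
        int lastDot = url.lastIndexOf('.');
        if (lastDot < 0 || lastDot == url.length() - 1) {
            return "";
        }
        String extension = url.substring(lastDot + 1);
        if (extension.length() > MAX_LENGTH_OF_EXTENSION) {
            // Cut overly long "extensions", e.g. text after a dot in the query part.
            extension = extension.substring(0, MAX_LENGTH_OF_EXTENSION);
        }
        return extension;
    }

    public static void main(String[] args) {
        // A regular file extension is short enough and is kept as-is.
        System.out.println(truncatedExtension("http://example.com/a/page.html"));
        // Text after the last dot in the query part is cut to the limit.
        System.out.println(truncatedExtension("http://example.com/search?q=a.verylongquerytext"));
    }
}
```

With the assumed limit of 5, a normal extension such as "html" is left untouched, while long text after a dot in the query part is shortened.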

> New configuration for CommonCrawlDataDumper tool
> ------------------------------------------------
>                 Key: NUTCH-1975
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>    Affects Versions: 1.9
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-1975.patch, NUTCH-1975.v02.patch, NUTCH-1975.v03.patch
> Hi all, attached you can find a new patch including support for new options for
> The new options are passed to the {{CommonCrawlFormat}} object (which provides
methods to create JSON output) using a configuration object ({{CommonCrawlConfig}}).
> In particular, in this patch {{CommonCrawlDataDumper}} provides support for the following
> * {{-SimpleDataFormat}}: enables timestamps in GMT epoch (milliseconds) format.
> * {{-epochFilename}}: extracted files will be organized in a reversed-DNS tree based
on the FQDN of the webpage, followed by a SHA1 hash of the complete URL. Scraped data will
be stored in these directories as individual GMT-timestamped files named using "epoch time (in
milliseconds)" plus file extension.
> * {{-jsonArray}}: organizes both request and response headers into a JSON array instead
of using a JSON sub-object.
> * {{-reverseKey}}: uses the same layout as described for the {{-epochFilename}} option,
with underscores in place of directory separators.
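
The layout described for {{-epochFilename}} and {{-reverseKey}} could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the example URL and timestamp are arbitrary, and the exact way the patch joins the parts (for instance, whether the file name itself is also joined with an underscore) is not specified in this message.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; names and joining rules are assumptions, not the patch's code.
class EpochFilenameSketch {

    /** Reverses the labels of a host name, e.g. www.example.com -> com/example/www. */
    static String reverseHost(String host, String separator) {
        String[] labels = host.split("\\.");
        StringBuilder reversed = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            reversed.append(labels[i]);
            if (i > 0) {
                reversed.append(separator);
            }
        }
        return reversed.toString();
    }

    /** Returns the SHA1 digest of the given text as a lowercase hex string. */
    static String sha1Hex(String text) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /**
     * Builds: reversed host, SHA1 of the complete URL, then "<epochMillis>.<extension>",
     * joined with the given separator ("/" for a directory tree, "_" for a flat key).
     */
    static String layout(String url, long epochMillis, String extension, String separator) {
        String host = URI.create(url).getHost();
        return reverseHost(host, separator) + separator
                + sha1Hex(url) + separator
                + epochMillis + "." + extension;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/index.html"; // arbitrary example URL
        long timestamp = 1428002033000L;                   // arbitrary epoch time in milliseconds
        System.out.println(layout(url, timestamp, "html", "/")); // directory-style layout
        System.out.println(layout(url, timestamp, "html", "_")); // underscore-separated layout
    }
}
```

The only difference between the two calls is the separator, which mirrors the description of {{-reverseKey}} as the same layout with underscores in place of directory separators.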
> You can use the options above in addition to the options already supported, as described
in the [Nutch wiki|] page.
> This patch starts from [NUTCH-1974|].
> Thanks [~chrismattmann] and [~annieburgess] for supporting me on this work.

This message was sent by Atlassian JIRA
