nutch-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
Date Sat, 25 Apr 2015 16:50:39 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512589#comment-14512589
] 

Hudson commented on NUTCH-1997:
-------------------------------

FAILURE: Integrated in Nutch-trunk #3089 (See [https://builds.apache.org/job/Nutch-trunk/3089/])
NUTCH-1997: Fix for Add CBOR magic header to CommonCrawlDataDumper output contributed by Giuseppe
Totaro, and Luke Sh. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1676029)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java


> Add CBOR "magic header" to CommonCrawlDataDumper output
> -------------------------------------------------------
>
>                 Key: NUTCH-1997
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1997
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: NUTCH-1997.patch
>
>
> For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single
string value, representing the JSON text, into CBOR. 
> For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first
byte of all files is "0x7A" (the first three bits are "011", that is the major type for text strings,
and the following 5 bits are "11010", meaning a uint32_t encodes the length of the following text),
and the following 4 bytes (a uint32_t) encode the correct length of the text (as described
in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is currently included
in the file (a list of CBOR tags is available [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
> In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]),
it would be great if the {{CommonCrawlDataDumper}} tool could add the self-describing CBOR
"magic header" ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded
output files.
> Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting
me on this work.
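
The byte layout discussed in the description can be illustrated with a small Python sketch. This is not the Nutch code (which is Java and is not shown in this message); the function names encode_text_uint32, describe_head and add_magic_header are hypothetical, and the sketch assumes the text string always uses a 4-byte (uint32) length, as the description reports. The only facts taken from RFC 7049 are the initial-byte layout (3 bits of major type, 5 bits of additional information) and the encoding of Tag 55799 as the bytes 0xD9 0xD9 0xF7.

```python
import struct

# Self-describe CBOR, Tag 55799 (RFC 7049, section 2.4.5): 0xD9 is major
# type 6 (tag) with a 2-byte tag number following, and 0xD9F7 == 55799.
SELF_DESCRIBE_TAG = bytes([0xD9, 0xD9, 0xF7])


def encode_text_uint32(text):
    """Encode text as a CBOR text string with a 4-byte length.

    Assumption: this mirrors the layout reported in the issue; a general
    CBOR encoder may choose a shorter length encoding for short strings.
    """
    payload = text.encode("utf-8")
    # 0x7A == 0b011_11010: major type 3 (text string),
    # additional info 26 (a uint32 length follows).
    return bytes([0x7A]) + struct.pack(">I", len(payload)) + payload


def describe_head(data):
    """Return (major_type, additional_info, length) for a uint32-length item."""
    initial = data[0]
    major_type = initial >> 5          # top 3 bits
    additional_info = initial & 0x1F   # low 5 bits
    if additional_info != 26:
        raise ValueError("this sketch only handles 4-byte lengths")
    (length,) = struct.unpack(">I", data[1:5])  # big-endian uint32
    return major_type, additional_info, length


def add_magic_header(data):
    """Prepend Tag 55799 unless the data already starts with it."""
    if data.startswith(SELF_DESCRIBE_TAG):
        return data
    return SELF_DESCRIBE_TAG + data
```

With this sketch, a file produced by add_magic_header(encode_text_uint32(json_text)) starts with the bytes d9 d9 f7 7a, so a detector can match on the first three bytes before reading the text string itself.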



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
