nutch-dev mailing list archives

From "Giuseppe Totaro (JIRA)" <>
Subject [jira] [Created] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
Date Wed, 22 Apr 2015 18:34:00 GMT
Giuseppe Totaro created NUTCH-1997:

             Summary: Add CBOR "magic header" to CommonCrawlDataDumper output
                 Key: NUTCH-1997
             Project: Nutch
          Issue Type: Bug
          Components: tool
            Reporter: Giuseppe Totaro

For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single
string value, representing the JSON text, into CBOR.
For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first byte
of every file is "0x7A" (the first three bits are "011", the major type for text strings,
and the following 5 bits are "11010", meaning that a uint32_t encodes the length of the following
text), and the following 4 bytes (a big-endian unsigned 32-bit integer) encode the correct length
of the text (as described in [RFC7049|]). Therefore, no CBOR tag is currently included
in the file (a list of CBOR tags is available [here|]).
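
The header layout described above (major type in the top three bits of the first byte, additional information in the low five bits, then a 4-byte big-endian length) can be checked with a short script. This is only a sketch: the function name {{describe_cbor_head}} and the sample payload are hypothetical, and it handles only the 4-byte length case discussed here.

```python
import struct


def describe_cbor_head(data: bytes) -> dict:
    """Split the first CBOR byte into major type and additional information.

    Only additional information 26 (a 4-byte big-endian argument follows)
    is decoded further, since that is the case discussed in this issue.
    """
    initial = data[0]
    info = {
        "major_type": initial >> 5,         # top 3 bits
        "additional_info": initial & 0x1F,  # low 5 bits
    }
    if info["additional_info"] == 26:
        # The next 4 bytes are an unsigned 32-bit integer in network byte order.
        (info["length"],) = struct.unpack(">I", data[1:5])
    return info


# Hypothetical payload standing in for the JSON text of one dumped record.
payload = b'{"url": "http://example.org/"}'
# Hand-built text-string head: 0b011_11010 followed by the 4-byte length.
first_byte = (3 << 5) | 26
sample = bytes([first_byte]) + struct.pack(">I", len(payload)) + payload

head = describe_cbor_head(sample)
print(hex(first_byte), head)
```

Major type 3 is a text string, and the decoded length should equal the number of payload bytes that follow the 5-byte head.
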
In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|]),
it would be great if the {{CommonCrawlDataDumper}} tool could add the self-describing CBOR
"magic header" ([Tag 55799|]) to its CBOR-encoded
output files.
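
As an illustration of what the requested change amounts to at the byte level: tag 55799 is encoded as major type 6 with a 2-byte argument, i.e. the three bytes 0xD9 0xD9 0xF7, and it simply precedes the existing data item. The sketch below is not the {{CommonCrawlDataDumper}} implementation; the helper name {{add_magic_header}} and the sample bytes are hypothetical.

```python
import struct

# Tag 55799 (self-described CBOR): initial byte 0xD9 = major type 6 with
# additional information 25 (2-byte argument), then 0xD9F7 == 55799.
SELF_DESCRIBE_CBOR = bytes([0xD9, 0xD9, 0xF7])


def add_magic_header(cbor_bytes: bytes) -> bytes:
    """Prefix an encoded CBOR item with tag 55799 unless it is already there."""
    if cbor_bytes.startswith(SELF_DESCRIBE_CBOR):
        return cbor_bytes
    return SELF_DESCRIBE_CBOR + cbor_bytes


# Hypothetical untagged output: a text string with a 4-byte length head.
payload = b'{"url": "http://example.org/"}'
untagged = bytes([0x7A]) + struct.pack(">I", len(payload)) + payload

tagged = add_magic_header(untagged)
print(tagged[:8].hex())
```

A detector can then recognise the file from its first three bytes, while a CBOR decoder that ignores the tag still sees the same text string.
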
Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting
me on this work.

This message was sent by Atlassian JIRA
