nutch-dev mailing list archives

From "Luke sh (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (NUTCH-1997) Add CBOR "magic header" to CommonCrawlDataDumper output
Date Fri, 24 Apr 2015 02:42:38 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Luke sh updated NUTCH-1997:
---------------------------
    Comment: was deleted

(was: Notes:
The attached cbor file contains magic bytes for both type xhtml and type cbor. With priority
40 on application/cbor, we will have the following issues:

Problem 1: magic priority 40.
	application/xhtml+xml has a higher priority (50) than application/cbor (40) [I don't know
who assigned 40 to cbor, or why]. So if xhtml gets read and compared first, cbor will
not even be placed in the magic estimation list because of its lower priority. Tests confirm
that xhtml is indeed read and compared with the input file first, so any type with a priority
below 50 is disregarded.


Problem 2: magic priority 50.
	In Tika, given a file dumped by the nutch dumper tool, both types (xhtml and cbor) will
be selected as candidate mime types and put in the magic estimation list. Since xhtml
gets read first, it is placed above cbor. To break that tie, tika relies on the decision
from the extension method. If the extension method fails to detect the type (for now, let's
ignore the metadata hint method for simplicity, but the same applies to it too), then xhtml
is what eventually gets returned.

Pull request I am going to send: I will set the magic priority of the cbor type to 50, the same
as xhtml, because it would probably be risky to discard any one of the estimated types without
consulting the extension method.
)

> Add CBOR "magic header" to CommonCrawlDataDumper output
> -------------------------------------------------------
>
>                 Key: NUTCH-1997
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1997
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>         Attachments: NUTCH-1997.patch
>
>
> For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} wraps a single
string value, representing the JSON text, into CBOR. 
> For instance, using the Unix {{hexdump}} tool, we can see that, as expected, the first
byte of all files is "0x7A" (the first three bits are "011", that is the major type for text strings,
and the following 5 bits are "11010", meaning a uint32_t encodes the length of the following text),
and the following 4 bytes (a 32-bit unsigned integer) encode the correct length of the text (as described
in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is currently included
in the file (a list of cbor tags is available [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
> In order to add support for CBOR detection using Apache Tika (as described in [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]),
it would be great if the {{CommonCrawlDataDumper}} tool were able to add the self-describing CBOR
"magic header" ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded
output files.
> Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] for supporting
me on this work.
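
The byte layout described in the issue can be sketched in a few lines of Python. This is
only an illustration of the RFC 7049 encoding, not the code in NUTCH-1997.patch: the function
names (encode_text, describe_head, add_magic_header) are hypothetical, and the encoder always
uses a 4-byte length to mimic the dumper output described above.

```python
import struct

# Self-describing CBOR tag 55799 (RFC 7049, section 2.4.5).
# 0xD9 = major type 6 (tag) with a 2-byte argument; 0xD9F7 = 55799.
SELF_DESCRIBE_CBOR = bytes([0xD9, 0xD9, 0xF7])


def encode_text(text: str) -> bytes:
    """Wrap text as a CBOR text string with a 4-byte length (head byte 0x7A).

    Assumption: the length is always written as a uint32, as in the
    hexdump described in the issue, even when a shorter form would fit.
    """
    payload = text.encode("utf-8")
    return bytes([0x7A]) + struct.pack(">I", len(payload)) + payload


def describe_head(data: bytes) -> tuple:
    """Return (major type, additional info) of the first byte."""
    return data[0] >> 5, data[0] & 0x1F


def add_magic_header(data: bytes) -> bytes:
    """Prepend tag 55799 unless the data already starts with it."""
    if data.startswith(SELF_DESCRIBE_CBOR):
        return data
    return SELF_DESCRIBE_CBOR + data


if __name__ == "__main__":
    blob = encode_text('{"url": "http://example.org"}')
    print(describe_head(blob))           # major type 3, additional info 26
    print(add_magic_header(blob)[:3].hex())
```

With the tag in place, the file starts with the three bytes D9 D9 F7, which gives a magic-based
detector a fixed prefix to match instead of the variable text-string head.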



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
