commons-dev mailing list archives

From "Allison, Timothy B."
Subject [compress] FW: Tika content detection and crawled "remote" content
Date Wed, 05 Jul 2017 12:32:37 GMT
Fellow file-philes on [compress],
Sebastian Nagel has added file type identification via Apache Tika to Common Crawl.  While
Tika is not 100% accurate, this means that we have far better clarity on MIME type than we
would get by relying on the HTTP header plus file suffix.  So, for testing purposes, you (or
we over on Tika) can much more easily gather a small test corpus of files by MIME type.

Many, many thanks to Sebastian and Common Crawl!



-----Original Message-----
From: Sebastian Nagel
Sent: Tuesday, July 4, 2017 6:18 AM
Subject: Tika content detection and crawled "remote" content


I recently plugged Tika's content detection into Common Crawl's crawler (a modified Nutch)
with the aim of getting clean and correct MIME types: the HTTP Content-Type may contain
garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of the content types sent by the server
in the HTTP header and those detected by Tika 1.15 [2].  It shows that the content types
from Tika are definitely cleaner (1,400 different content types vs. more than 6,000 content
type "strings" from HTTP headers).
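
To illustrate why raw header "strings" outnumber real content types, here is a minimal
Python sketch that reduces a Content-Type header value to a bare type/subtype before
counting.  The function name, the sample values, and the choice to map unparseable values
to "unk" are assumptions made for illustration; this is not how the comparison in [2] was
produced.

```python
from collections import Counter


def normalize_content_type(raw):
    """Reduce a raw Content-Type header value to a lower-case 'type/subtype'."""
    if raw is None:
        return "unk"
    # Drop parameters such as "; charset=utf-8" and surrounding whitespace.
    media_type = raw.split(";", 1)[0].strip().lower()
    # Anything that is not exactly "type/subtype" is treated as unknown.
    parts = media_type.split("/")
    if len(parts) != 2 or not all(parts):
        return "unk"
    return media_type


# Hypothetical header values; several distinct strings collapse to one type.
headers = [
    "text/html; charset=UTF-8",
    "Text/HTML",
    "text/html;charset=iso-8859-1",
    "",
    "html",
]
counts = Counter(normalize_content_type(h) for h in headers)
```

Five distinct header strings collapse to two values here, which is the same effect, on a
tiny scale, as the 6,000 vs. 1,400 difference above.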

A look at the "confusions", where Content-Type and Tika differ, shows a mixed picture: some
pairs are plausible, e.g., where Tika changes the type to a more precise subtype or detects
a MIME type at all where the HTTP header value is unknown (unk):

     Count  Tika-1.15                HTTP-Content-Type
1001968023  application/xhtml+xml    text/html
   2298146  application/rss+xml      text/xml
    617435  application/rss+xml      application/xml
    613525  text/html                unk
    361525  application/xhtml+xml    unk
    297707  application/rdf+xml      application/xml

However, there are a few dubious decisions, especially for the group of server-side web
scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

  Count  Tika-1.15         HTTP-Content-Type
2047739  text/x-php        text/html
 681629  text/asp          text/html
 193095  text/x-coldfusion text/html
 172318  text/aspdotnet    text/html
 139033  text/x-jsp        text/html
  38415  text/x-cgi        text/html
  32092  text/x-php        text/xml
  18021  text/x-perl       text/html

Of course, due to misconfigurations, some servers may deliver the script files unmodified,
but in general I wouldn't expect this to happen for millions of pages.  I've checked some
of the affected URLs and found:

- HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)

- (overlong) comment block at start of HTML which "masks" the HTML declaration

- HTML with some scripting fragments ("<?php?>") present:

- others that are clearly HTML (this looks more like a bug; at least, there is no simple explanation)

Obviously, certain file suffixes (.php, .aspx) should get less weight than the Content-Type
sent by the responding server.
Now my question: where is the best place to fix this, in the crawler [3] or in Tika?
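
As one possible shape for a crawler-side workaround, the following Python sketch falls back
to the server's Content-Type whenever the detected type is one of the server-side script
types from the table above and the server declared a markup type.  The function and
constant names, and the choice of markup types, are assumptions for illustration only; this
is neither Tika's nor the crawler's actual logic.

```python
# Script types listed in the confusion table in this message.
SERVER_SIDE_SCRIPT_TYPES = {
    "text/x-php",
    "text/asp",
    "text/x-coldfusion",
    "text/aspdotnet",
    "text/x-jsp",
    "text/x-cgi",
    "text/x-perl",
}

# Markup types a server plausibly declares for rendered script output (assumed set).
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "text/xml", "application/xml"}


def reconcile(detected, http_content_type):
    """Pick a MIME type, trusting the server over a script-type detection."""
    # Strip parameters such as "; charset=utf-8" from the header value.
    http_type = http_content_type.split(";", 1)[0].strip().lower()
    if detected in SERVER_SIDE_SCRIPT_TYPES and http_type in MARKUP_TYPES:
        # A rendered page fetched over HTTP is unlikely to be raw script source.
        return http_type
    return detected
```

With this rule, reconcile("text/x-php", "text/html; charset=utf-8") yields "text/html",
while a plausible refinement such as application/rss+xml over text/xml is left untouched.
The cost is that a server which really does leak unprocessed script source would be
mislabelled as markup.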

If anyone is interested in using the detected MIME types or anything else from Common Crawl
- I'm happy to help!  The URL index [4] now contains a new field, "mime-detected", which
makes it easy to search or grep for confusion pairs.
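
A minimal Python sketch of such a search, counting confusion pairs from URL index lines.
It assumes that each index line ends with a JSON object and that the HTTP Content-Type is
stored in a field named "mime" next to "mime-detected"; the sample lines and the function
name are made up for illustration.

```python
import json
from collections import Counter


def confusion_pairs(lines):
    """Count (detected, HTTP) MIME pairs that differ, one index record per line."""
    pairs = Counter()
    for line in lines:
        # Assumed layout: optional key prefix, then a JSON object.
        start = line.find("{")
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip lines whose JSON part is malformed
        http_mime = record.get("mime", "unk")
        detected = record.get("mime-detected", "unk")
        if http_mime != detected:
            pairs[(detected, http_mime)] += 1
    return pairs


# Hypothetical index lines, not real crawl data.
sample = [
    'com,example)/a.php 20170623000000 {"mime": "text/html", "mime-detected": "text/x-php"}',
    'com,example)/b.php 20170623000001 {"mime": "text/html", "mime-detected": "text/x-php"}',
    'com,example)/feed 20170623000002 {"mime": "text/xml", "mime-detected": "application/rss+xml"}',
    'com,example)/ 20170623000003 {"mime": "text/html", "mime-detected": "text/html"}',
]

for (detected, http_mime), count in confusion_pairs(sample).most_common():
    print(f"{count:8d}  {detected:<24} {http_mime}")
```

The output uses the same column order as the tables above (count, Tika type, HTTP
Content-Type), so results can be compared against them directly.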

Thanks and best,

[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
