tika-dev mailing list archives

From Juri Linkov <juri.lin...@gmail.com>
Subject Disable zip decompression
Date Fri, 08 Mar 2013 12:07:51 GMT
Hello,

I am using Tika to extract metadata from sequence alignment BAM files with the
help of SAMtools (http://samtools.sourceforge.net/) and its Java implementation,
Picard. This leads to the following problem: the BAM format is gzip-based,
so Tika decompresses BAM files before passing them to the BAM parser.
Usually this is the right thing to do, since most parsers expect
uncompressed input, but for BAM files it is a disservice: the BAM parser
requires its input in the original compressed format.
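
To illustrate why a BAM file gets treated as gzip in the first place, here is a
minimal sketch of a magic-byte check. It is not Tika's detector; the class name
GzipSniffer is hypothetical, and the payload is a stand-in written with the
JDK's GZIPOutputStream rather than real BAM data. It only shows that anything
gzip-based starts with the bytes 0x1f 0x8b, which is all a magic-based detector
gets to see.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of Tika or Picard.
class GzipSniffer {

    // Every gzip member begins with ID1 = 0x1f, ID2 = 0x8b (RFC 1952).
    static boolean looksLikeGzip(byte[] head) {
        return head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a gzip-based file; real BAM content is not reproduced here.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("stand-in payload".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("gzip magic present: " + looksLikeGzip(buf.toByteArray()));
    }
}
```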

I found one possible solution: declare the BAM MIME type as a subclass of gzip:

  <mime-type type="application/bam">
    <sub-class-of type="application/x-gzip" />
    <_comment>SAMtools BAM format</_comment>
    <glob pattern="*.bam" />
  </mime-type>

In this case the default detection prefers the more specific type
over its more generic supertype, so the file is no longer
detected as plain gzip.
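
The preference I am relying on can be sketched roughly as follows. This is only
my understanding of the behaviour, not Tika's actual code: the class
SpecializationSketch, the SUPERTYPE map and both method names are hypothetical,
and the map simply mirrors the sub-class-of declaration above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "prefer the specialized type"; not Tika's API.
class SpecializationSketch {

    // child type -> declared supertype, mirroring <sub-class-of>
    static final Map<String, String> SUPERTYPE = new HashMap<>();
    static {
        SUPERTYPE.put("application/bam", "application/x-gzip");
        SUPERTYPE.put("application/x-gzip", "application/octet-stream");
    }

    // True if 'ancestor' appears somewhere above 'type' in the supertype chain.
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = SUPERTYPE.get(type); t != null; t = SUPERTYPE.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Among candidate types, keep the one that specializes the current best.
    static String mostSpecific(List<String> candidates) {
        String best = candidates.get(0);
        for (String c : candidates) {
            if (isSpecializationOf(c, best)) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Magic bytes suggest gzip, the *.bam glob suggests application/bam.
        List<String> hits = Arrays.asList("application/x-gzip", "application/bam");
        System.out.println(mostSpecific(hits));
    }
}
```

If the real conflict resolution works differently, the workaround above may
stop working, which is part of my concern.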

Disabling the decompression this way seems too implementation-dependent.
Could it cause side effects?
Is there a preferable way to disable the decompression and to configure
the priority used to resolve a conflict between two similar MIME types?

-- 
Best regards,
Juri
