tika-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Tika parsing corrupt mp3
Date Thu, 05 Aug 2010 21:19:08 GMT
Hi André,

Yes, please, file an issue in JIRA and point at the mp3 file and the test case that failed.
Thanks so much!

Cheers,
Chris



On 8/5/10 8:52 AM, "André Ricardo" <andric87@gmail.com> wrote:

Hello,

I was trying some of the mp3 samples from Nutch 0.9/1.0 in Tika. The one
described as "A corrupt MP3 file that has been truncated half way through the
ID3v2 frames" returned this:

$ java -jar tika-app-0.7.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@1bf3d87
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:138)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:169)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
    at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
    at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
    at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:128)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
    ... 3 more

I also tried the latest trunk from GitHub, which reproduces the problem:

$ java -jar tika-app-0.8-SNAPSHOT.jar -v -m ~/nutch-0.9/src/plugin/parse-mp3/sample/test.mp3
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@e79839
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:169)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:110)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:193)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:72)
Caused by: java.io.IOException: Tried to read 259186 bytes, but only 65526 bytes present
    at org.apache.tika.parser.mp3.ID3v2Frame.readFully(ID3v2Frame.java:160)
    at org.apache.tika.parser.mp3.ID3v2Frame.<init>(ID3v2Frame.java:110)
    at org.apache.tika.parser.mp3.ID3v2Frame.createFrameIfPresent(ID3v2Frame.java:81)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:133)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:64)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:163)
    ... 3 more
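
To illustrate what the "Caused by" line in the traces above reports, here is a
minimal sketch using only the Java standard library. It is not Tika's actual
ID3v2Frame code: the class name TruncatedFrameSketch and the helpers
readAvailable, readFully and readLenient are hypothetical, and the lenient
variant is only one possible way a reader could tolerate a frame whose declared
size exceeds the bytes left in a truncated file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration only; this is NOT the Tika ID3v2Frame implementation.
class TruncatedFrameSketch {

    // Fills the buffer from the stream until it is full or the stream ends.
    // Returns the number of bytes actually read.
    static int readAvailable(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n == -1) {
                break; // end of stream reached before the buffer was full
            }
            total += n;
        }
        return total;
    }

    // Strict read: fails when fewer bytes are present than the frame declares.
    // The message format mirrors the one seen in the stack trace.
    static byte[] readFully(InputStream in, int declared) throws IOException {
        byte[] buffer = new byte[declared];
        int got = readAvailable(in, buffer);
        if (got < declared) {
            throw new IOException("Tried to read " + declared
                    + " bytes, but only " + got + " bytes present");
        }
        return buffer;
    }

    // Lenient read (an assumption about one possible fix): returns whatever
    // bytes are present instead of throwing on a short read.
    static byte[] readLenient(InputStream in, int declared) throws IOException {
        byte[] buffer = new byte[declared];
        int got = readAvailable(in, buffer);
        return Arrays.copyOf(buffer, got);
    }

    public static void main(String[] args) throws IOException {
        // Sizes taken from the error message: 259186 declared, 65526 present.
        byte[] truncated = new byte[65526];
        try {
            readFully(new ByteArrayInputStream(truncated), 259186);
        } catch (IOException e) {
            System.out.println("strict: " + e.getMessage());
        }
        byte[] partial = readLenient(new ByteArrayInputStream(truncated), 259186);
        System.out.println("lenient: " + partial.length + " bytes");
    }
}
```

With the strict helper, a truncated input raises the same kind of IOException
shown above; with the lenient helper, the caller receives the partial data and
can decide whether to skip the frame or keep whatever metadata was recovered.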

The mp3 is here:
http://github.com/apache/nutch/raw/tags/release-1.0/src/plugin/parse-mp3/sample/test.mp3

All the other mp3 samples were parsed correctly by Tika.

Should I open an issue in Jira? And if so, would you consider this a bug or
an improvement?

André Ricardo



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

