tika-dev mailing list archives

From "Jeroen van Vianen (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-448) Tika FLVParser hangs
Date Tue, 29 Jun 2010 17:23:51 GMT
Tika FLVParser hangs

                 Key: TIKA-448
                 URL: https://issues.apache.org/jira/browse/TIKA-448
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
         Environment: Linux JDK 1.6u13, Nutch 1.1
            Reporter: Jeroen van Vianen

I am crawling a site with Nutch and building an index with Solr.

After crawling without problems for a couple of hours, the Nutch parse phase hangs. A thread dump shows:

"Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a50000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FilterInputStream.skip(FilterInputStream.java:125)
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

The only explanation I can see for the code being stuck there is that skip(datalen - skiplen)
keeps returning 0, so skiplen never advances and the loop never terminates. The loop is in
org.apache.tika.parser.video.FLVParser.parse, around line 246:

                // Tag was not metadata, skip over data we cannot handle
                for (int skiplen = 0; skiplen < datalen;) {
                    long currentSkipLen = datainput.skip(datalen - skiplen);
                    skiplen += currentSkipLen;

As I don't know which downloaded FLV file caused the problem, I cannot easily create a test case.
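
For illustration, here is a minimal sketch of a skip loop that cannot spin when skip() returns 0. It is not the Tika code or a proposed patch: the names SafeSkip, skipFully and ZeroSkipStream are made up for this example. It assumes that falling back to a single read() is an acceptable way to either make progress or detect end of stream, which the InputStream contract permits because skip() may legitimately skip fewer bytes than requested, possibly zero.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SafeSkip {

    /** Skips exactly n bytes, or throws EOFException if the stream ends first. */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                // skip() made no progress and read() reports end of stream:
                // give up instead of looping forever.
                throw new EOFException(remaining + " bytes left to skip at end of stream");
            } else {
                // skip() made no progress, but read() consumed one byte.
                remaining--;
            }
        }
    }

    /** Hypothetical stream whose skip() never makes progress (stands in for the problem input). */
    static final class ZeroSkipStream extends FilterInputStream {
        ZeroSkipStream(InputStream in) {
            super(in);
        }

        @Override
        public long skip(long n) {
            return 0;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ZeroSkipStream(
                new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5}));
        skipFully(in, 3);
        System.out.println(in.read()); // prints 4, the next unread byte
        try {
            skipFully(in, 10); // only one byte is left
        } catch (EOFException e) {
            System.out.println("EOF reached");
        }
    }
}
```

With a stream like ZeroSkipStream, the loop quoted above would never exit, while skipFully either consumes the requested bytes or fails with an EOFException when the data is truncated.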

