tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-448) Tika FLVParser hangs
Date Wed, 30 Jun 2010 09:28:50 GMT

    [ https://issues.apache.org/jira/browse/TIKA-448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883858#action_12883858 ]

Jukka Zitting commented on TIKA-448:
------------------------------------

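The InputStream.skip() method is allowed to return 0 on any call, even when more data is available, so a loop that waits for skip() to make progress may never terminate. See IO-203 for related discussion.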

It might be easiest to simply always read() the tag content into memory instead of trying
to skip() it. The performance and memory overhead shouldn't be too high.
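
A minimal sketch of that idea, not the actual patch: the class and method names below are hypothetical, and it assumes the parser holds a DataInputStream and already knows the tag's data length. DataInputStream.readFully() either fills the whole buffer or throws EOFException on a truncated stream, so it cannot spin the way a skip() loop can.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, named here only for illustration.
class TagContentReader {

    /**
     * Reads exactly datalen bytes of tag content into memory.
     * readFully() blocks until the buffer is full and throws
     * EOFException if the stream ends early, so this never loops forever.
     */
    static byte[] readTagContent(DataInputStream in, int datalen) throws IOException {
        byte[] content = new byte[datalen];
        in.readFully(content);
        return content;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[100];

        // A stream whose skip() always returns 0, which the InputStream
        // contract permits. A skip() loop would hang on this stream.
        InputStream stubborn = new FilterInputStream(new ByteArrayInputStream(payload)) {
            @Override
            public long skip(long n) {
                return 0;
            }
        };

        DataInputStream in = new DataInputStream(stubborn);
        byte[] unhandledTag = readTagContent(in, 40);
        System.out.println("consumed " + unhandledTag.length + " bytes");
    }
}
```

The buffer for an unhandled tag is simply discarded after the call. If allocating the whole tag at once turns out to be a concern, the same approach works with a small fixed buffer and repeated read() calls.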

> Tika FLVParser hangs
> --------------------
>
>                 Key: TIKA-448
>                 URL: https://issues.apache.org/jira/browse/TIKA-448
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Linux JDK 1.6u13, Nutch 1.1
>            Reporter: Jeroen van Vianen
>         Attachments: FLVParser.patch
>
>
> I am crawling a site with Nutch and creating an index using SOLR.
> After happy crawling for a couple of hours, my Nutch Parse phase hangs. A thread dump shows:
> "Thread-12" prio=10 tid=0xb4974000 nid=0x1b1b runnable [0xb4a50000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.FilterInputStream.skip(FilterInputStream.java:125)
>         at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:246)
>         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:85)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:41)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> The only way I can see the code getting stuck there is if skip(datalen - skiplen) keeps returning 0 for whatever reason in org.apache.tika.parser.video.FLVParser.parse, around line 246:
>                 // Tag was not metadata, skip over data we cannot handle
>                 for (int skiplen = 0; skiplen < datalen;) {
>                     long currentSkipLen = datainput.skip(datalen - skiplen);
>                     skiplen += currentSkipLen;
>                 }
> As I don't know which downloaded FLV caused the problem, I cannot easily create a test case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

