nutch-dev mailing list archives

From Karanjeet Singh <karan...@usc.edu>
Subject Re: Nutch: Tika Parser error while parsing an image
Date Fri, 08 Apr 2016 10:18:37 GMT
Hi Sebastian,

Thanks for your response on this.

I am developing a new plugin, protocol-htmlunit [0], for Nutch, and that is
where I am seeing this issue. Sorry, I didn't mention this in my previous
email. I wonder how the plugin affects Tika's content type detection.

The plugin has not yet been merged into Nutch, but you can pull the changes
and enable the plugin on your local system to test.

The image parsing error occurs for all images fetched with protocol-htmlunit
and, yes, it does not occur when using protocol-http.
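
Since the failure only shows up with protocol-htmlunit, one thing worth
ruling out is whether the bytes this plugin hands to the parser differ from
what protocol-http delivers. The sketch below is only a guess at a
diagnostic, not something confirmed in this thread: it assumes
(hypothetically) that the response body might still be gzip-compressed,
which would be consistent with a detected type of application/gzip. The
class name ContentSniffer and its helpers are made up for illustration; only
the JDK is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Nutch or Tika.
class ContentSniffer {

    // gzip streams start with the two magic bytes 0x1F 0x8B.
    public static boolean looksLikeGzip(byte[] b) {
        return b != null && b.length >= 2
                && (b[0] & 0xFF) == 0x1F && (b[1] & 0xFF) == 0x8B;
    }

    // JPEG files start with the three bytes 0xFF 0xD8 0xFF.
    public static boolean looksLikeJpeg(byte[] b) {
        return b != null && b.length >= 3
                && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xD8
                && (b[2] & 0xFF) == 0xFF;
    }

    // Decompress a gzip-compressed byte array.
    public static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compress a byte array with gzip (used here only to build test input).
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

If looksLikeGzip returns true for the content stored for a .jpg URL, and the
result of gunzip then passes looksLikeJpeg, the body was probably stored
without being decompressed first.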

Any ideas what I am doing wrong? I would appreciate your help.
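
The NoSuchMethodError in the quoted log names
CompressorStreamFactory.<init>(Z)V, that is, a constructor taking a single
boolean, and Sebastian suggests a commons-compress version conflict. A
minimal sketch for checking which jar a class is actually loaded from, and
whether that constructor exists there, could look like this. ClasspathProbe
is a hypothetical helper name, and the result depends on the classpath it is
run with; Nutch plugins are loaded with their own class loaders, so the
answer inside a plugin may differ.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Nutch, Tika or commons-compress.
class ClasspathProbe {

    // Returns the jar or directory a class was loaded from, or a note
    // when the JVM does not expose a code source (e.g. JDK core classes).
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "unknown (no code source)";
        }
        return src.getLocation().toString();
    }

    // True if the class has a public constructor with exactly these parameter types.
    public static boolean hasConstructor(Class<?> cls, Class<?>... paramTypes) {
        try {
            cls.getConstructor(paramTypes);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0]
                : "org.apache.commons.compress.compressors.CompressorStreamFactory";
        try {
            Class<?> cls = Class.forName(name);
            System.out.println("loaded from: " + locationOf(cls));
            // <init>(Z)V in the stack trace is a constructor taking one boolean.
            System.out.println("has (boolean) constructor: "
                    + hasConstructor(cls, boolean.class));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on this classpath");
        }
    }
}
```

If the reported location is an older commons-compress jar than the one the
Tika plugin ships with, and the boolean constructor is missing, that would
support the version-conflict explanation.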

[0]: https://github.com/apache/nutch/pull/100


P.S.: I am also developing interactive htmlunit handlers [1] (similar to
Selenium), in case you are interested in having a look.

[1]:
https://github.com/karanjeets/FocusedCrawl-Weapons/tree/master/src/main/java/edu/usc/cs/ir/htmlunit/handler


Thanks & Regards,
Karanjeet Singh
C.S. Graduate Student
University of Southern California


On Thu, Mar 31, 2016 at 2:19 PM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:

> Hi,
>
> I'm not able to reproduce the problem, at least,
> not with recent master (1.12 snapshot) and the default configuration:
>
> % bin/nutch parsechecker
> '
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE&e=
> '
> fetching:
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=tcwDNixM_kDuk3n2rhA-viTbYxhEhyeSauPhPY5kg7w&e=
> ...
> ...
> parsing:
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=tcwDNixM_kDuk3n2rhA-viTbYxhEhyeSauPhPY5kg7w&e=
> ...
> contentType: image/jpeg
> ...
> Parse Metadata: X-Parsed-By=org.apache.tika.parser.jpeg.JpegParser
> Resolution Units=none File
> Modified Date=Thu Mar 31 23:04:11 CEST 2016 Comments=CREATOR: gd-jpeg v1.0
> (using IJG JPEG v80),
> quality = 75
>  Compression Type=Baseline Data Precision=8 bits Number of Components=3
> tiff:ImageLength=240
> Component 2=Cb component: Quantization table 1, Sampling factors 1 horiz/1
> vert w:comments=CREATOR:
> gd-jpeg v1.0 (using IJG JPEG v80), quality = 75
>  Component 1=Y component: Quantization table 0, Sampling factors 2 horiz/2
> vert Image Height=240
> pixels X Resolution=1 dot Image Width=240 pixels File Size=10351 bytes
> Component 3=Cr component:
> Quantization table 1, Sampling factors 1 horiz/1 vert comment=CREATOR:
> gd-jpeg v1.0 (using IJG JPEG
> v80), quality = 75
>  JPEG Comment=CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 75 File
> Name=apache-tika-8877046173076964154.tmp tiff:BitsPerSample=8
> tiff:ImageWidth=240
> Content-Type=image/jpeg Y Resolution=1 dot
>
> Is the error reproducible with parsechecker and the same config?
>
> The stack trace may indicate a version conflict of the commons-compress
> library.
> But the mime type is already not properly recognized.
> Which plugins are activated in nutch-site.xml?
>
> Sebastian
>
> On 03/31/2016 11:40 AM, Karanjeet Singh wrote:
> > Hello,
> >
> > I am getting the error below *[0]* while parsing an image. It seems Tika is detecting the URL
> > (
> > https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE&e=
> > )
> > as application/gzip instead of image/jpg.
> >
> > Can anyone shed some light on this? Or please confirm if it is a bug. Meanwhile, I will be looking
> > into the code to see what is going wrong. I am working on the latest build.
> >
> > *[0]*:
> >
> > 2016-03-31 02:20:29,980 WARN  parse.ParseUtil - Error parsing
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE&e=
> > with org.apache.nutch.parse.tika.TikaParser@48c56835
> >
> > java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:
> >
> org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
> >
> > at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> >
> > at java.util.concurrent.FutureTask.get(FutureTask.java:202)
> >
> > at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:171)
> >
> > at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:95)
> >
> > at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104)
> >
> > at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:45)
> >
> > at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> >
> > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> >
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
> >
> > at
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> >
> > at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >
> > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >
> > at java.lang.Thread.run(Thread.java:745)
> >
> > Caused by: java.lang.NoSuchMethodError:
> >
> org.apache.commons.compress.compressors.CompressorStreamFactory.<init>(Z)V
> >
> > at
> org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:120)
> >
> > at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:132)
> >
> > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >
> > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >
> > ... 4 more
> >
> > 2016-03-31 02:20:29,980 WARN  parse.ParseUtil - Unable to successfully
> parse content
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE&e=
> > of type application/gzip
> >
> > 2016-03-31 02:20:29,980 WARN  parse.ParseSegment - Error parsing:
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg-3A&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=_I0Ykwqpu1yhw0HEwodldz-Br4x5Lxd9lnfrSfcpJRA&e=
> > failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> successfully parse content
> >
> > 2016-03-31 02:20:29,981 INFO  cosine.CosineSimilarity - Setting score of
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE&e=
> > to 0.0
> >
> > 2016-03-31 02:20:29,981 INFO  parse.ParseSegment - Parsed
> > (19ms):
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sturmgewehr.com_forums_uploads_monthly-5F2016-5F01_412098676.jpg.41e2d3562701152834b1c10b068388e3.thumb.jpg.fe9b6fad3ae9d371830b52db8c271189.jpg&d=CwIDaQ&c=clK7kQUTWtAVEOVIgvi0NU5BOUHhpN0H8p7CSfnc_gI&r=u7neGGUaVmQKNSLUqJ9zpA&m=5vneww7ikrGBSNwAf7cR-hFaE74g2SZBfLFQv5HlefM&s=iiJqYT5J3YOPt_OhEY5uPQLXIloaw87EPBVFbQlZAOE&e=
> >
> >
> > Thanks & Regards,
> > Karanjeet Singh
> > CS Graduate Student
> > University of Southern California
> > karanjes@usc.edu <mailto:karanjes@usc.edu> | +1-213-675-9583
> <tel:%2B1-213-675-9583>
>
>