nutch-dev mailing list archives

From "Auro Miralles (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser
Date Tue, 05 Jan 2016 11:42:40 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082904#comment-15082904 ]

Auro Miralles commented on NUTCH-2168:
--------------------------------------

Hello. I have no idea which document fails... I can crawl without problems with the index-html
plugin disabled, but Nutch fails at the third iteration when I enable it. There are only two
URLs in my seed.txt, and ignore.external.links is set to true.

http://ujiapps.uji.es/
https://wiki.apache.org/nutch/
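
For reference, the setting mentioned above corresponds to the db.ignore.external.links property (defined in nutch-default.xml); a minimal nutch-site.xml fragment matching this crawl's setup would be:

```xml
<!-- nutch-site.xml: keep the crawl within the seed hosts -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Ignore outlinks that point to a different host.</description>
</property>
```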

$ bin/crawl urls/ testCrawl http://localhost:8983/solr/ 3
....
....
....
Parsing https://wiki.apache.org/nutch/bin/nutch%20mergelinkdb
Parsing https://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/formacio/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/compres/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/productesfinancers/index.jpg
ParserJob: success
ParserJob: finished at 2016-01-05 12:27:42, time elapsed: 00:00:14
CrawlDB update for testCrawl
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2
-D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -all -crawlId
testCrawl
DbUpdaterJob: starting at 2016-01-05 12:27:42
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2016-01-05 12:27:49, time elapsed: 00:00:06
Indexing testCrawl on SOLR index -> http://localhost:8983/solr/
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false
-D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/ -all -crawlId
testCrawl
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[testCrawl]Indexer, jobid=job_local1207147570_0001
	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

Error running:
  /home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2
-D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/
-all -crawlId testCrawl
Failed with exit value 255.



HADOOP.LOG
....
....
....
2016-01-05 12:28:00,151 INFO  html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/alumnisauji/jasoc/alumnisaujipremium/instal_lacions/se/
2016-01-05 12:28:00,152 INFO  html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO  html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN  mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at
char #137317, byte #139263)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class
java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
        at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
        at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
        at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-01-05 12:28:01,605 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException:
job failed: name=[testCrawl]Indexer, jobid=job_local1207147570_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
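
The root cause in the log is Solr rejecting a document that contains U+FFFE, which is not a valid XML character. A possible workaround (this helper is not part of Nutch; it is a hypothetical sketch for illustration) would be to strip characters outside the XML 1.0 range from field values before they reach SolrIndexWriter:

```java
// Hypothetical helper (not part of Nutch): drops code points that are
// invalid in XML 1.0 and therefore rejected by Solr, e.g. the
// non-character U+FFFE seen in the log above.
public class XmlCharStripper {

    public static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            // XML 1.0 valid ranges: #x9 | #xA | #xD | [#x20-#xD7FF]
            //                       | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
            boolean valid =
                cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FFFE is removed, surrounding text is kept
        System.out.println(strip("abc\uFFFEdef")); // prints "abcdef"
    }
}
```

Applying such a filter to document field values (e.g. in an indexing filter or in SolrIndexWriter.write) should let the batch of 250 documents be added without the CharConversionException.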

> Parse-tika fails to retrieve parser
> -----------------------------------
>
>                 Key: NUTCH-2168
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2168
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.1
>            Reporter: Sebastian Nagel
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
