lucene-solr-user mailing list archives

From "Anatharaman, Srinatha (Contractor)" <Srinatha_Ananthara...@comcast.com>
Subject RE: DataImportHandler - Unable to load Tika Config Processing Document # 1
Date Thu, 09 Feb 2017 14:38:35 GMT
Shawn,

Thanks again for your input.

As I said in my last email, I have successfully completed this in standalone Solr.
My requirement is to index emails that have already been converted to text files (there are no
attachments). Once these text files are indexed, a Solr search result should return the
entire text file as it is. I am able to achieve this in standalone Solr.
To test my code in SolrCloud, I used a small file with 3 characters in it. Solr does
not throw any error, but it also does not index the file.

I tried the approaches below:
1. Issue with DataImportHandler -- ZooKeeper is not able to read the tikaConfig.conf file at run
time
2. Issue with Flume SolrSink -- no error is shown and the file is not indexed, yet once in a while
it does get indexed even though I did not make any code changes

As you mentioned, I never saw Solr crashing or eating up CPU or RAM. The file I am indexing
is very small (it contains ABC \n DEF).
My worry is that Solr is not throwing any error, even though I set the log level to TRACE.

Thanks & Regards,
~Sri



-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
Sent: Wednesday, February 08, 2017 4:15 PM
To: solr-user@lucene.apache.org
Cc: Shawn Heisey <apache@elyograg.org>
Subject: RE: DataImportHandler - Unable to load Tika Config Processing Document # 1

> Thank you, I will follow Erick's steps.
> BTW, I am also trying to ingest using Flume. Flume uses Morphlines
> along with Tika. Will the Flume SolrSink have the same issue?

Yes, when using Tika you run the risk of it choking on a document, eating CPU and/or RAM until
everything dies. This is also true when you run it standalone. The problem is usually caused
by PDF and Office documents that are unusual, corrupt or incomplete (e.g. truncated in size)
or extremely large. But even ordinary HTML can get you into trouble due to extreme sizes or
very deeply nested elements.

But, in general, it is not a problem you will experience frequently. We operate broad,
large-scale web crawlers, ingesting all kinds of bad stuff all the time. The trick to avoid
problems is to run each Tika parse in a separate thread, with a timer that kills the thread
if it reaches a limit. It can still go wrong, but trouble is very rare.
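
A minimal sketch of the thread-plus-timer idea, using only the JDK's java.util.concurrent
classes. The class name TimedParse, the method parseWithTimeout and the timeout values are
made up for illustration; in real use the Callable would wrap the actual Tika parse call.
Note that cancelling only interrupts the worker thread, so a parser that ignores interrupts
may keep running; the worker is marked as a daemon so it at least cannot block JVM shutdown.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or Solr.
class TimedParse {

    /**
     * Runs parseTask on its own worker thread and waits at most timeoutMillis.
     * Returns the task's result, or null if the limit was reached.
     * Exceptions thrown by the task itself surface as ExecutionException.
     */
    static String parseWithTimeout(Callable<String> parseTask, long timeoutMillis)
            throws Exception {
        // One dedicated daemon thread per parse, so a stuck parse is isolated.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true);
            return worker;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Limit reached: request cancellation by interrupting the worker.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that finishes quickly.
        System.out.println(parseWithTimeout(() -> "parsed text", 1000));

        // Stand-in for a parse that hangs on a bad document.
        String result = parseWithTimeout(() -> {
            Thread.sleep(5000);
            return "never returned";
        }, 100);
        System.out.println(result == null ? "timed out" : result);
    }
}
```

Returning null on timeout keeps the sketch short; a crawler would more likely log the
document and move on, or retry it in a separate process.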

Running it standalone and talking to it over the network is safest, but not very portable or
easily distributable on Hadoop or other platforms.
