manifoldcf-user mailing list archives

From Shinichiro Abe <shinichiro.ab...@gmail.com>
Subject Re: Treatment of protected files
Date Fri, 20 May 2011 15:44:50 GMT

On the Solr side, an ignoreTikaException flag is being introduced in the next version (SOLR-2480).
If ignoreTikaException is true, Solr responds successfully and does not return a server
error (TikaException).
If ignoreTikaException is false (the default), Solr responds with a server error (TikaException)
in cases such as a parse error.
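
As a rough illustration of the flag described above, here is a minimal Java sketch that builds an extract request URL with ignoreTikaException set. The /update/extract path and the literal.id parameter appear in the Solr log quoted later in this thread; the class name, the method name, and the base URL http://localhost:8983/solr are only assumptions for the example, and the flag is only honored by a Solr build that includes SOLR-2480.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or ManifoldCF.
final class ExtractUrlSketch {

    // Builds the URL for a Solr Cell extract request.
    // solrBase is an assumed base URL such as "http://localhost:8983/solr".
    static String buildExtractUrl(String solrBase, String documentId, boolean ignoreTikaException) {
        // The document id is often a URL or file path, so it must be URL-encoded.
        String id = URLEncoder.encode(documentId, StandardCharsets.UTF_8);
        return solrBase + "/update/extract"
            + "?literal.id=" + id
            + "&ignoreTikaException=" + ignoreTikaException;
    }

    public static void main(String[] args) {
        System.out.println(
            buildExtractUrl("http://localhost:8983/solr", "file:/docs/protected.pdf", true));
    }
}
```

The same parameter could presumably also be set as a default on the request handler in solrconfig.xml instead of on every request.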

SOLR-2480 differs from CONNECTORS-200 in the following respect:
if a TikaException is thrown, Solr can still index the metadata of the file while it skips
indexing the contents.
CONNECTORS-200 does not index the metadata and never ingests the document.
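
For comparison, the connector-side behavior discussed in this thread (skip on a 400 response, retry on a 500 response unless the body mentions TikaException) could be sketched roughly as follows. This is not the actual HttpPoster code from CONNECTORS-200; the class, enum, and method names are hypothetical, and treating every 4xx status as "skip" is an assumption.

```java
// Hypothetical sketch of the decision logic, not the real ManifoldCF Solr connector.
final class IngestResponseClassifier {

    enum Action { ACCEPT, SKIP, RETRY }

    // Decides what to do with a document based on Solr's HTTP status and response body.
    static Action classify(int status, String body) {
        if (status >= 200 && status < 300) {
            return Action.ACCEPT;   // document was indexed
        }
        if (status >= 400 && status < 500) {
            return Action.SKIP;     // assumption: any 4xx means "skip this document"
        }
        if (status == 500 && body != null && body.contains("TikaException")) {
            return Action.SKIP;     // Tika could not parse it; retrying will not help
        }
        return Action.RETRY;        // unknown server-side failure, try again later
    }

    public static void main(String[] args) {
        String body = "org.apache.solr.common.SolrException: "
            + "org.apache.tika.exception.TikaException: TIKA-418";
        System.out.println(classify(500, body));
        System.out.println(classify(500, "some other failure"));
    }
}
```

With this approach the document is dropped entirely, metadata included, whereas the Solr-side flag lets the metadata through.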

Shinichiro Abe


On 2011/05/19, at 22:59, Erlend Garåsen wrote:

> 
> Sure, I can test it tomorrow, unfortunately not right now. I'm leaving my office in 20
> minutes, but I have plenty of time tomorrow.
> 
> Erlend
> 
> On 19.05.11 14.39, Karl Wright wrote:
>> I've also checked in the proposed change, if you care to try it.
>> We're having network issues here this morning so I can't seem to
>> update the ticket though.
>> 
>> Karl
>> 
>> On Thu, May 19, 2011 at 8:35 AM, Karl Wright<daddywri@gmail.com>  wrote:
>>> CONNECTORS-200 is the ticket.
>>> Karl
>>> 
>>> On Thu, May 19, 2011 at 8:04 AM, Karl Wright<daddywri@gmail.com>  wrote:
>>>> This should be enough.
>>>> 
>>>> I'll open a ticket.  The changes to the solr connector are trivial; I
>>>> can do them and check them in, if someone is willing to try it out for
>>>> real.
>>>> 
>>>> Karl
>>>> 
>>>> On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen<e.f.garasen@usit.uio.no> wrote:
>>>>> 
>>>>> Here's what I found in my simple history logs:
>>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
>>>>> getting content for thmx and xps file types
>>>>> 
>>>>> So, yes, Tika exceptions are stored in the MCF logs, so I guess it should be
>>>>> possible to find a workaround for this.
>>>>> 
>>>>> Erlend
>>>>> 
>>>>> On 19.05.11 12.00, Karl Wright wrote:
>>>>>> 
>>>>>> There was a Solr ticket created I believe by Shinichiro.
>>>>>> 
>>>>>> The question is whether the Solr 500 response has anything in its body
>>>>>> that could help ManifoldCF recognize a Tika exception.  If not there
>>>>>> is little the Solr connector can do to detect this case.  The problem
>>>>>> is that you need to look in the Simple History to see what the
>>>>>> response actually is, and I don't think Shinichiro did that.
>>>>>> 
>>>>>> Karl
>>>>>> 
>>>>>> On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen<e.f.garasen@usit.uio.no>
>>>>>>  wrote:
>>>>>>> 
>>>>>>> Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue?
>>>>>>> 
>>>>>>> I agree with Karl. We should look for a TikaException and then tell MCF
>>>>>>> to skip the affected documents. But maybe this should just be a temporary fix
>>>>>>> until it has been fixed in Solr Cell.
>>>>>>> 
>>>>>>> Exactly the same happens if Tika cannot parse a document which it does
>>>>>>> not support. Solr/Solr Cell returns a 500 server error, causing MCF to retry
>>>>>>> over and over again:
>>>>>>> [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract
>>>>>>> 
>>>>>>> params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx}
>>>>>>> status=500 QTime=5
>>>>>>> [2011-05-18 17:39:39.102] {} 0 4
>>>>>>> [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException:
>>>>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
>>>>>>> getting content for thmx and xps file types
>>>>>>> 
>>>>>>> And finally, the job just aborts:
>>>>>>> Exception tossed: Repeated service interruptions - failure processing
>>>>>>> document: Ingestion HTTP error code 500
>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated
>>>>>>> service interruptions - failure processing document: Ingestion HTTP error
>>>>>>> code 500
>>>>>>>        at
>>>>>>> 
>>>>>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
>>>>>>> Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>> Ingestion HTTP error code 500
>>>>>>>        at
>>>>>>> 
>>>>>>> org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
>>>>>>> 
>>>>>>> I guess I can find a workaround since I have created my own
>>>>>>> ExtractingRequestHandler in order to support language detection etc., but
>>>>>>> I think MCF should act differently when the underlying cause is a
>>>>>>> TikaException.
>>>>>>> 
>>>>>>> Erlend
>>>>>>> 
>>>>>>> 
>>>>>>> On 27.04.11 12.25, Karl Wright wrote:
>>>>>>>> 
>>>>>>>> If I recall, it treats the 400 response as meaning "this document
>>>>>>>> should be skipped", and it treats the 500 response as meaning "this
>>>>>>>> document should be retried because I have absolutely no idea what
>>>>>>>> happened".  However, we could modify the code for the 500 response to
>>>>>>>> look at the content of the response as well, and look for a string in
>>>>>>>> it that would give us a clue, such as "TikaException".  If we see a
>>>>>>>> TikaException, we could have it conclude "this document should be
>>>>>>>> skipped".  That was what I was thinking.
>>>>>>>> 
>>>>>>>> Karl
>>>>>>>> 
>>>>>>>> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe
>>>>>>>> <shinichiro.abe.1@gmail.com>      wrote:
>>>>>>>>> 
>>>>>>>>> Hi. Thank you for your reply.
>>>>>>>>> 
>>>>>>>>> It seems that Solr.ExtractingRequestHandler returns the same HTTP
>>>>>>>>> response (SERVER_ERROR (500)) whenever an error occurs.
>>>>>>>>> I'll try to open a ticket for Solr.
>>>>>>>>> 
>>>>>>>>> Is it correct that MCF retries crawling when it receives a 500-level
>>>>>>>>> response, but not when it receives a 400-level response?
>>>>>>>>> 
>>>>>>>>> Thank you.
>>>>>>>>> Shinichiro Abe
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>>>>>>>> 
>>>>>>>>>> So the 500 error is occurring because Solr is throwing an exception at
>>>>>>>>>> indexing time, is that correct?
>>>>>>>>>> 
>>>>>>>>>> If this is correct, then here's my take.  (1) A 500 error is a nasty
>>>>>>>>>> error that Solr should not be returning under normal conditions.  (2)
>>>>>>>>>> A password-protected PDF is not what I would consider exceptional, so
>>>>>>>>>> Tika should not be throwing an exception when it sees it, merely (at
>>>>>>>>>> worst) logging an error and continuing.  However, having said that,
>>>>>>>>>> output connectors in ManifoldCF can make the decision to never retry
>>>>>>>>>> the document, by returning a certain status, provided the connector
>>>>>>>>>> can figure out that the error warrants this treatment.
>>>>>>>>>> 
>>>>>>>>>> My suggestion is therefore the following.  First, we should open a
>>>>>>>>>> ticket for Solr about this.  Second, if you can see the error output
>>>>>>>>>> from the Simple History for a TikaException being thrown in Solr, we
>>>>>>>>>> can look for that text in the response from Solr and perhaps modify
>>>>>>>>>> the Solr Connector to detect the case.  If you could open a ManifoldCF
>>>>>>>>>> ticket and include that text I'd be very grateful.
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> Karl
>>>>>>>>>> 
>>>>>>>>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
>>>>>>>>>> <shinichiro.abe.1@gmail.com>      wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hello.
>>>>>>>>>>> 
>>>>>>>>>>> There are PDF and Office files that are protected by a read
>>>>>>>>>>> password.
>>>>>>>>>>> We do not need to read those files if we do not know their
>>>>>>>>>>> passwords.
>>>>>>>>>>> 
>>>>>>>>>>> Now, an MCF job crawls the filesystem repository and posts to
>>>>>>>>>>> Solr.
>>>>>>>>>>> Ingestion of non-protected files succeeds,
>>>>>>>>>>> but ingestion of a protected file does not succeed, and the job
>>>>>>>>>>> keeps processing it until the Retry Limit is exceeded.
>>>>>>>>>>> During that time, it logs a 500 result code in the Simple History.
>>>>>>>>>>> (Solr throws a TikaException caused by PDFBox or Apache POI
>>>>>>>>>>> because it cannot read protected documents.)
>>>>>>>>>>> 
>>>>>>>>>>> When I ran that test with continuous crawling rather than a
>>>>>>>>>>> single crawl,
>>>>>>>>>>> the job stopped partway through and logged the following:
>>>>>>>>>>> Error: Repeated service interruptions - failure processing document:
>>>>>>>>>>> Ingestion HTTP error code 500
>>>>>>>>>>> The job tried to crawl those files many times.
>>>>>>>>>>> 
>>>>>>>>>>> It seems that a job spends a lot of time and resources handling
>>>>>>>>>>> protected files.
>>>>>>>>>>> So I want to find a way to quickly skip reading those files.
>>>>>>>>>>> 
>>>>>>>>>>> In my survey:
>>>>>>>>>>> Hop filters are not relevant (right?).
>>>>>>>>>>> Tika, PDFBox, and POI each have a mechanism to decrypt protected
>>>>>>>>>>> files,
>>>>>>>>>>> but each throws its own exception when given an invalid
>>>>>>>>>>> password.
>>>>>>>>>>> One idea, whether or not it is feasible, is for Solr to return a
>>>>>>>>>>> different result code when protected files are posted.
>>>>>>>>>>> 
>>>>>>>>>>> Do you have any ideas?
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Shinichiro Abe
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Erlend Garåsen
>>>>>>> Center for Information Technology Services
>>>>>>> University of Oslo
>>>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>>>>>> 31050
>>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Erlend Garåsen
>>>>> Center for Information Technology Services
>>>>> University of Oslo
>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>> 
>>>> 
>>> 
> 
> 
> -- 
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050

