manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Treatment of protected files
Date Thu, 19 May 2011 12:35:52 GMT
CONNECTORS-200 is the ticket.
Karl

On Thu, May 19, 2011 at 8:04 AM, Karl Wright <daddywri@gmail.com> wrote:
> This should be enough.
>
> I'll open a ticket.  The changes to the solr connector are trivial; I
> can do them and check them in, if someone is willing to try it out for
> real.
>
> Karl
>
> On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>
>> Here's what I found in my simple history logs:
>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
>> getting content for thmx and xps file types
>>
>> So, yes, Tika exceptions are stored in the MCF logs, so I guess it should be
>> possible to find a workaround for this.
>>
>> Erlend
>>
>> On 19.05.11 12.00, Karl Wright wrote:
>>>
>>> There was a Solr ticket created I believe by Shinichiro.
>>>
>>> The question is whether the Solr 500 response has anything in its body
>>> that could help ManifoldCF recognize a Tika exception.  If not there
>>> is little the Solr connector can do to detect this case.  The problem
>>> is that you need to look in the Simple History to see what the
>>> response actually is, and I don't think Shinichiro did that.
>>>
>>> Karl
>>>
>>> On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>>>
>>>> Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue?
>>>>
>>>> I agree with Karl. We should look for a TikaException and then tell MCF to
>>>> skip the affected documents. But maybe this should just be a temporary fix
>>>> until it has been fixed in Solr Cell.
>>>>
>>>> Exactly the same happens if Tika cannot parse a document which it does not
>>>> support. Solr/Solr Cell returns a 500 server error, causing MCF to retry
>>>> over and over again:
>>>> [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx} status=500 QTime=5
>>>> [2011-05-18 17:39:39.102] {} 0 4
>>>> [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
>>>>
>>>> And finally, the job just aborts:
>>>> Exception tossed: Repeated service interruptions - failure processing
>>>> document: Ingestion HTTP error code 500
>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
>>>>        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
>>>> Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Ingestion HTTP error code 500
>>>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
>>>>
>>>> I guess I can find a workaround since I have created my own
>>>> ExtractingRequestHandler in order to support language detection etc., but
>>>> I think MCF should act differently when the underlying cause is a
>>>> TikaException.
>>>>
>>>> Erlend
>>>>
>>>>
>>>> On 27.04.11 12.25, Karl Wright wrote:
>>>>>
>>>>> If I recall, it treats the 400 response as meaning "this document
>>>>> should be skipped", and it treats the 500 response as meaning "this
>>>>> document should be retried because I have absolutely no idea what
>>>>> happened".  However, we could modify the code for the 500 response to
>>>>> look at the content of the response as well, and look for a string in
>>>>> it that would give us a clue, such as "TikaException".  If we see a
>>>>> TikaException, we could have it conclude "this document should be
>>>>> skipped".  That was what I was thinking.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe
>>>>> <shinichiro.abe.1@gmail.com>    wrote:
>>>>>>
>>>>>> Hi. Thank you for your reply.
>>>>>>
>>>>>> It seems that Solr's ExtractingRequestHandler returns the same HTTP
>>>>>> response (SERVER_ERROR (500)) whenever an error occurs.
>>>>>> I'll try to open a ticket for Solr.
>>>>>>
>>>>>> Is it correct that MCF retries crawling when it receives a
>>>>>> 500-level response, but not a 400-level response?
>>>>>>
>>>>>> Thank you.
>>>>>> Shinichiro Abe
>>>>>>
>>>>>>
>>>>>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>>>>>
>>>>>>> So the 500 error is occurring because Solr is throwing an exception at
>>>>>>> indexing time, is that correct?
>>>>>>>
>>>>>>> If this is correct, then here's my take.  (1) A 500 error is a nasty
>>>>>>> error that Solr should not be returning under normal conditions.  (2)
>>>>>>> A password-protected PDF is not what I would consider exceptional, so
>>>>>>> Tika should not be throwing an exception when it sees it, merely (at
>>>>>>> worst) logging an error and continuing.  However, having said that,
>>>>>>> output connectors in ManifoldCF can make the decision to never retry
>>>>>>> the document, by returning a certain status, provided the connector
>>>>>>> can figure out that the error warrants this treatment.
>>>>>>>
>>>>>>> My suggestion is therefore the following.  First, we should open a
>>>>>>> ticket for Solr about this.  Second, if you can see the error output
>>>>>>> from the Simple History for a TikaException being thrown in Solr, we
>>>>>>> can look for that text in the response from Solr and perhaps modify
>>>>>>> the Solr Connector to detect the case.  If you could open a ManifoldCF
>>>>>>> ticket and include that text I'd be very grateful.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
>>>>>>> <shinichiro.abe.1@gmail.com>    wrote:
>>>>>>>>
>>>>>>>> Hello.
>>>>>>>>
>>>>>>>> There are PDF and Office files that are protected by a read
>>>>>>>> password.
>>>>>>>> We do not need to read those files if we do not know their passwords.
>>>>>>>>
>>>>>>>> Now, an MCF job starts to crawl the filesystem repository and posts to
>>>>>>>> Solr.
>>>>>>>> Document ingestion of non-protected files is done successfully,
>>>>>>>> but a protected file is not ingested successfully, and the job keeps
>>>>>>>> retrying it until the Retry Limit is exceeded.
>>>>>>>> During that time, it logs a 500 result code in the Simple History.
>>>>>>>> (Solr throws a TikaException caused by PDFBox or Apache POI because
>>>>>>>> it cannot read protected documents.)
>>>>>>>>
>>>>>>>> When I ran that test with continuous crawling, rather than a single
>>>>>>>> crawl, the job stopped halfway and logged the following:
>>>>>>>> Error: Repeated service interruptions - failure processing document:
>>>>>>>> Ingestion HTTP error code 500
>>>>>>>> The job tried to crawl those files many times.
>>>>>>>>
>>>>>>>> It seems that a job spends a lot of time and resources on
>>>>>>>> protected files.
>>>>>>>> So I want to find a way to skip those files quickly.
>>>>>>>>
>>>>>>>> In my survey:
>>>>>>>> Hop filters are not relevant. (Right?)
>>>>>>>> Tika, PDFBox, and POI each have a mechanism to decrypt protected
>>>>>>>> files, but each throws a different exception when given an invalid
>>>>>>>> password.
>>>>>>>> One idea, setting aside whether it is feasible, is for Solr to return
>>>>>>>> a different result code when protected files are posted.
>>>>>>>> Do you have any ideas?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Shinichiro Abe
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Erlend Garåsen
>>>> Center for Information Technology Services
>>>> University of Oslo
>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>>> 31050
>>>>
>>
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
>
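
The approach discussed in this thread (skip on a 400, retry on a 500 unless the response body mentions a TikaException, in which case skip) could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the change actually made for CONNECTORS-200 and not the real HttpPoster code: the class IngestResponseClassifier, the Outcome enum, and the classify method are hypothetical names, and the sketch assumes the Solr 500 response body contains the exception class name, as the log excerpt above suggests.

```java
// Hypothetical sketch; none of these names come from ManifoldCF itself.
final class IngestResponseClassifier {

    /** Illustrative outcomes, not ManifoldCF's real status constants. */
    enum Outcome { ACCEPTED, SKIP, RETRY }

    // Marker text proposed in the thread for recognizing a Tika failure.
    private static final String TIKA_MARKER = "TikaException";

    private IngestResponseClassifier() {}

    /**
     * Decide what to do with a document based on Solr's HTTP response.
     *
     * @param httpStatus   status code returned by Solr
     * @param responseBody response body text, may be null
     */
    static Outcome classify(int httpStatus, String responseBody) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.ACCEPTED;
        }
        // Assumption, per Karl's recollection: a 400 means "skip this document".
        if (httpStatus == 400) {
            return Outcome.SKIP;
        }
        // A 500 whose body names a TikaException is treated as a permanent
        // failure for this document, so retrying would not help.
        if (httpStatus == 500
                && responseBody != null
                && responseBody.contains(TIKA_MARKER)) {
            return Outcome.SKIP;
        }
        // Anything else is unknown, so keep the existing retry behaviour.
        return Outcome.RETRY;
    }
}
```

With this sketch, a 500 carrying the TIKA-418 message from the log would be classified as SKIP, while a 500 with an unrelated or empty body would still be retried.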
