manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Diagnosing "REJECTED" documents in job history
Date Thu, 31 Jan 2013 23:09:31 GMT
I just chased down and fixed a problem in trunk.  ElasticSearch is now
returning a 201 code for successful indexing in some cases, and the
connector was not handling that as 'success'.

Karl
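
The 201 case described above can be illustrated with a small status check. This is only a sketch and not the change that went into trunk: the class and method names are hypothetical, and accepting the whole 2xx range (rather than just 200 and 201) is an assumption.

```java
// Hypothetical helper, not the connector's actual code.
public final class ResultCodeCheck {
    private ResultCodeCheck() {}

    /**
     * Treats any 2xx status as success, so 200 (OK) and 201 (Created)
     * both pass, while 4xx and 5xx responses are reported as failures.
     */
    public static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    public static void main(String[] args) {
        System.out.println("200 -> " + isSuccess(200));
        System.out.println("201 -> " + isSuccess(201));
        System.out.println("400 -> " + isSuccess(400));
    }
}
```

A range check like this avoids having to list each success code the server might start returning.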


On Thu, Jan 31, 2013 at 10:24 AM, Karl Wright <daddywri@gmail.com> wrote:
> Please let me know if you see any problems.  I'll fix anything you
> find as quickly as I can.
>
> Karl
>
> On Thu, Jan 31, 2013 at 10:19 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>> Great, thanks, I'll give it a try.
>>
>> On 30 January 2013 18:52, Karl Wright <daddywri@gmail.com> wrote:
>>> I just checked in a refactoring to trunk that should improve Elastic
>>> Search error reporting significantly.
>>>
>>> Karl
>>>
>>>
>>> On Wed, Jan 30, 2013 at 9:39 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>> I agree that the Elastic Search connector needs far better logging and
>>>> error handling.  CONNECTORS-629.
>>>>
>>>> Karl
>>>>
>>>> On Wed, Jan 30, 2013 at 9:27 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>> Nailed it with the help of wireshark! Turns out it was my fault -- I
>>>>> had set it up to use (i.e. create) an index called DocumentumRoW but
>>>>> it turns out ES index names must be all lowercase.
>>>>>
>>>>> Never knew that before.
>>>>>
>>>>> Slightly annoyed that ES didn't log that...
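
A configuration-time check could catch the lowercase problem before any document is posted. The sketch below covers only the lowercase rule mentioned in this message; the class and method names are hypothetical, and any other index naming rules are not checked here.

```java
import java.util.Locale;

// Hypothetical helper, not part of ManifoldCF or ElasticSearch.
public final class IndexNameCheck {
    private IndexNameCheck() {}

    /** Returns true if the name contains no uppercase characters. */
    public static boolean isAllLowercase(String indexName) {
        // Locale.ROOT keeps the comparison independent of the JVM's default locale.
        return indexName.equals(indexName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAllLowercase("DocumentumRoW"));
        System.out.println(isAllLowercase("documentumrow"));
    }
}
```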
>>>>>
>>>>> Thanks again for your help Karl :-)
>>>>>
>>>>> My only request on the MCF front would be that it would be nice for
>>>>> the output connector to log the actual status code and content of a
>>>>> non-successful HTTP response.
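
The request above amounts to putting the status code and response body into the message that gets logged. A minimal sketch of that idea follows; the class and method names are hypothetical, and it says nothing about how the connector actually reads the response.

```java
// Hypothetical helper, not the connector's actual code.
public final class ResponseDescription {
    private ResponseDescription() {}

    /**
     * Builds a log-friendly description of a failed HTTP response,
     * including the status code and whatever body the server returned.
     */
    public static String describeFailure(int statusCode, String body) {
        String content = (body == null || body.isEmpty()) ? "<empty body>" : body;
        return "HTTP " + statusCode + ": " + content;
    }

    public static void main(String[] args) {
        // The body text here is made up purely for illustration.
        System.out.println(describeFailure(400, "example error text"));
        System.out.println(describeFailure(500, null));
    }
}
```

A description built this way could be passed as the exception message, so the log shows more than a bare stack trace.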
>>>>>
>>>>>
>>>>> On 30 January 2013 14:21, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>> That information isn't being recorded in manifoldcf.log unfortunately
>>>>>> -- I included all that was there. And there are no exceptions in
>>>>>> elasticsearch.log either...
>>>>>>
>>>>>> I'll try running wireshark to see if I can follow the TCP stream.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 30 January 2013 14:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>> Ok, ElasticSearch is not happy about something when the document is
>>>>>>> being posted.  The connector is seeing a non-200 HTTP response, and
>>>>>>> throwing an exception as a result:
>>>>>>>
>>>>>>>       if (!checkResultCode(method.getStatusCode()))
>>>>>>>         throw new ManifoldCFException(getResultDescription());
>>>>>>>
>>>>>>> Presumably the exception message in the log tells us what that HTTP
>>>>>>> code is, but you did not include that key info.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Jan 30, 2013 at 9:06 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>> Thanks for all your help Karl!
>>>>>>>>
>>>>>>>> It's 1.0.1 from the binary distro.
>>>>>>>>
>>>>>>>> And yes, it says "Connection working" when I view it.
>>>>>>>>
>>>>>>>> On 30 January 2013 14:03, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>> Ok, so let's back up a bit.
>>>>>>>>>
>>>>>>>>> First, which version of ManifoldCF is this?  I need to know that
>>>>>>>>> before I can interpret the stack trace.
>>>>>>>>>
>>>>>>>>> Second, what do you see when you view the connection in the crawler
>>>>>>>>> UI?  Does it say "Connection working", or something else, and if so,
>>>>>>>>> what?
>>>>>>>>>
>>>>>>>>> I've created a ticket for better error reporting in this connector -
>>>>>>>>> it was a contribution and AFAIK the error handling is not very robust
>>>>>>>>> at this point, but I can fix that quickly with your help. ;-)
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Jan 30, 2013 at 8:55 AM, Andrew Clegg <andrew.clegg@gmail.com> wrote:
>>>>>>>>>> On 30 January 2013 13:33, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> So you saw events in the history which correspond to these documents
>>>>>>>>>>> and which are of type "Indexation" that say "success"?  If that is the
>>>>>>>>>>> case, then the ElasticSearch connector thinks it handed the documents
>>>>>>>>>>> successfully to the ElasticSearch server.
>>>>>>>>>>
>>>>>>>>>> Ah, no, the activity is fetch rather than indexation. e.g.
>>>>>>>>>>
>>>>>>>>>> 01-30-2013 13:08:16.217 fetch 09026205800698a9 Success 549541 361
>>>>>>>>>>
>>>>>>>>>> I don't see any history entries relating to indexing as a specific
>>>>>>>>>> activity in its own right. Sorry, that was probably a red herring, I
>>>>>>>>>> don't think it's getting that far.
>>>>>>>>>>
>>>>>>>>>> I just noticed that above all the "service interruption reported"
>>>>>>>>>> warnings are some errors like this:
>>>>>>>>>>
>>>>>>>>>> ERROR 2013-01-30 13:44:15,356 (Worker thread '45') - Exception tossed:
>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection.call(ElasticSearchConnection.java:97)
>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex.<init>(ElasticSearchIndex.java:138)
>>>>>>>>>>         at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnector.addOrReplaceDocument(ElasticSearchConnector.java:322)
>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
>>>>>>>>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.DCTM.DCTM.processDocuments(DCTM.java:1820)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>>>>>>>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>>>>>>>>>
>>>>>>>>>> Sadly there's no description, just a stacktrace.
>>>>>>>>>>
>>>>>>>>>> I know the ES server is visible from the MCF server -- actually
>>>>>>>>>> they're the same machine, and it's configured to use
>>>>>>>>>> http://127.0.0.1:9200/ as the server URL. And I can go to the command
>>>>>>>>>> line on that server and curl that URL successfully.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
