manifoldcf-user mailing list archives

From Rafa Haro <rh...@apache.org>
Subject Re: File system continuous crawl settings
Date Sat, 09 May 2015 08:56:21 GMT
Hi Karl and Tim,

Karl, you were too fast and didn't give me time to look at the issue
after confirming that it was a connector issue. Thanks for addressing it
anyway. I will take a look at your changes, but the parameters make more
sense per job than at connection configuration, because they customize the
pool of HTTP connections to the SearchBlox server. This could be redundant
with the ManifoldCF thread management, but the idea was for the threads to
use that pool rather than create a separate connection resource per
thread.
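
The pooling idea described above can be sketched in plain Java. This is a toy illustration, not the actual SearchBlox connector code: `ConnectionPoolSketch` and its String "connections" are hypothetical stand-ins for real pooled HTTP connections, but the structure shows how N worker threads can share a bounded pool instead of each opening its own connection.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy sketch of a bounded connection pool shared by worker threads.
// Strings stand in for real HTTP connection handles.
public class ConnectionPoolSketch {
  private final BlockingQueue<String> pool;

  public ConnectionPoolSketch(int maxConnections) {
    pool = new ArrayBlockingQueue<>(maxConnections);
    for (int i = 0; i < maxConnections; i++) {
      pool.add("connection-" + i); // pre-create the shared handles once
    }
  }

  // Blocks when all connections are in use, so the pool size, not the
  // thread count, bounds the load on the target server.
  public String acquire() {
    try {
      return pool.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }

  public void release(String connection) {
    pool.add(connection); // return the handle for reuse by other threads
  }
}
```

With this shape, the ManifoldCF worker threads would contend on `acquire()` rather than each holding a private connection, which is the redundancy with the framework's own thread management mentioned above.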

As we have observed before, we found it challenging to create shared
resources for the whole job in the getSession method, and tried to work
around it with class member variables used as flags.
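
The getSession pattern referred to above can be sketched as follows. This is a hypothetical illustration, not the actual connector code: `SessionSketch` and its `Object` session are stand-ins, showing only the lazy-initialization-with-flag shape, where the shared resource is created once on first use and reused thereafter.

```java
// Hypothetical sketch of lazy, one-time session initialization guarded
// by a member variable, as discussed above.
public class SessionSketch {
  private Object session = null; // stand-in for an expensive shared resource
  private int initCount = 0;     // tracks how many times we initialized

  // Creates the shared resource on first call only; every later call
  // returns the same instance.
  public synchronized Object getSession() {
    if (session == null) {
      session = new Object();
      initCount++;
    }
    return session;
  }

  public synchronized int getInitCount() {
    return initCount;
  }
}
```

The difficulty mentioned above is that such member-variable flags make the resource per connector instance rather than genuinely per job, which is why moving the pooling parameters to the job level was under discussion.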

Where exactly was the problem with the session management?

Cheers,
Rafa

On Saturday, May 9, 2015, Karl Wright <daddywri@gmail.com> wrote:

> Hi Timo,
>
> I've taken a deep look at the SearchBlox code and found a significant
> problem.  I've created a patch for you to address it, although it is not
> the final fix.  The patch should work on either 2.1 or 1.9.  See
> CONNECTORS-1198 for complete details.
>
> Please let me know ASAP if the patch does not solve your immediate
> problem, since I will be making other changes to the connector to bring it
> in line with ManifoldCF standards.
>
> Karl
>
>
>
> On Fri, May 8, 2015 at 8:01 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> That error is what I was afraid of.
>>
>> We need the complete exception trace.  Can you find that and create a
>> ticket, including the complete trace?
>>
>> My apologies; the SearchBlox connector is a contribution which obviously
>> still has bugs. With the trace, though, I should be able to get you a patch.
>>
>> Karl
>>
>> Sent from my Windows Phone
>> ------------------------------
>> From: Timo Selvaraj
>> Sent: 5/8/2015 6:46 PM
>> To: Karl Wright
>> Cc: user@manifoldcf.apache.org
>>
>> Subject: Re: File system continuous crawl settings
>>
>> Hi Karl,
>>
>> The only error message which seems to be continuously thrown in manifold
>> log is :
>>
>> FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
>> java.lang.NullPointerException
>>
>> I do notice that the file that needs to be deleted is shown under the Queue
>> Status report and keeps jumping between “Processing” and “About to Process”
>> statuses every 30 seconds.
>>
>> Timo
>>
>>
>> On May 8, 2015, at 1:40 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Hi Timo,
>>
>> As I said, I don't think your configuration is the source of the delete
>> issue. I suspect the searchblox connector.
>>
>> In the absence of a thread dump, can you look for exceptions in the
>> manifoldcf log?
>>
>> Karl
>>
>> Sent from my Windows Phone
>> ------------------------------
>> From: Timo Selvaraj
>> Sent: 5/8/2015 10:06 AM
>> To: user@manifoldcf.apache.org
>> Subject: Re: File system continuous crawl settings
>>
>> When I change the settings to the following, updated or modified documents
>> are now indexed, but deleting documents that have been removed is still an
>> issue:
>>
>> Schedule type: Rescan documents dynamically
>> Minimum recrawl interval: 5 minutes
>> Maximum recrawl interval: 10 minutes
>> Expiration interval: Infinity
>> Reseed interval: 60 minutes
>> No scheduled run times
>> Maximum hop count for link type 'child': Unlimited
>> Hop count mode: Delete unreachable documents
>>
>> Do I need to set the reseed interval to Infinity?
>>
>> Any thoughts?
>>
>>
>> On May 8, 2015, at 6:18 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> I just tried your configuration here.  A deleted document in the file
>> system was indeed picked up as expected.
>>
>> I did notice that your "expiration" setting is, essentially, cleaning out
>> documents at a rapid clip.  With this setting, documents will be expired
>> before they are recrawled.  You probably want one strategy or the other but
>> not both.
>>
>> As for why a deleted document is "stuck" in Processing: the only thing I
>> can think of is that the output connection you've chosen is having trouble
>> deleting the document from the index.  What output connector are you using?
>>
>> Karl
>>
>>
>> On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <timo.selvaraj@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We are testing the continuous crawl feature of the file system connector on
>>> a small folder, to verify that new documents added to the folder, documents
>>> removed from it, and documents that have been modified are all handled by
>>> the continuous crawl job:
>>>
>>> Here are the settings we use:
>>>
>>> Schedule type: Rescan documents dynamically
>>> Minimum recrawl interval: 5 minutes
>>> Maximum recrawl interval: 10 minutes
>>> Expiration interval: 5 minutes
>>> Reseed interval: 10 minutes
>>> No scheduled run times
>>> Maximum hop count for link type 'child': Unlimited
>>> Hop count mode: Delete unreachable documents
>>>
>>> Adding new documents seems to be getting picked up by the job; however,
>>> removal of a document or an update to a document is not being picked up.
>>>
>>> Am I missing any settings for the deletions or updates? I do see that the
>>> document that has been removed is showing as Processing under Queue Status,
>>> while the others are showing as Waiting for Processing.
>>>
>>> Any idea what setting is missing for the deletes/updates to be
>>> recognized and re-indexed?
>>>
>>> Thanks,
>>> Timo
>>>
>>
>>
>>
>>
>
