manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: File system continuous crawl settings
Date Sat, 09 May 2015 08:23:00 GMT
Hi Timo,

I've taken a deep look at the SearchBlox code and found a significant
problem.  I've created a patch for you to address it, although it is not
the final fix.  The patch should work on either 2.1 or 1.9.  See
CONNECTORS-1198 for complete details.

Please let me know ASAP if the patch does not solve your immediate problem,
since I will be making other changes to the connector to bring it in line
with ManifoldCF standards.

Karl



On Fri, May 8, 2015 at 8:01 PM, Karl Wright <daddywri@gmail.com> wrote:

> That error is what I was afraid of.
>
> We need the complete exception trace.  Can you find that and create a
> ticket, including the complete trace?
>
> My apologies; the SearchBlox connector is a contribution which obviously
> still has bugs.  With the trace, though, I should be able to get you a patch.
>
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: Timo Selvaraj
> Sent: 5/8/2015 6:46 PM
> To: Karl Wright
> Cc: user@manifoldcf.apache.org
>
> Subject: Re: File system continuous crawl settings
>
> Hi Karl,
>
> The only error message that seems to be thrown continuously in the
> ManifoldCF log is:
>
> FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
> java.lang.NullPointerException
>
> I do notice that the file that needs to be deleted is shown under the Queue
> Status report and keeps jumping between “Processing” and “About to Process”
> statuses every 30 seconds.
>
> Timo
>
>
> On May 8, 2015, at 1:40 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> Hi Timo,
>
> As I said, I don't think your configuration is the source of the delete
> issue. I suspect the searchblox connector.
>
> In the absence of a thread dump, can you look for exceptions in the
> manifoldcf log?
>
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: Timo Selvaraj
> Sent: 5/8/2015 10:06 AM
> To: user@manifoldcf.apache.org
> Subject: Re: File system continuous crawl settings
>
> When I change the settings to the following, updated or modified documents
> are now indexed but deleting the documents that are removed is still an
> issue:
>
> Schedule type: Rescan documents dynamically
> Minimum recrawl interval: 5 minutes
> Maximum recrawl interval: 10 minutes
> Expiration interval: Infinity
> Reseed interval: 60 minutes
> No scheduled run times
> Maximum hop count for link type 'child': Unlimited
> Hop count mode: Delete unreachable documents
>
> Do I need to set the reseed interval to Infinity?
>
> Any thoughts?
>
>
> On May 8, 2015, at 6:18 AM, Karl Wright <daddywri@gmail.com> wrote:
>
> I just tried your configuration here.  A deleted document in the file
> system was indeed picked up as expected.
>
> I did notice that your "expiration" setting is, essentially, cleaning out
> documents at a rapid clip.  With this setting, documents will be expired
> before they are recrawled.  You probably want one strategy or the other but
> not both.
>
> As for why a deleted document is "stuck" in Processing: the only thing I
> can think of is that the output connection you've chosen is having trouble
> deleting the document from the index.  What output connector are you using?
>
> Karl
>
>
> On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <timo.selvaraj@gmail.com>
> wrote:
>
>> Hi,
>>
>> We are testing the continuous crawl feature of the file system connector on
>> a small folder, to check whether the continuous crawl job picks up new
>> documents added to the folder, removes missing documents, and updates
>> modified documents.
>>
>> Here are the settings we use:
>>
>> Schedule type: Rescan documents dynamically
>> Minimum recrawl interval: 5 minutes
>> Maximum recrawl interval: 10 minutes
>> Expiration interval: 5 minutes
>> Reseed interval: 10 minutes
>> No scheduled run times
>> Maximum hop count for link type 'child': Unlimited
>> Hop count mode: Delete unreachable documents
>>
>> New documents seem to be picked up by the job; however, the removal of a
>> document or an update to a document is not being picked up.
>>
>> Am I missing any settings for the deletions or updates? I do see the
>> document that has been removed is showing as Processing under Queue Status
>> and others are showing as Waiting for Processing.
>>
>> Any idea what setting is missing for the deletes/updates to be recognized
>> and re-indexed?
>>
>> Thanks,
>> Timo
>>
>
>
>
>
