No problem.

I checked in what I think is the right fix this morning.  Hope it looks ok to you?

Karl


On Sun, May 10, 2015 at 6:52 AM, Rafa Haro <rharo@apache.org> wrote:
Hi Karl,

I was not meaning to request anything, just dumping some thoughts. I can take care of the rest. 

Thanks!!! 


El sábado, 9 de mayo de 2015, Karl Wright <daddywri@gmail.com> escribió:
Hi Rafa,

Two points.

First, Alessandro's case is arguably unsolvable by any mechanism that doesn't involve a sidecar process, because the credential tokens must be continually refreshed whether or not a job is running, or mcf is even up.  I don't know how to solve that within the canon of mcf.
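(For illustration only -- a rough sketch of what such a sidecar refresher might look like, entirely outside mcf; refreshCredentialToken() and the token file path are hypothetical, not part of ManifoldCF:

    import java.nio.file.*;
    import java.util.concurrent.*;

    public class TokenRefresherSidecar {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            Path tokenFile = Paths.get("/var/run/crawler/credential.token"); // hypothetical location the connector would read
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    String token = refreshCredentialToken();       // hypothetical call to the auth provider
                    Files.write(tokenFile, token.getBytes());      // keep the token fresh whether or not a job is running
                } catch (Exception e) {
                    e.printStackTrace();                           // log and try again on the next tick
                }
            }, 0, 30, TimeUnit.MINUTES);
        }

        private static String refreshCredentialToken() {
            // Placeholder: call whatever identity provider issues the short-lived token.
            return "new-token";
        }
    }

Because it runs as its own process, the refresh keeps happening regardless of mcf's state.)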

Second, connection management is really quite central to mcf.  Independent connection instances are the only way you can hope to do connection throttling across a cluster, for instance.  Given that, you'd have to have a pretty compelling case to request a rearchitecture, no?

So -- are you going to finish the work for CONNECTORS-1198, or should I?

Karl

Sent from my Windows Phone

From: Rafa Haro
Sent: 5/9/2015 6:35 AM
To: user@manifoldcf.apache.org
Cc: Rafa Haro; Timo Selvaraj
Subject: Re: File system continuous crawl settings

Hi Karl, 

I understand. The thing is, as Alessandro also pointed out some days ago, it is not unusual to find situations where you might want to initialize resources only once per job execution. That seems to be impossible right now with the current architecture, but it also seems to make a lot of sense to have that possibility.

Should we consider including that functionality? Some initializations can be expensive, and it is not always possible to use a singleton.
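(For context, the singleton workaround being referred to is the usual JVM-wide lazy holder along these lines; ExpensiveResource is a hypothetical placeholder for whatever needs to be built once per JVM rather than once per job, and as noted above this is not always applicable:

    public final class SharedResourceHolder {
        private static volatile ExpensiveResource instance;

        private SharedResourceHolder() {}

        public static ExpensiveResource get() {
            if (instance == null) {                          // first check without locking
                synchronized (SharedResourceHolder.class) {
                    if (instance == null) {                  // second check under the lock
                        instance = new ExpensiveResource();
                    }
                }
            }
            return instance;
        }

        // Hypothetical expensive resource; in practice this might be a model, cache, or client.
        public static class ExpensiveResource {
            public ExpensiveResource() {
                // expensive one-time initialization goes here
            }
        }
    }

Note that this is per JVM, not per job execution, which is exactly the gap being discussed.)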

Thanks Karl!

El sábado, 9 de mayo de 2015, Karl Wright <daddywri@gmail.com> escribió:
Hi Rafa,

The problem was twofold. 

As stated before, the manifoldcf model for managing connections is that connection instances operate independently of each other.  If what is required to set up the connection depends on the job, it defeats the whole manifoldcf pooling management strategy, since connections are swapped between jobs completely outside the control of the connector writer.  So trying to be clever here buys you little.

The actual failure also involved the use of uninitialized variables.

In other connectors where pooling can be defined at levels other than just in mcf, the standard is to use a hardwired pool size of 1 for those cases.  See the Jira connector, for example.  For SearchBlox, the only parameters other than pool size that you set this way are socket and connection timeout.  In every other connector we have, these are connection parameters, not specification parameters.  I don't see any reason SearchBlox should be different.

Karl
Sent from my Windows Phone

From: Rafa Haro
Sent: 5/9/2015 4:56 AM
To: user@manifoldcf.apache.org
Cc: Timo Selvaraj
Subject: Re: File system continuous crawl settings

Hi Karl and Tim,

Karl, you were too fast and didn't give me time to take a look at the issue after confirming that it was a connector issue. Thanks for addressing it anyway. I will take a look at your changes, but the parameters make more sense per job, not at the connection configuration level, because they customize the pool of HTTP connections to the SearchBlox server. This could be redundant with the ManifoldCF thread management, but the idea was for the threads to use that pool rather than creating a single connection resource per thread.
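(To illustrate the kind of pooled HTTP client setup being described -- a rough sketch using Apache HttpClient 4.x; the class name, parameter values, and pool sizing here are illustrative, not the connector's actual code:

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    public class SearchBloxClientFactory {
        public static CloseableHttpClient create(int poolSize, int connectionTimeoutMs, int socketTimeoutMs) {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(poolSize);               // total connections shared by all worker threads
            cm.setDefaultMaxPerRoute(poolSize);     // all requests go to the single SearchBlox route

            RequestConfig rc = RequestConfig.custom()
                .setConnectTimeout(connectionTimeoutMs)   // the "connection timeout" parameter
                .setSocketTimeout(socketTimeoutMs)        // the "socket timeout" parameter
                .build();

            return HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultRequestConfig(rc)
                .build();
        }
    }

The intent would be that many worker threads share one such client and its pool, rather than each thread building its own connection.)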

As we have observed before, we found it challenging to create shared resources for the whole job in the getSession method, and tried to trick it with class member variables as flags.
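(A rough sketch of the member-variable-as-flag trick being described; the class, field, and method names are hypothetical and do not come from the actual SearchBlox connector:

    public class ExampleConnector {
        private Object sharedResource = null;   // member used as the "already initialized" flag

        protected void getSession() throws Exception {
            if (sharedResource == null) {       // only build the shared resource the first time
                sharedResource = createExpensiveSharedResource();
            }
        }

        private Object createExpensiveSharedResource() {
            // In the real connector this would build the HTTP pool or other shared state.
            return new Object();
        }
    }

Because connector instances are pooled and swapped between jobs, the flag lives per connector instance rather than per job, which is why this doesn't give true once-per-job initialization.)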

Where exactly was the problem with the session management?

Cheers,
Rafa

El sábado, 9 de mayo de 2015, Karl Wright <daddywri@gmail.com> escribió:
Hi Timo,

I've taken a deep look at the SearchBlox code and found a significant problem.  I've created a patch for you to address it, although it is not the final fix.  The patch should work on either 2.1 or 1.9.  See CONNECTORS-1198 for complete details.

Please let me know ASAP if the patch does not solve your immediate problem, since I will be making other changes to the connector to bring it in line with ManifoldCF standards.

Karl



On Fri, May 8, 2015 at 8:01 PM, Karl Wright <daddywri@gmail.com> wrote:
That error is what I was afraid of.

We need the complete exception trace.  Can you find that and create a ticket, including the complete trace?

My apologies; the searchblox connector is a contribution which obviously still has bugs.  With the trace though I should be able to get you a patch.

Karl

Sent from my Windows Phone

From: Timo Selvaraj
Sent: 5/8/2015 6:46 PM
To: Karl Wright
Cc: user@manifoldcf.apache.org

Subject: Re: File system continuous crawl settings

Hi Karl,

The only error message which seems to be continuously thrown in manifold log is :

FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
java.lang.NullPointerException

I do notice that the file that needs to be deleted is shown under the Queue Status report and keeps jumping between “Processing” and “About to Process” statuses every 30 seconds.

Timo


On May 8, 2015, at 1:40 PM, Karl Wright <daddywri@gmail.com> wrote:

Hi Timo,

As I said, I don't think your configuration is the source of the delete issue. I suspect the searchblox connector.

In the absence of a thread dump, can you look for exceptions in the manifoldcf log?

Karl

Sent from my Windows Phone

From: Timo Selvaraj
Sent: 5/8/2015 10:06 AM
To: user@manifoldcf.apache.org
Subject: Re: File system continuous crawl settings

When I change the settings to the following, updated or modified documents are now indexed, but documents that have been removed are still not being deleted:

Schedule type: Rescan documents dynamically
Minimum recrawl interval: 5 minutes
Maximum recrawl interval: 10 minutes
Expiration interval: Infinity
Reseed interval: 60 minutes
No scheduled run times
Maximum hop count for link type 'child': Unlimited
Hop count mode: Delete unreachable documents

Do I need to set the reseed interval to Infinity?

Any thoughts?


On May 8, 2015, at 6:18 AM, Karl Wright <daddywri@gmail.com> wrote:

I just tried your configuration here.  A deleted document in the file system was indeed picked up as expected.

I did notice that your "expiration" setting is, essentially, cleaning out documents at a rapid clip.  With this setting, documents will be expired before they are recrawled.  You probably want one strategy or the other but not both.

As for why a deleted document is "stuck" in Processing: the only thing I can think of is that the output connection you've chosen is having trouble deleting the document from the index.  What output connector are you using?

Karl


On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <timo.selvaraj@gmail.com> wrote:
Hi,

We are testing the continuous crawl feature of the file system connector on a small folder, to verify that new documents added to the folder, documents removed from it, and documents modified in it are all handled by the continuous crawl job.

Here are the settings we use:

Schedule type: Rescan documents dynamically
Minimum recrawl interval: 5 minutes
Maximum recrawl interval: 10 minutes
Expiration interval: 5 minutes
Reseed interval: 10 minutes
No scheduled run times
Maximum hop count for link type 'child': Unlimited
Hop count mode: Delete unreachable documents


Adding new documents seems to be getting picked up by the job; however, removal of a document or an update to a document is not being picked up.

Am I missing any settings for the deletions or updates? I do see that the document that has been removed is showing as Processing under Queue Status, and the others are showing as Waiting for Processing.

Any idea what setting is missing for the deletes/updates to be recognized and re-indexed?

Thanks,
Timo