manifoldcf-user mailing list archives

From Rafa Haro <rh...@apache.org>
Subject Re: File system continuous crawl settings
Date Sun, 10 May 2015 10:52:29 GMT
Hi Karl,

I didn't mean to request anything, just dumping some thoughts. I can
take care of the rest.

Thanks!!!

On Saturday, May 9, 2015, Karl Wright <daddywri@gmail.com> wrote:

> Hi Rafa,
>
> Two points.
>
> First, Alessandro's case is arguably unsolvable by any mechanism that
> doesn't involve a sidecar process, because the credential tokens must be
> continually refreshed whether or not a job is running or mcf is even up.  I
> don't know how to solve that within the canon of mcf.
>
> Second, connection management is really quite central to mcf.  Independent
> connection instances are the only way you can hope to do connection
> throttling across a cluster, for instance.  Given that, you'd have to have a
> pretty compelling case to request a rearchitecture, no?
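>
> (As a generic illustration of that point, and definitely not ManifoldCF
> code: if every connection instance draws permits from one shared throttle,
> the crawler can bound the total number of connections no matter which job
> each instance happens to be serving at the moment. All names below are
> made up for the sketch.)
>
>     import java.util.concurrent.Semaphore;
>
>     // Generic sketch: one process-wide throttle shared by all
>     // connection instances, independent of any particular job.
>     public class SharedThrottle {
>       private final Semaphore permits;
>
>       public SharedThrottle(int maxConcurrentConnections) {
>         permits = new Semaphore(maxConcurrentConnections);
>       }
>
>       // Each connection instance acquires a permit before opening a
>       // connection and releases it after closing the connection.
>       public void acquire() throws InterruptedException { permits.acquire(); }
>       public void release() { permits.release(); }
>     }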
>
> So -- are you going to finish the work for CONNECTORS-1198, or should I?
>
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: Rafa Haro
> Sent: 5/9/2015 6:35 AM
> To: user@manifoldcf.apache.org
> Cc: Rafa Haro; Timo Selvaraj
> Subject: Re: File system continuous crawl settings
>
> Hi Karl,
>
> I understand. The thing is, as Alessandro also pointed out some days ago,
> it is not unusual to find situations where you might want to initialize
> resources only once per job execution. That seems to be impossible with the
> current architecture, but it also seems to make a lot of sense to have that
> possibility.
>
> Should we consider including that functionality? Some initializations
> can be expensive, and it is not always possible to use a singleton.
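>
> A minimal sketch of the kind of workaround I mean, assuming nothing about
> any particular connector (all names below are illustrative): when there is
> no once-per-job hook, one option is a process-wide cache keyed by whatever
> configuration makes the resource expensive to build.
>
>     import java.util.Map;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     // Hypothetical sketch: memoize the expensive resource per
>     // configuration key instead of rebuilding it per thread or per job.
>     public class ResourceCache {
>       private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();
>
>       public static Object get(String configKey) {
>         // computeIfAbsent runs the builder at most once per key.
>         return CACHE.computeIfAbsent(configKey, k -> buildExpensiveResource(k));
>       }
>
>       private static Object buildExpensiveResource(String key) {
>         return new Object(); // stand-in for the expensive initialization
>       }
>     }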
>
> Thanks Karl!
>
> On Saturday, May 9, 2015, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Rafa,
>>
>> The problem was twofold.
>>
>> As stated before, the manifoldcf model for managing connections is that
>> connection instances operate independently of each other.  If what is
>> required to set up the connection depends on the job, it defeats the whole
>> manifoldcf pooling management strategy, since connections are swapped
>> between jobs completely outside the control of the connector writer.  So
>> trying to be clever here buys you little.
>>
>> The actual failure also involved the use of uninitialized variables.
>>
>> In other connectors where pooling can be defined at levels other than
>> just in mcf, the standard is to use a hardwired pool size of 1 for those
>> cases.  See the jira connector, for example.  For searchblox, the only
>> parameters other than pool size that you set this way are socket and
>> connection timeout.  In every other connector we have, these are connection
>> parameters, not specification parameters.  I don't see any reason searchblox
>> should be different.
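>>
>> Roughly, and only as a sketch of the idea with Apache HttpClient (this is
>> not the connector's actual code), a hardwired pool of 1 with
>> connection-level timeouts might look like:
>>
>>     import org.apache.http.client.config.RequestConfig;
>>     import org.apache.http.impl.client.CloseableHttpClient;
>>     import org.apache.http.impl.client.HttpClients;
>>     import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
>>
>>     public class PoolSketch {
>>       // ManifoldCF manages parallelism itself, so each connection
>>       // instance holds at most one HTTP connection.
>>       public static CloseableHttpClient buildClient(int connectTimeoutMs,
>>           int socketTimeoutMs) {
>>         PoolingHttpClientConnectionManager cm =
>>             new PoolingHttpClientConnectionManager();
>>         cm.setMaxTotal(1);
>>         cm.setDefaultMaxPerRoute(1);
>>         RequestConfig rc = RequestConfig.custom()
>>             .setConnectTimeout(connectTimeoutMs) // connection parameter
>>             .setSocketTimeout(socketTimeoutMs)   // connection parameter
>>             .build();
>>         return HttpClients.custom()
>>             .setConnectionManager(cm)
>>             .setDefaultRequestConfig(rc)
>>             .build();
>>       }
>>     }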
>>
>> Karl
>> Sent from my Windows Phone
>> ------------------------------
>> From: Rafa Haro
>> Sent: 5/9/2015 4:56 AM
>> To: user@manifoldcf.apache.org
>> Cc: Timo Selvaraj
>> Subject: Re: File system continuous crawl settings
>>
>> Hi Karl and Tim,
>>
>> Karl, you were too fast and didn't give me time to take a look at the
>> issue after confirming that it was a connector issue. Thanks for
>> addressing it anyway. I will take a look at your changes, but the job
>> parameters make more sense per job, not in the connection configuration,
>> because they customize the pool of HTTP connections to the SearchBlox
>> server. This could be redundant with the manifold thread management, but
>> the idea was for the threads to use that pool rather than creating a
>> separate connection resource per thread.
>>
>> As we have observed before, we found it challenging to create shared
>> resources for the whole job in the getSession method, and tried to work
>> around it with class member variables as flags.
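>>
>> The idiom I mean is roughly the following (names are illustrative, not the
>> actual SearchBlox code): a lazily initialized member guarded by a null
>> check, created once per connection instance in getSession and torn down in
>> disconnect.
>>
>>     // Illustrative sketch of the common connector idiom, not real code.
>>     public class ConnectorSketch {
>>       private Object session = null; // connection-level resource
>>
>>       // Lazily create the expensive resource the first time any method
>>       // needs it. The guard is per connection instance, not per job,
>>       // since instances are pooled and swapped across jobs.
>>       protected synchronized Object getSession() {
>>         if (session == null) {
>>           session = new Object(); // stand-in for the expensive setup
>>         }
>>         return session;
>>       }
>>
>>       public synchronized void disconnect() {
>>         session = null; // release the resource with the connection
>>       }
>>     }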
>>
>> Where exactly was the problem with the session management?
>>
>> Cheers,
>> Rafa
>>
>> On Saturday, May 9, 2015, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Timo,
>>>
>>> I've taken a deep look at the SearchBlox code and found a significant
>>> problem.  I've created a patch for you to address it, although it is not
>>> the final fix.  The patch should work on either 2.1 or 1.9.  See
>>> CONNECTORS-1198 for complete details.
>>>
>>> Please let me know ASAP if the patch does not solve your immediate
>>> problem, since I will be making other changes to the connector to bring it
>>> in line with ManifoldCF standards.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Fri, May 8, 2015 at 8:01 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> That error is what I was afraid of.
>>>>
>>>> We need the complete exception trace.  Can you find that and create a
>>>> ticket, including the complete trace?
>>>>
>>>> My apologies; the searchblox connector is a contribution which
>>>> obviously still has bugs.  With the trace though I should be able to get
>>>> you a patch.
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: Timo Selvaraj
>>>> Sent: 5/8/2015 6:46 PM
>>>> To: Karl Wright
>>>> Cc: user@manifoldcf.apache.org
>>>>
>>>> Subject: Re: File system continuous crawl settings
>>>>
>>>> Hi Karl,
>>>>
>>>> The only error message that seems to be continuously thrown in the
>>>> manifold log is:
>>>>
>>>> FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
>>>> java.lang.NullPointerException
>>>>
>>>> I do notice that the file that needs to be deleted is shown under the
>>>> Queue Status report and keeps jumping between “Processing” and “About to
>>>> Process” statuses every 30 seconds.
>>>>
>>>> Timo
>>>>
>>>>
>>>> On May 8, 2015, at 1:40 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>> Hi Timo,
>>>>
>>>> As I said, I don't think your configuration is the source of the delete
>>>> issue. I suspect the searchblox connector.
>>>>
>>>> In the absence of a thread dump, can you look for exceptions in the
>>>> manifoldcf log?
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: Timo Selvaraj
>>>> Sent: 5/8/2015 10:06 AM
>>>> To: user@manifoldcf.apache.org
>>>> Subject: Re: File system continuous crawl settings
>>>>
>>>> When I change the settings to the following, updated or modified
>>>> documents are now indexed, but deleting documents that have been removed
>>>> is still an issue:
>>>>
>>>> Schedule type: Rescan documents dynamically
>>>> Minimum recrawl interval: 5 minutes
>>>> Maximum recrawl interval: 10 minutes
>>>> Expiration interval: Infinity
>>>> Reseed interval: 60 minutes
>>>> No scheduled run times
>>>> Maximum hop count for link type 'child': Unlimited
>>>> Hop count mode: Delete unreachable documents
>>>>
>>>> Do I need to set the reseed interval to Infinity?
>>>>
>>>> Any thoughts?
>>>>
>>>>
>>>> On May 8, 2015, at 6:18 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>> I just tried your configuration here.  A deleted document in the file
>>>> system was indeed picked up as expected.
>>>>
>>>> I did notice that your "expiration" setting is, essentially, cleaning
>>>> out documents at a rapid clip.  With a 5 minute expiration interval and a
>>>> 5-10 minute recrawl window, documents become eligible for expiration
>>>> before they are ever recrawled.  You probably want one strategy or the
>>>> other, but not both.
>>>>
>>>> As for why a deleted document is "stuck" in Processing: the only thing
>>>> I can think of is that the output connection you've chosen is having
>>>> trouble deleting the document from the index.  What output connector are
>>>> you using?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <timo.selvaraj@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are testing the continuous crawl feature of the file system connector
>>>>> on a small folder, to verify that the continuous crawl job handles new
>>>>> documents added to the folder, documents removed from it, and modified
>>>>> documents.
>>>>>
>>>>> Here are the settings we use:
>>>>>
>>>>> Schedule type: Rescan documents dynamically
>>>>> Minimum recrawl interval: 5 minutes
>>>>> Maximum recrawl interval: 10 minutes
>>>>> Expiration interval: 5 minutes
>>>>> Reseed interval: 10 minutes
>>>>> No scheduled run times
>>>>> Maximum hop count for link type 'child': Unlimited
>>>>> Hop count mode: Delete unreachable documents
>>>>>
>>>>> Adding new documents seems to be getting picked up by the job; however,
>>>>> removals of documents and updates to documents are not being picked up.
>>>>>
>>>>> Am I missing any settings for the deletions or updates? I do see that
>>>>> the document that has been removed shows as Processing under Queue Status
>>>>> and the others show as Waiting for Processing.
>>>>>
>>>>> Any idea what setting is missing for the deletes/updates to be
>>>>> recognized and re-indexed?
>>>>>
>>>>> Thanks,
>>>>> Timo
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
