manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF two server setup
Date Fri, 23 Mar 2018 12:56:44 GMT
Hi Shashank,

As I mentioned earlier, file-based synchronization has been deprecated.  We
strongly recommend that you use Zookeeper-based synchronization.

I am very confused that you claim you can run jobs on specific cluster
members.  Job work is distributed among all cluster members, and you should
see the same jobs no matter which tomcat webapp you go to when you view the
jobs, since they are stored in the database.  The global database
configuration should be in Zookeeper, and the same Zookeeper instance
should be referenced by all cluster members.

In any case, if what you were trying to achieve by all of this was parallel
execution of jobs, you probably did all that work based on a false
assumption.  ManifoldCF will crawl jobs in parallel even with only one
agents process, but there will be a delay before the "second" job's
documents get served.  It is not a question of having multiple cluster
members; it is because ManifoldCF puts its job queue in the database.
Documents are given a "docpriority", which is a number, at the time they
are queued.  The query that pulls documents out of the queue for servicing
orders documents by docpriority.  What that means in practice is that when
you start your second job, ALL the documents that were already queued for
the first job must be processed before any new documents from the second
job get looked at.  This is, unfortunately, unavoidable.  You can, however,
reset the document priorities for a job by pausing it and resuming it -- so
if you start your second job, and then pause and resume the first, the
documents for the first job get reprioritized.
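
The queueing behavior described above can be illustrated with a toy model.
This is only a sketch, not ManifoldCF code: the DocQueue class, its methods,
and the job names are hypothetical, and it assumes the priority is just a
counter that grows as documents are queued.  The real docpriority
calculation is more involved.

```python
import heapq
from itertools import count


class DocQueue:
    """Toy stand-in for a database-backed queue ordered by docpriority."""

    def __init__(self):
        # Assumption for illustration: priorities grow as documents are queued.
        self._next_priority = count(1)
        self._heap = []  # entries are (docpriority, job, doc_id)

    def enqueue(self, job, doc_ids):
        """Queue documents, assigning each a docpriority at queue time."""
        for doc_id in doc_ids:
            heapq.heappush(self._heap, (next(self._next_priority), job, doc_id))

    def reprioritize(self, job):
        """Model pause + resume: the job's queued documents get new priorities."""
        moved = [entry for entry in sorted(self._heap) if entry[1] == job]
        self._heap = [entry for entry in self._heap if entry[1] != job]
        heapq.heapify(self._heap)
        self.enqueue(job, [doc_id for _, _, doc_id in moved])

    def drain(self):
        """Pull documents in docpriority order, lowest number first."""
        order = []
        while self._heap:
            _, job, doc_id = heapq.heappop(self._heap)
            order.append((job, doc_id))
        return order


def processing_order(reprioritize_first):
    """Return the job each processed document belongs to, in service order."""
    queue = DocQueue()
    queue.enqueue("job1", ["a", "b", "c"])  # first job queues its documents
    queue.enqueue("job2", ["x", "y"])       # second job is started later
    if reprioritize_first:
        queue.reprioritize("job1")          # pause and resume the first job
    return [job for job, _ in queue.drain()]
```

Without reprioritization, processing_order(False) serves every job1
document before any job2 document.  With it, job1's documents no longer
all sort ahead of job2's; in this toy model they land behind them, which
is a property of the counter assumption rather than of ManifoldCF itself.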

Reprioritization is expensive when the job queue is large, so it is
unlikely we'd consider "automatically" reprioritizing all documents
whenever a job is started.

Hope this helps,

Karl


On Fri, Mar 23, 2018 at 8:24 AM, Shashank Raj <shashank.raj2009@gmail.com>
wrote:

> Hi Karl,
>                 We followed your documentation and made a multi-node setup,
> both with file-based synchronisation and with the Zookeeper-based one. With
> the zk-based setup, we found that if we run two jobs in two separate Tomcat
> processes, only one job will pick up and post records. The other job will
> begin to work only if we pause the first one. Is this how the multiprocess
> model is implemented? In our case both Tomcat processes should crawl and
> send documents in parallel.
> Also we found that the performance of file based synchronisation was not
> as good as the zk based one.
>
> Thanks and regards,
> Shashank
>
> On 13-Mar-2018 12:48 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>
>> Hi Raj,
>>
>> First, I'd start by running the multiprocess example on ONE machine with
>> multiple processes.  That's what the multiprocess-file-example
>> demonstrates, although it can be easily generalized to multiple machines,
>> PROVIDED there is a shared file system available, like NFS.  If not, you
>> must use the Zookeeper deployment model if there are multiple machines.
>> The file synch has been deprecated and you will likely find it quite hard
>> to work with in a multi-machine environment.
>>
>> The basic way you work with the examples is to use them on a single
>> machine, get them working initially, and then port one change at a time.
>> Use the scripts provided to start the database instance, initialize the
>> database, and start the various processes.  THEN, when you are satisfied
>> with how that works, you can start making changes.  The changes are, in
>> order:
>>
>> - Using Postgresql rather than HSQLDB
>> - Using Tomcat rather than Jetty
>> - Using multiple machines, rather than one
>>
>> To answer your specific questions:
>>
>> (1) The files described are in common for all the examples, and are a
>> level above where you are looking.  From the example directories, you can
>> find them under ../web (or ../web-proprietary).
>> (2) Yes, once you set up your connection to Postgresql in properties.xml,
>> you DO need to run initialize-database, or the schema will not be created.
>> (3) When you start different agents processes, even on different
>> machines, each one must have its own ID.  The start scripts demonstrate how
>> you do that.
>>
>> Karl
>>
>>
>> On Tue, Mar 13, 2018 at 2:22 AM, Shashank Raj <shashank.raj2009@gmail.com>
>> wrote:
>>
>>> Hi Karl,
>>>             In the documentation for "Simplified multiprocess model
>>> using file based synchronisation", it is indicated that the war files
>>> should be taken from "web" folder of multiprocess-file-example. But there
>>> is no such folder or file. Could you tell us where we should take the
>>> war files from in this case?
>>>
>>> Regarding the database, in the steps you have asked us to run the
>>> start-database and initialize-database script files, but we have deployed
>>> it using pgsql, and the database is getting created and initialized
>>> automatically with the single-process file example for now.
>>> Now we are switching to the multiprocess model. Do we still need to run
>>> those scripts?
>>>
>>> And should we run start-agent in one server and start-agent2 in another
>>> server?
>>>
>>>
>>> On 20-Feb-2018 9:21 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>>
>>>> Hi Shashank,
>>>>
>>>> You can have multiple servers running against the same database, BUT if
>>>> you do so, they must be individually configured to have their own IDs, and
>>>> they must share locks and by extension, must use the same zookeeper.
>>>> See multiprocess-zk-example in the binary distribution.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Feb 20, 2018 at 6:58 AM, Shashank Raj <
>>>> shashank.raj2009@gmail.com> wrote:
>>>>
>>>>> Hi Karl,
>>>>>             I have set up ManifoldCF using Tomcat on two servers with a
>>>>> load balancer in front of them. Both instances of ManifoldCF connect to
>>>>> the same database. The scenario is to have a backup server running all
>>>>> the time. Is this setup correct, or does ManifoldCF support only a
>>>>> single-server setup?
>>>>>
>>>>> Also, I am getting an error: Duplicate key value violates unique
>>>>> constraint "repohistory_pkey". Detail: Key(id)=(1519119640499) already
>>>>> exists.
>>>>> This error pops up upon running jobs with different repositories.
>>>>>
>>>>> Our ManifoldCF job setup is as follows: File System > Tika Content
>>>>> Extractor > Solr Output Connection.
>>>>>
>>>>> Thanks and regards.
>>>>>
