Hi Lalit,

If I were you, I'd do a sample crawl of a characteristic subset of your documents, and then assess the space required by the database for that.  There's no way I can assess this in advance, because each connector has different space requirements in the database, and it depends to some degree on your documents as well -- specifically, the document metadata.
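To make the sample-crawl idea concrete, here is a minimal sketch of the extrapolation in Python. It assumes that database size grows roughly linearly with document count, which may not hold for every connector. The function names, the headroom parameter, and the sample figures in the usage note are illustrative assumptions, not measurements. One way to get the sample size is to run SELECT pg_database_size(current_database()) after the sample crawl finishes.

```python
def estimate_db_bytes(sample_docs, sample_db_bytes, total_docs, headroom=0.0):
    """Linearly extrapolate database size from a sample crawl.

    sample_docs     -- number of documents in the sample crawl
    sample_db_bytes -- database size measured after that crawl
    total_docs      -- document count to project for
    headroom        -- extra fraction to reserve (assumption: e.g. 0.5 = +50%)
    """
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    per_doc = sample_db_bytes / sample_docs
    return per_doc * total_docs * (1.0 + headroom)


def projected_sizes(sample_docs, sample_db_bytes, initial_docs,
                    docs_per_year, years, headroom=0.0):
    """Return [(year, estimated_bytes), ...] for year 0 through `years`."""
    return [
        (year, estimate_db_bytes(sample_docs, sample_db_bytes,
                                 initial_docs + year * docs_per_year,
                                 headroom))
        for year in range(years + 1)
    ]
```

For example, projected_sizes(100_000, 500_000_000, 10_000_000, 10_000_000, 2) projects the database size at the start and after each of the first two years, given a hypothetical 100,000-document sample that produced a 500 MB database.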

You should also read up on PostgreSQL maintenance procedures, because vacuuming frequency will determine how much extra disk space you will require due to dead tuples.
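As a rough illustration of the dead-tuple point, the sketch below computes what fraction of each table's tuples are dead, given live and dead counts such as the n_live_tup and n_dead_tup columns of PostgreSQL's pg_stat_user_tables view. The helper names and the 0.2 threshold are assumptions chosen for illustration, and the counts in that view are estimates, so treat the result as a hint rather than an exact measure of wasted space.

```python
def dead_tuple_fraction(n_live_tup, n_dead_tup):
    """Fraction of a table's tuples that are dead (0.0 for an empty table)."""
    total = n_live_tup + n_dead_tup
    if total == 0:
        return 0.0
    return n_dead_tup / total


def tables_needing_vacuum(stats, threshold=0.2):
    """Return names of tables whose dead-tuple fraction exceeds `threshold`.

    stats     -- iterable of (table_name, n_live_tup, n_dead_tup)
    threshold -- illustrative cutoff (assumption), not a recommended setting
    """
    return [
        name
        for name, live, dead in stats
        if dead_tuple_fraction(live, dead) > threshold
    ]
```

Calling tables_needing_vacuum([("table_a", 700, 300), ("table_b", 990, 10)]) would flag only table_a; both table names are placeholders.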

Thanks,
Karl





On Wed, Apr 16, 2014 at 9:53 AM, lalit jangra <lalit.j.jangra@gmail.com> wrote:
Thanks Karl,

I also want to know how to size disks for such a setup.  I assume most of the disk space will be taken by the DB, which is PostgreSQL here.  What size should I start with, and what should the expansion policy be, keeping in mind that I have a minimum of 10 million documents at the start and similar volumes will be added each year?


On Wed, Apr 16, 2014 at 12:51 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Lalit,

When operating in a clustered scenario, ManifoldCF will not work with separate DB instances, even if they are kept in sync.  It can only operate under conditions where transactional integrity is maintained, which means a single common clustered DB instance.

I'll let others talk to your other points.

(Graeme, are you following this?)

Karl



On Wed, Apr 16, 2014 at 7:40 AM, lalit jangra <lalit.j.jangra@gmail.com> wrote:

Hi,

 

I am using MCF to crawl multiple sources holding around 10-15 million documents initially, with similar volumes added each year, and I want it clustered in high-availability mode. With that in mind, I have some questions.

1. I am using a PostgreSQL DB, with Tomcat 7 hosting MCF.

2. How much DB size should be planned for such a scenario, given that our documents total in the order of terabytes?

3. Does PostgreSQL run on VMs?

4. What would be the ideal clustering approach? The first option is two different MCF servers managed by ZooKeeper, each with its own DB, with the DBs kept in sync with each other, managed by a set of two load balancers. The second option is two different MCF instances sharing a common clustered (active/passive) DB instance, managed by a set of two load balancers.

5. If I use the first approach (two MCF servers managed by ZooKeeper, each with its own DB, kept in sync with each other), I take on the extra task of synchronizing both DB instances.

6. If I use the second approach (two MCF instances sharing a common clustered active/passive DB instance), I have a set of clustered DBs.

7. Which of these approaches would yield better results?

8. Is there any definitive guide for high availability of MCF?

Regards,

Lalit.

 





--
Regards,
Lalit Jangra.