Hi Lalit,

Like all of these things, it depends ;-)

1.       I am using PostgreSQL DB with tomcat 7 hosting MCF.

We have the same configuration.

2.       How much DB size should be considered for such scenarios as we have documents in magnitude of TBs.

As an example, our test corpus of (currently) 4 million documents takes up about 4GB of PostgreSQL storage when fully vacuumed.  This should only be used as a very rough guide.
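
To make the arithmetic concrete, here is a minimal sketch that scales the figure above to other corpus sizes.  It assumes the database grows linearly with the number of documents (not with the size of their content), which is my assumption and not something we have measured; the function name and the 100 million example are purely illustrative.

```python
def estimate_db_size_gb(num_docs, ref_docs=4_000_000, ref_size_gb=4.0):
    """Linearly scale a reference measurement (ref_docs -> ref_size_gb) to num_docs.

    The defaults are the rough figures quoted above; linear scaling is an
    assumption, so treat the result as an order-of-magnitude guide only.
    """
    if ref_docs <= 0:
        raise ValueError("ref_docs must be positive")
    return num_docs * ref_size_gb / ref_docs


# Per-document footprint implied by the reference measurement (binary units).
per_doc_kb = 4.0 * 1024 * 1024 / 4_000_000
print(f"approx {per_doc_kb:.2f} KB of database per document")

# Hypothetical corpus of 100 million documents.
print(f"100M documents -> approx {estimate_db_size_gb(100_000_000):.0f} GB")
```

In other words, the number to plug in is your document count, not the number of TBs, and you should re-measure on your own data before committing to hardware.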

3.       Does PostgreSQL run on VMs.

We are running PostgreSQL within KVM VMs, with a single master replicated to 3 other backup nodes (probably OTT, but we are aiming to replicate the configuration of each of the machines in our cluster as much as possible).

4.       What would be the ideal clustering approach: having two different MCF servers managed by Zookeeper with each having its own  DB which are in sync with each other  managed by a set of two load balancers or two different MCF instances having a common clustered(active/passive) DB instance managed by set of two load balancers.

We are running ManifoldCF on each of the nodes in the cluster.  The ZooKeeper locking allows us to crawl from each of them successfully.

7.       Which of these approaches would yield better results?

IMHO - the biggest limiting factor will be the database but it really depends on your usage.

8.       Is there any definitive guide for high availability of MCF?

Not yet - I'm currently experimenting with various options/approaches.  HA tends not to lend itself to a one-size-fits-all approach, but at some point I'm sure there will be a 'Best Practices' guide.  Feel free to keep asking questions in the interim.

Regards,

Graeme