We have the same configuration
1. I am using a PostgreSQL DB with Tomcat 7 hosting MCF.
As an example, our test corpus of (currently) 4 million documents takes up about 4GB of PostgreSQL storage when fully vacuumed. This should only be used as a very rough guide.
2. How much DB size should be considered for such scenarios, as we have documents in the magnitude of TBs?
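To make the rough guide above concrete, here is a minimal back-of-the-envelope sketch. It assumes the database grows linearly with the number of documents (not with their total size on disk), and it treats 1 GB as 10^9 bytes. Both are my assumptions for illustration; the function name is hypothetical and the only measured data point is the 4 million documents / 4GB figure above, so treat the output as an order-of-magnitude hint only.

```python
# Reference point from this thread: ~4 million documents ~= 4 GB when fully vacuumed.
REFERENCE_DOCS = 4_000_000
REFERENCE_DB_BYTES = 4 * 10**9  # assumption: 1 GB == 10^9 bytes


def estimate_db_size_gb(doc_count: int) -> float:
    """Linearly extrapolate DB size (in GB) from the single reference point.

    Assumption: size scales with document count, which is only a rough guide.
    """
    bytes_per_doc = REFERENCE_DB_BYTES / REFERENCE_DOCS  # about 1000 bytes per document
    return doc_count * bytes_per_doc / 10**9


if __name__ == "__main__":
    for docs in (4_000_000, 40_000_000, 400_000_000):
        print(f"{docs:>12,} documents -> roughly {estimate_db_size_gb(docs):.0f} GB")
```

In other words, the number to plug in is how many documents you expect to crawl, not how many TBs they occupy. Measuring your own database after a representative crawl will give a far better figure than this extrapolation.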
We are running PostgreSQL within KVM VMs with a single master replicated to 3 other backup nodes (probably OTT, but we are aiming to replicate the configuration of each of the machines in our cluster as much as possible).
3. Does PostgreSQL run on VMs?
We are running ManifoldCF on each of the nodes in the cluster. The ZooKeeper locking allows us to crawl from each of them successfully.
4. What would be the ideal clustering approach: two different MCF servers managed by ZooKeeper, each having its own DB kept in sync with the other, behind a set of two load balancers; or two different MCF instances sharing a common clustered (active/passive) DB instance, behind a set of two load balancers?
IMHO, the biggest limiting factor will be the database, but it really depends on your usage.
7. Which of these approaches would yield better results?
Not yet. I'm currently experimenting with various options/approaches. HA tends not to lend itself to a one-size-fits-all approach, though at some point I'm sure there will be a 'Best Practices' guide. Feel free to keep asking questions in the interim.
8. Is there any definitive guide for high availability of MCF?