lucene-solr-user mailing list archives

From "Bruno Mannina" <>
Subject RE: Is Solr can do that ?
Date Mon, 24 Jun 2019 09:17:47 GMT
Hi Toke,

Thanks for sharing this experience; it is very useful for giving me a first overview of what
I will need.
To summarize, I will:
- learn about Tika
- ask questions such as how frequently the Solr data will be added/updated
- establish the number of users
- run a first test with a representative sample

And of course get some good expertise :)


-----Original Message-----
From: Toke Eskildsen []
Sent: Saturday, 22 June 2019 11:36
To: solr_user lucene_apache
Subject: Re: Is Solr can do that ?

Matheo Software Info <> wrote:
> My question is very simple ☺ I would like to know if Solr can process 
> around 30To of data (Pdf, Text, Word, etc…) ?

Simple answer: Yes. Assuming 30To means 30 terabyte.

> What is the best way to index this huge data ? several servers ?
> several shards ? other ?

As other participants have mentioned, it is hard to give numbers. What we can do is share experience.

We are doing web archive indexing and I guess there would be quite an overlap with your content,
as we also use Tika. One difference is that the images in a web archive are quite cheap to
index, so you'll probably need (relatively) more hardware than we use. Very roughly, we used
40 CPU-years to index 600 (700? I forget) TB of data in one of our runs. Scaled to your 30 TB,
this suggests something like 2 CPU-years, or a couple of months for a 16-core machine.
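The back-of-the-envelope arithmetic above can be checked directly. This small sketch uses the 40 CPU-years / 600 TB figures from the run described here and the 16-core machine given as an example; the corpus size and core count are just parameters you would swap for your own:

```python
# Scale Toke's observed indexing cost (40 CPU-years for 600 TB) to a 30 TB corpus.
cpu_years_per_tb = 40 / 600   # observed cost per terabyte in one run
corpus_tb = 30                # size of the corpus in the original question
cores = 16                    # example machine from the mail

cpu_years = cpu_years_per_tb * corpus_tb
wall_clock_months = cpu_years * 12 / cores

print(f"{cpu_years:.1f} CPU-years")       # → 2.0 CPU-years
print(f"{wall_clock_months:.1f} months")  # → 1.5 months on a 16-core machine
```

So "a couple of months" is in fact about six weeks of wall-clock time if the workload parallelizes cleanly, which extraction-heavy indexing usually does.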

This is just to get a ballpark: you will do yourself a huge favor by building a test setup
and processing 1 TB or so of your data to get _your_ numbers before you design your indexing
setup. It is our experience that the analysis part (Tika) takes much more power than the
Solr indexing part: in our last run we had 30-40 CPU-cores doing Tika (and related analysis)
feeding into a Solr running on a 4-core machine with spinning drives.
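The lopsided split described above (many extraction cores feeding a small Solr box) can be sketched as a simple fan-in pipeline. This is a hypothetical illustration only: `extract_text` and `index_batch` are stand-ins for Tika parsing and a Solr `/update` call, not real Tika or Solr APIs:

```python
# Hypothetical sketch: many extraction workers funnel into one cheap
# indexing step, mirroring 30-40 Tika cores feeding a 4-core Solr.
from concurrent.futures import ThreadPoolExecutor

def extract_text(doc_bytes: bytes) -> dict:
    # Stand-in for Tika parsing; in practice this step dominates CPU usage.
    return {"text": doc_bytes.decode("utf-8", errors="replace")}

def index_batch(docs: list) -> int:
    # Stand-in for posting a batch to Solr; returns the number of docs indexed.
    return len(docs)

def run_pipeline(raw_docs, extract_workers=32):
    # Parallelize the expensive extraction, then index the results in one batch.
    with ThreadPoolExecutor(max_workers=extract_workers) as pool:
        parsed = list(pool.map(extract_text, raw_docs))
    return index_batch(parsed)

print(run_pipeline([b"doc one", b"doc two", b"doc three"]))  # → 3
```

The design point is simply that the two stages scale independently, so a test run like the one suggested above tells you how many extraction workers one Solr node can absorb.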

As for the Solr setup for search, you need to describe your requirements in detail
before we can give you suggestions. Is the index updated continuously, in batches, or one-off?
How many concurrent users? Are the searches interactive or batch jobs? What kinds of aggregations
do you need?

In our setup we build separate collections that are merged down to single segments and never updated.
Our usage varies between very few interactive users and a lot of batch jobs. Scaling this specialized
setup to your corpus size would require about 3 TB of SSD, 64 GB of RAM and 4 CPU-cores, divided
among 4 shards. You are likely to need quite a lot more than that, so this is just to say
that at this scale the use of the index matters _a lot_.

- Toke Eskildsen

