lucene-solr-user mailing list archives

From "Jürgen Wagner (DVT)" <juergen.wag...@devoteam.com>
Subject Re: Hardware requirement for 500 million documents
Date Sun, 04 Jan 2015 16:21:40 GMT
Hi Ali,
  the sizing is not just determined by the number of indexed documents
(and even less by the number of concurrent users).

- Document volume (number of documents, amount of text data to be
indexed with each document, number and types of fields, field
cardinality) guides you to the number of primary shards or collections
you want in your environment.

- Query volume determines the replication factor you need to maintain
acceptable response times.

- The amount of concurrency (e.g., do you primarily insert new
documents and then query them, or is there also a significant deletion
process running in parallel? Partial updates count as
deletion+insertion) and the frequency of required index updates also
influence the sizing.

- Processing (document-to-text conversion, extraction, enrichment, ...)
is usually handled outside Solr, but it has to be taken into account
when scaling the hardware for the entire platform.

Some figures you may want to know before tackling this project:

- Are there different types of documents (e.g., text, media, data) that
have different textual amounts for indexing (e.g., plain text ~100%,
HTML ~90%, Microsoft Word ~15%, PDF ~10%, ...) to be handled?

- What are the size distributions (possibly over these types of documents)?

- What is the expected update frequency? Can you do incremental crawling?

- What types of attributes and facets are you planning to have for these
documents?

- How fresh an index do you need?

- Is this concurrent indexing and querying, or will indexing happen,
e.g., at night, while users query the platform during the day?

- What are the types of typical queries issued by users?

- Will you have to take security into account (possibly leading to large
Boolean expressions added to queries to filter by entitlement groups)?
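To make the last point concrete, entitlement filtering is often done by attaching a filter query over a permissions field to every search. A minimal sketch of how such a clause grows with the user's group memberships (the field name `acl` and the group values are assumptions for illustration):

```python
# Sketch: build a Solr filter-query (fq) clause restricting results to
# documents visible to any of the user's entitlement groups.
# The field name "acl" is a hypothetical schema choice.

def entitlement_fq(groups):
    """Return an fq clause matching documents tagged with any group."""
    # Real group names may need Solr query-syntax escaping; omitted
    # here for brevity.
    clause = " OR ".join(f'"{g}"' for g in sorted(groups))
    return f"acl:({clause})"

print(entitlement_fq({"sales", "engineering"}))
```

A user in hundreds of groups yields a correspondingly large Boolean expression on every query, which is exactly the sizing concern mentioned above.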

This will give you a first direction. Then build a prototype, measure
representative figures for scaling, and base your estimates on those
measurements.

Best regards,
--Jürgen




On 04.01.2015 15:36, Ali Nazemian wrote:
> Hi,
> I was wondering what is the hardware requirement for indexing 500 million
> documents in Solr? Suppose maximum number of concurrent users in peak time
> would be 20.
> Thank you very much.
>


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wagner@devoteam.com
<mailto:juergen.wagner@devoteam.com>, URL: www.devoteam.de
<http://www.devoteam.de/>

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071


