lucene-solr-user mailing list archives

From "Howe, David" <David.H...@auspost.com.au>
Subject Index size increases disproportionately to size of added field when indexed=false
Date Tue, 13 Feb 2018 02:14:36 GMT

Hi,

We are using Solr 7.1.0 to index a database of addresses.  We have found that our index size
increases massively when we add one extra field to the index, even though that field is stored
and not indexed, and doesn’t contain a lot of data.  When this occurs, we also observe a
significant increase in response times and CPU usage on the Solr server.

When we run an index load without the problematic field present, the Solr index size is 5.5GB.
When we add the field into the index, the size grows to 13.3GB.  The field itself is a maximum
of 46 characters in length and on average is 19 characters. We have ~14,000,000 rows in total
to index, of which only ~200,000 have this field present at all (i.e. not null in the database).
Given that we don’t want to index the field, only store it, I would have thought (perhaps
naively) that the storage increase would be approximately 200,000 * 19 = 3.8M bytes = 3.6MB
rather than the 7.8GB we are seeing.
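
As a rough sanity check, the estimate can be reproduced in shell. This is only a sketch using
the figures quoted in this message; it assumes one byte per character and ignores any per-document
overhead Solr adds, and the variable names are illustrative.

```shell
#!/bin/sh
# Figures quoted in this message.
rows_with_field=200000   # rows where buildingName is not null
avg_chars=19             # average length; assumes 1 byte per character
size_without_gb=5.5      # index size without buildingName
size_with_gb=13.3        # index size with buildingName

# Naive estimate: number of populated rows times average length.
expected_bytes=$((rows_with_field * avg_chars))
expected_mib=$(awk -v b="$expected_bytes" 'BEGIN { printf "%.1f", b / 1048576 }')

# Observed growth: difference between the two reported index sizes.
observed_gb=$(awk -v a="$size_with_gb" -v b="$size_without_gb" 'BEGIN { printf "%.1f", a - b }')

echo "expected: $expected_bytes bytes (about $expected_mib MiB)"
echo "observed: $observed_gb GB"
```

The gap between the two numbers is roughly three orders of magnitude, which is why the stored
data alone does not seem to explain the growth.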

Some further background on what we are doing:

- We are using the Solr 7.1.0 docker image for our Solr server
- We are importing the data from an Oracle table using JDBC and the standard dataimport request
handler
- As we want to push the docker image to AWS ECR, which only accepts docker layers of a maximum
of 10GB, we load the index in four separate imports, stopping Solr gracefully in between each
load
- Our index contains 48 fields in total
- The problematic field is created through the API as follows:

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"buildingName",
      "type":"string",
      "stored":true,
      "indexed":false
    }
  }' http://localhost:8983/solr/address/schema
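
The add-field request above only sets "stored" and "indexed" explicitly, so any other property
comes from the field type. One way to see the effective definition might be to read the field
back through the Schema API with showDefaults=true. This is a sketch that assumes the same host
and collection as above; the helper function names are hypothetical.

```shell
#!/bin/sh
# Assumed base URL, copied from the add-field request above.
SOLR_URL="http://localhost:8983/solr/address"

# Build the Schema API URL for one field; showDefaults=true asks Solr to
# include properties inherited from the field type.
field_url() {
  printf '%s/schema/fields/%s?showDefaults=true' "$SOLR_URL" "$1"
}

# Fetch the effective field definition (requires a running Solr instance).
show_field() {
  curl -s "$(field_url "$1")"
}

# Usage: show_field buildingName
```

Comparing the output for buildingName with that of an existing field of the same type would
show whether anything beyond "stored" is enabled for it.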

I have also tried using SolrText instead of string, but that doesn't make a noticeable difference.

The number of records loaded also makes a difference.  If I only load 1,000,000 records
(that have a proportionate number of building names) then the size of the index with and without
buildingName is about the same (~1GB).
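
To compare loads of different sizes on the same footing, the index size and document counts can
be read from the CoreAdmin STATUS action after each load. This is a sketch that assumes a
standalone core named "address" on the same host as above; the helper function names are
hypothetical.

```shell
#!/bin/sh
# Assumed host and core name, taken from the URL in the add-field request.
SOLR_HOST="http://localhost:8983"
CORE="address"

# Build the CoreAdmin STATUS URL for one core.
status_url() {
  printf '%s/solr/admin/cores?action=STATUS&core=%s&wt=json' "$SOLR_HOST" "$1"
}

# Print the status JSON, which typically reports the index size and
# document counts (requires a running Solr instance).
core_status() {
  curl -s "$(status_url "$1")"
}

# Usage: run after each load, once with and once without buildingName:
#   core_status "$CORE"
```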

Is there some sort of limit that I'm not aware of that we are hitting, either number of fields
or size of data?  Is there some kind of corrupt data that I need to look for in the buildingName
field that could cause this (it's just a varchar2(46) field in Oracle)?

Thanks for your assistance,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  David.Howe@auspost.com.au

W  auspost.com.au
W  startrack.com.au
