lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Fwd: Solr dynamic field blowing up the index size
Date Tue, 21 Feb 2017 16:33:17 GMT
Did you reuse the schema or rebuild it on top of the latest examples?
I ask because the latest example schema enables docValues for strings
at the fieldType level.
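
A rough way to see which part of the index grew is to total the files
in the index directory by extension. In Lucene's file naming, .dvd/.dvm
hold doc values and .nvd/.nvm hold norms. A minimal Python sketch,
assuming the index sits on a local disk; the function names are made up
for illustration:

```python
import os
from collections import Counter


def size_by_extension(index_dir):
    """Sum file sizes in bytes per extension for all files under index_dir."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(index_dir):
        for name in filenames:
            # Files such as segments_N have no extension; group them together.
            ext = os.path.splitext(name)[1] or "(none)"
            totals[ext] += os.path.getsize(os.path.join(dirpath, name))
    return totals


def report(index_dir):
    """Print extensions from largest to smallest with their share of the total."""
    totals = size_by_extension(index_dir)
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero on an empty dir
    for ext, size in totals.most_common():
        print(f"{ext:8} {size / 2**20:12.1f} MiB {100 * size / grand_total:5.1f}%")
```

Running report() once on the Solr 5 index directory and once on the
Solr 6 one (for example report("server/solr/collection1/data/index"),
a path assumed here) should show which extension accounts for the gap.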

I would do a diff of the schemas to see what changed. If they look
very different and you are looking for tools to normalize/extract
elements from schemas, you may find my latest Revolution presentation
useful for that:
https://www.slideshare.net/arafalov/rebuilding-solr-6-examples-layer-by-layer-lucenesolrrevolution-2016
(e.g. slide 20). The video is also linked there, at the end.
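
If a plain text diff is too noisy, comparing only the fieldType
attributes can narrow things down. The following is a small Python
sketch of that idea, not the tooling from the presentation; the
function names are hypothetical and the two inline schemas are
stand-ins for the real Solr 5 and Solr 6 files:

```python
import xml.etree.ElementTree as ET


def field_type_attrs(schema_xml):
    """Map each fieldType name to its attributes (analyzer chains are ignored)."""
    root = ET.fromstring(schema_xml)
    return {
        ft.get("name"): {k: v for k, v in ft.attrib.items() if k != "name"}
        for ft in root.iter("fieldType")
    }


def diff_field_types(old_xml, new_xml):
    """Return {name: (old_attrs, new_attrs)} for fieldTypes whose attributes differ."""
    old, new = field_type_attrs(old_xml), field_type_attrs(new_xml)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    # Tiny inline stand-ins for the two schema files (illustrative only).
    old_schema = '<schema><fieldType name="string" class="solr.StrField"/></schema>'
    new_schema = ('<schema><fieldType name="string" class="solr.StrField" '
                  'docValues="true"/></schema>')
    for name, (before, after) in diff_field_types(old_schema, new_schema).items():
        print(name, before, "->", after)
```

With real files, read each schema into a string first and pass the two
strings to diff_field_types(). A fieldType present in only one schema
shows up with None on the other side.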

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 21 February 2017 at 11:18, Mike Thomsen <mikerthomsen@gmail.com> wrote:
> Correct me if I'm wrong, but heavy use of doc values should actually blow
> up the size of your index considerably if they are in fields that get sent
> a lot of data.
>
> On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel <pratik@semandex.net> wrote:
>
>> Thanks for the reply. I can see that in Solr 6, more than 50% of the index
>> directory is occupied by files with the ".nvd" extension. They are something
>> related to norms and doc values.
>>
>> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>>
>> > Did you look in the data directories to check what index file extensions
>> > contribute most to the difference? That could give a hint.
>> >
>> > Regards,
>> >     Alex
>> >
>> > On 21 Feb 2017 9:47 AM, "Pratik Patel" <pratik@semandex.net> wrote:
>> >
>> > > Here is the same question on Stack Overflow, where it is better formatted.
>> > >
>> > > http://stackoverflow.com/questions/42370231/solr-dynamic-field-blowing-up-the-index-size
>> > >
>> > > Recently, I upgraded from Solr 5.0 to Solr 6.4.1. My app runs fine, but
>> > > the index is far too large with Solr 6. In Solr 5, the index size was
>> > > about 15GB, and in Solr 6, for the same data, it is 300GB! I cannot work
>> > > out what contributes to such a huge difference in Solr 6.
>> > >
>> > > I have been able to identify a field which is blowing up the size of
>> > > the index. It is as follows.
>> > >
>> > > <dynamicField name="*_note" type="text_general" indexed="true"
>> > > stored="true" multiValued="true"  />
>> > >
>> > > <field name="textproperty" type="text_general" indexed="true"
>> > > stored="false" multiValued="true"  />
>> > > <copyField source="*_note" dest="textproperty"/>
>> > >
>> > > When this field is commented out, the index size drops to less than
>> > > 10GB.
>> > >
>> > > This field is of type text_general. Following is the definition of this
>> > > type.
>> > >
>> > > <fieldType name="text_general" class="solr.TextField"
>> > > positionIncrementGap="100">
>> > >       <analyzer type="index">
>> > >         <charFilter class="solr.HTMLStripCharFilterFactory" />
>> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
>> > >         <filter class="solr.LowerCaseFilterFactory"/>
>> > >         <charFilter class="solr.PatternReplaceCharFilterFactory"
>> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> > >         <filter class="solr.WordDelimiterFilterFactory"
>> > > protected="protwords.txt" generateWordParts="1"
>> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> > > catenateAll="0" splitOnCaseChange="0"/>
>> > >         <filter class="solr.KStemFilterFactory" />
>> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
>> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> > > />
>> > >       </analyzer>
>> > >       <analyzer type="query">
>> > >         <charFilter class="solr.HTMLStripCharFilterFactory" />
>> > >         <tokenizer class="solr.StandardTokenizerFactory"/>
>> > >         <filter class="solr.LowerCaseFilterFactory"/>
>> > >         <charFilter class="solr.PatternReplaceCharFilterFactory"
>> > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
>> > >         <filter class="solr.WordDelimiterFilterFactory"
>> > > protected="protwords.txt" generateWordParts="1"
>> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> > > catenateAll="0" splitOnCaseChange="0"/>
>> > >         <filter class="solr.KStemFilterFactory" />
>> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
>> > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
>> > > />
>> > >       </analyzer>
>> > >   </fieldType>
>> > >
>> > > A few things I did to debug this issue:
>> > >
>> > >    - I have ensured that the field type definition is the same as the one
>> > >    I was using in Solr 5 and that it is also valid in version 6. This
>> > >    field type takes a list of stopwords to be ignored during indexing. I
>> > >    have supplied the same list of stopwords that we were using in Solr 5.
>> > >    I have verified that the path to this file is correct and that it
>> > >    loads fine in the Solr admin UI. When I analyse these fields using the
>> > >    "Analysis" tab of the Solr admin UI, I can see that stopwords are
>> > >    being filtered out. However, when I query with some of these
>> > >    stopwords, I do get results back, which makes me think that the
>> > >    stopwords are probably being indexed.
>> > >
>> > > Any idea what could increase the size of the index by so much in Solr 6?
>> > >
>> >
>>
