lucene-solr-user mailing list archives

From "Bryant, Michael" <michael.bry...@kcl.ac.uk>
Subject Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...
Date Fri, 10 Feb 2017 21:09:24 GMT
Darn, spoke too soon. Field collapsing throws off my facet counts where facet fields differ
within groups.
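
To illustrate the mismatch, here is a minimal Python sketch using made-up documents (the ids are hypothetical, and "first document per group wins" is a simplification of how collapsing picks a group head). It compares per-person counts of distinct items with per-person counts taken after keeping only one document per item_id:

```python
from collections import defaultdict

# Hypothetical documents: two descriptions of item i1, one of item i2.
docs = [
    {"id": "d1", "item_id": "i1", "person_id": ["p1"]},
    {"id": "d2", "item_id": "i1", "person_id": ["p1", "p2"]},
    {"id": "d3", "item_id": "i2", "person_id": ["p2"]},
]


def grouped_facet_counts(docs):
    """Count distinct item_ids per person (the group.facet-style count)."""
    items = defaultdict(set)
    for doc in docs:
        for person in doc["person_id"]:
            items[person].add(doc["item_id"])
    return {person: len(ids) for person, ids in items.items()}


def collapsed_facet_counts(docs):
    """Keep one document per item_id, then count persons on the survivors."""
    survivors = {}
    for doc in docs:
        # Simplification: the first document seen for each item_id is kept.
        survivors.setdefault(doc["item_id"], doc)
    counts = defaultdict(int)
    for doc in survivors.values():
        for person in doc["person_id"]:
            counts[person] += 1
    return dict(counts)


print(grouped_facet_counts(docs))    # {'p1': 1, 'p2': 2}
print(collapsed_facet_counts(docs))  # {'p1': 1, 'p2': 1}
```

Person p2 is attached to only one of the two descriptions of item i1, so once that description is collapsed away the item no longer contributes to p2's count.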

Back to the drawing board. FWIW, I tried hyperloglog (hll()) for the JSON facet aggregate
counts, and it has the same issue as unique() when used as the facet sort parameter: while
reasonably fast, it uses masses of memory.

Cheers,
~Mike

------
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 18:53, Bryant, Michael <michael.bryant@kcl.ac.uk> wrote:

Hi Tom,

Well, the collapsing query parser is… a much better solution to my problems! Thanks for
cluing me in to this, I love it when you can delete a load of hacks for something both simpler
and faster.

Best,
~Mike



On 10 Feb 2017, at 14:37, Tom Evans <tevans.uk@googlemail.com> wrote:

Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that is necessary for you.
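
As a rough sketch of what such a request could look like, the Python below builds query parameters that apply the collapsing query parser as a filter on item_id and facet on person_id. The q value and the helper name collapse_query_params are assumptions for illustration only; the field names come from the thread.

```python
from urllib.parse import urlencode


def collapse_query_params(query, collapse_field="item_id", facet_field="person_id"):
    """Build Solr request parameters that collapse results on one field.

    The helper name and defaults are hypothetical; only the parameter
    names (q, fq, facet, facet.field) and the {!collapse} local-params
    syntax are Solr's own.
    """
    return [
        ("q", query),
        ("fq", "{!collapse field=%s}" % collapse_field),
        ("facet", "true"),
        ("facet.field", facet_field),
    ]


# "*:*" (match everything) is an assumed query for the example.
params = collapse_query_params("*:*")
print(urlencode(params))
```

The encoded string can be appended to a select handler URL; the host, port and collection name depend on the installation and are left out here.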


Another option might be to use the hyperloglog function, hll(), instead
of unique(), which should give slightly better performance.
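
A minimal sketch of that swap, assuming the facet is named "people" and the aggregate "grouped_count" as in the original question: the Python below builds the json.facet value with the aggregate function as a parameter, so unique() and hll() can be compared side by side. The helper name people_facet is hypothetical.

```python
import json


def people_facet(aggregate="unique"):
    """Return a JSON facet definition that sorts person_id buckets by the
    number of distinct item_ids, using the given aggregate function name
    ("unique" for an exact count, "hll" for a HyperLogLog estimate)."""
    return {
        "people": {
            "type": "terms",
            "field": "person_id",
            "facet": {"grouped_count": "%s(item_id)" % aggregate},
            "sort": "grouped_count desc",
        }
    }


# Serialise the hll() variant for use as the json.facet request parameter.
json_facet = json.dumps(people_facet("hll"))
print(json_facet)
```

Because hll() is an estimate, the bucket order may differ slightly from the exact unique() order for people with near-equal item counts.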

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
<michael.bryant@kcl.ac.uk> wrote:
Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better performance, especially
with high cardinality facet fields. However, the one issue I can't seem to resolve is excessive
memory usage (and OOM errors) when trying to simulate the effect of "group.facet" to sort
facets according to a grouping field.

My situation, slightly simplified, is:

Solr 4.6.1

*   Doc set: ~200,000 docs
*   Grouping by item_id, an indexed, stored, single value string field with ~50,000 unique
values, ~4 docs per item
*   Faceting by person_id, an indexed, stored, multi-value string field with ~50,000 values
(w/ a very skewed distribution)
*   No docValues fields

Each document here is a description of an item, and there are several descriptions per item
in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives me facet counts
with the number of items rather than descriptions, and correctly sorted by descending item
count.
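
For reference, a sketch of the legacy request parameters described above, written in Python. The q value is an assumption, and group=true and facet=true are included because grouping and faceting have to be switched on for group.field and group.facet to take effect; the field names are those given in the list above.

```python
from urllib.parse import urlencode

# Legacy (non-JSON) faceting with grouped facet counts.
legacy_params = [
    ("q", "*:*"),                  # assumed match-all query
    ("group", "true"),             # enable result grouping
    ("group.field", "item_id"),    # one group per item
    ("group.facet", "true"),       # count groups, not documents, in facets
    ("facet", "true"),             # enable faceting
    ("facet.field", "person_id"),  # facet on people
]

query_string = urlencode(legacy_params)
print(query_string)
```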

With JSON facets I'm doing the equivalent like so:

&json.facet={
  "people": {
      "type": "terms",
      "field": "person_id",
      "facet": {
          "grouped_count": "unique(item_id)"
      },
      "sort": "grouped_count desc"
  }
}

This works, and is somewhat faster than legacy faceting, but it also produces a massive spike
in memory usage when (and only when) the sort parameter is set to the aggregate field. A server
that runs happily with a 512MB heap OOMs unless I give it a 4GB heap. With sort set to (the
default) "count desc" there is no memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when sorting JSON facets
by stats and if there’s anything I can do to mitigate it. I’ve tried reindexing with docValues
enabled on the relevant fields and it seems to make no difference in this respect.

Many thanks,
~Mike

