lucene-solr-user mailing list archives

From Roman Chyla <roman.ch...@gmail.com>
Subject Re: The most efficient way to get un-inverted view of the index?
Date Wed, 17 Aug 2016 23:01:24 GMT
In case this helps someone, here is a solution (probably efficient enough
already, but I didn't profile it). It can deal with fields that have
DocValues, and it falls back to FieldCache-style un-inverting (the old
'stored' values) for fields that don't.



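// Packages below are for Lucene/Solr 6.x (UninvertingReader lives elsewhere
// in later versions). KVSetter and Transformer are my own helper interfaces.
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.DocValuesType;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.uninverting.UninvertingReader;
import org.apache.lucene.uninverting.UninvertingReader.Type;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.StrField;
import org.apache.solr.schema.TextField;
import org.apache.solr.schema.TrieIntField;
import org.apache.solr.search.SolrIndexSearcher;

private void unInvertedTheDamnThing(
    SolrIndexSearcher searcher,
    List<String> fields,
    KVSetter setter) throws IOException {

  LeafReader reader = searcher.getLeafReader();
  IndexSchema schema = searcher.getCore().getLatestSchema();
  List<LeafReaderContext> leaves = reader.getContext().leaves();

  Bits liveDocs;
  LeafReader lr;
  Transformer transformer;
  for (LeafReaderContext leaf : leaves) {
    int docBase = leaf.docBase;
    lr = leaf.reader();
    liveDocs = lr.getLiveDocs(); // null when the segment has no deletions
    FieldInfos fInfo = lr.getFieldInfos();

    for (String field : fields) {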

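      FieldInfo fi = fInfo.fieldInfo(field);
      if (fi == null) {
        continue; // this segment has no such field
      }
      SchemaField fSchema = schema.getField(field);
      DocValuesType fType = fi.getDocValuesType();
      Map<String,Type> mapping = new HashMap<String,Type>();
      final LeafReader unReader;

      if (fType.equals(DocValuesType.NONE)) {
        // No DocValues: un-invert the indexed terms. The mapping depends on
        // the schema field type (fType is always NONE in this branch).
        FieldType ft = fSchema.getType();
        if (ft instanceof TextField || ft instanceof StrField) {
          if (fSchema.multiValued()) {
            mapping.put(field, Type.SORTED_SET_BINARY);
          }
          else {
            mapping.put(field, Type.SORTED);
          }
        }
        else if (ft instanceof TrieIntField) {
          if (fSchema.multiValued()) {
            // caveat: the ords resolve to prefix-coded terms, not UTF-8 text
            mapping.put(field, Type.SORTED_SET_INTEGER);
          }
          else {
            // Trie fields are indexed as terms, not points (Lucene 6.x name)
            mapping.put(field, Type.LEGACY_INTEGER);
          }
        }
        else {
          continue;
        }
        unReader = new UninvertingReader(lr, mapping);
        // the wrapping reader reports the un-inverted DocValues type
        FieldInfo unInfo = unReader.getFieldInfos().fieldInfo(field);
        if (unInfo == null) {
          continue; // field is not indexed, nothing to un-invert
        }
        fType = unInfo.getDocValuesType();
      }
      else {
        unReader = lr;
      }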

      switch (fType) {
        case NUMERIC:
          transformer = new Transformer() {
            NumericDocValues dv = unReader.getNumericDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              // note: the cast truncates values that don't fit into an int
              int v = (int) dv.get(docId);
              setter.set(docBase, docId, v);
            }
          };
          break;
        case SORTED_NUMERIC:
          transformer = new Transformer() {
            SortedNumericDocValues dv = unReader.getSortedNumericDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              dv.setDocument(docId);
              int max = dv.count();
              for (int i = 0; i < max; i++) {
                setter.set(docBase, docId, (int) dv.valueAt(i));
              }
            }
          };
          break;
        case SORTED_SET:
          transformer = new Transformer() {
            SortedSetDocValues dv = unReader.getSortedSetDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              dv.setDocument(docId);
              for (long ord = dv.nextOrd();
                   ord != SortedSetDocValues.NO_MORE_ORDS;
                   ord = dv.nextOrd()) {
                final BytesRef value = dv.lookupOrd(ord);
                setter.set(docBase, docId, value.utf8ToString());
              }
            }
          };
          break;
        case SORTED:
          transformer = new Transformer() {
            SortedDocValues dv = unReader.getSortedDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              BytesRef v = dv.get(docId);
              if (v.length == 0)
                return; // the document has no value for this field
              setter.set(docBase, docId, v.utf8ToString());
            }
          };
          break;
        default:
          throw new IllegalArgumentException("The field " + field
              + " is of type that cannot be un-inverted");
      }

      int maxDoc = lr.maxDoc();
      for (int i = 0; i < maxDoc; i++) {
        // skip deleted documents
        if (liveDocs != null && !(i < liveDocs.length() && liveDocs.get(i))) {
          continue;
        }
        transformer.process(docBase, i);
      }
    }

  }
}
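
KVSetter and Transformer are not Lucene or Solr classes, and the message does not show them. Below is a minimal sketch of what they might look like, inferred only from how the method above calls them. MapKVSetter and KVSetterSketch are made-up names, and the two arrays merely stand in for one segment's live docs and field values, so treat this as an illustration rather than the original helpers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KVSetterSketch {

  // Hypothetical shape: one overload per value type the method passes in.
  interface KVSetter {
    void set(int docBase, int docId, int value);
    void set(int docBase, int docId, String value);
  }

  // Hypothetical shape: reads the value(s) of one document and hands them on.
  interface Transformer {
    void process(int docBase, int docId);
  }

  // Collects values per global doc id (docBase + docId). Multi-valued fields
  // simply append, because set() is called once per value.
  static class MapKVSetter implements KVSetter {
    final Map<Integer, List<Object>> values = new TreeMap<>();

    @Override
    public void set(int docBase, int docId, int value) {
      add(docBase + docId, value);
    }

    @Override
    public void set(int docBase, int docId, String value) {
      add(docBase + docId, value);
    }

    private void add(int globalDocId, Object value) {
      values.computeIfAbsent(globalDocId, k -> new ArrayList<>()).add(value);
    }
  }

  public static void main(String[] args) {
    MapKVSetter setter = new MapKVSetter();

    // Stand-in for one segment: docBase 100, three docs, doc 1 is deleted.
    int docBase = 100;
    boolean[] live = {true, false, true};
    int[] fieldValues = {7, 8, 9};

    Transformer transformer = new Transformer() {
      @Override
      public void process(int base, int docId) {
        setter.set(base, docId, fieldValues[docId]);
      }
    };

    // Same shape as the loop over maxDoc/liveDocs in the method above.
    for (int i = 0; i < live.length; i++) {
      if (!live[i]) {
        continue;
      }
      transformer.process(docBase, i);
    }

    System.out.println(setter.values); // prints {100=[7], 102=[9]}
  }
}
```

Passing docBase and docId separately leaves it to the setter to decide whether to key its cache by per-segment or by global document id; the sketch picks the global id.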

On Wed, Aug 17, 2016 at 1:22 PM, Roman Chyla <roman.chyla@gmail.com> wrote:
> Joel, thanks, but which of them? I've counted at least 4, if not more,
> different ways of how to get DocValues. Are there many functionally
> equal approaches just because devs can't agree on using one api? Or is
> there a deeper reason?
>
> Btw, the FieldCache is still there, both in Lucene (to be deprecated)
> and in Solr, but it is now package-private
>
> This is what removed the FieldCache:
> https://issues.apache.org/jira/browse/LUCENE-5666
> This is what followed: https://issues.apache.org/jira/browse/SOLR-8096
>
> And there is still code which un-inverts data from an index if no
> doc-values are available.
>
> --roman
>
> On Tue, Aug 16, 2016 at 9:54 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
>> You'll want to use org.apache.lucene.index.DocValues. The DocValues api has
>> replaced the field cache.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 16, 2016 at 8:18 PM, Roman Chyla <roman.chyla@gmail.com> wrote:
>>
>>> I need to read data from the index in order to build a special cache.
>>> Previously, in SOLR4, this was accomplished with FieldCache or
>>> DocTermOrds
>>>
>>> Now, I'm struggling to see what API to use; there are many of them:
>>>
>>> on lucene level:
>>>
>>> UninvertingReader.getNumericDocValues (and others)
>>> <IndexReader>.getNumericValues()
>>> MultiDocValues.getNumericValues()
>>> MultiFields.getTerms()
>>>
>>> on solr level:
>>>
>>> reader.getNumericValues()
>>> UninvertingReader.getNumericDocValues()
>>> and extensions to FilterLeafReader - e.g. the very interesting, but
>>> undocumented, facet accumulators (ex: NumericAcc)
>>>
>>>
>>> I need this for Solr, and ideally it would re-use the existing cache
>>> [i.e. the special cache uses other fields, which in the old Solr got
>>> loaded only once and were reused; a win-win situation]
>>>
>>> If I use reader.getValues() or FilterLeafReader will I be reading data
>>> every time the object is created? What would be the best way to read
>>> data only once?
>>>
>>> Thanks,
>>>
>>> --roman
>>>
