lucene-solr-user mailing list archives

From Jamie Johnson <>
Subject Re: Post Processing Solr Results
Date Thu, 01 Sep 2011 17:23:43 GMT
Ok, I think I've got it.  The issue was that I can't modify the
offset and start params when the search is a distributed one;
otherwise the correct offset and max are lost.  A simple check in
prepare fixed this.
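
The check itself is not shown in the thread. The following is only a minimal sketch of the idea, written against plain Java maps rather than the Solr API: paging parameters are widened for post-filtering only when the request is not distributed. The class name, the `OVERFETCH` factor, and the way the parameters are widened are all hypothetical; in a real component the `distributed` flag would have to come from Solr itself (shard sub-requests, for example, carry an `isShard=true` parameter).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical guard: widen paging for post-filtering only on non-distributed requests. */
class PagingGuard {
    // Hypothetical over-fetch factor so a page can still be filled after filtering.
    static final int OVERFETCH = 3;

    /** Returns a copy of params; start/rows are changed only when the request is not distributed. */
    static Map<String, String> adjust(Map<String, String> params, boolean distributed) {
        Map<String, String> out = new HashMap<>(params);
        if (distributed) {
            // Leave start/rows alone so the original offset and max are not lost.
            return out;
        }
        int start = Integer.parseInt(params.getOrDefault("start", "0"));
        int rows = Integer.parseInt(params.getOrDefault("rows", "10"));
        // Fetch from the top and ask for extra rows; the post-filter pages on its own.
        out.put("start", "0");
        out.put("rows", String.valueOf((start + rows) * OVERFETCH));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("start", "20");
        params.put("rows", "10");
        System.out.println("local:       " + adjust(params, false));
        System.out.println("distributed: " + adjust(params, true));
    }
}
```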

On Thu, Sep 1, 2011 at 11:10 AM, Jamie Johnson <> wrote:
> Ok, so I feel like I'm 90% of the way there.  For standard queries
> things work fine, but for distributed queries I'm running into a bit
> of an issue.  Right now the queries run fine but when doing
> distributed queries (using SolrCloud) the numFound is always getting
> set to the number of requested rows.  Can anyone shed some light on
> why this might be happening?
> On Tue, Aug 30, 2011 at 8:53 AM, Jamie Johnson <> wrote:
>> This might work in conjunction with post-processing to help
>> pare down the results, but the logic for the actual access to the data
>> is too complex to have entirely in Solr.
>> On Mon, Aug 29, 2011 at 2:02 PM, Erick Erickson <> wrote:
>>> It's reasonable, but post-filtering is often difficult because you have
>>> too many documents to wade through. If you can see any way
>>> at all to just include a clause in the query, you'll save a world
>>> of effort...
>>> Is there any way you can include a value in some kind of
>>> "permissions" field? Let's say you have a document that
>>> is only to be visible for "tier 1" customers. If your permissions
>>> field contained the tiers (e.g. tier0, tier1), then a simple
>>> AND permissions:tier1 would do the trick...
>>> I know this is a trivial example, but you see where this is headed.
>>> The documents can contain as many of these tokens in permissions
>>> as you want. As long as you can string together a clause
>>> like "AND permissions:(A OR B OR C)" and not have the clause
>>> get ridiculously long (as in thousands of values), that works best.
>>> Any such scheme depends upon being able to assign the documents
>>> some kind of code that doesn't change too often (because when it does
>>> you have to re-index) and figure out, at query time, what permissions
>>> a user has.
>>> Using FieldCache or low-level Lucene routines can answer the question
>>> "Does doc X contain token Y in field Z" reasonably easily. What it has
>>> a hard time doing is answering "For document X, what are all the values
>>> in the inverted index in field Z".
>>> If this doesn't make sense, could you explain a bit more about your
>>> permissions model?
>>> Hope this helps
>>> Erick
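
The "permissions" clause described above can be sketched as a small query-string helper. This is only an illustration under stated assumptions: the class and method names are hypothetical, the field name `permissions` comes from the example in the message, and tokens are assumed to be plain alphanumeric codes, so anything else is rejected rather than escaped.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds a clause like permissions:(A OR B OR C). */
class PermissionsClause {
    /** Builds field:(t1 OR t2 ...) from the tokens a user holds. */
    static String build(String field, List<String> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("user has no permission tokens");
        }
        for (String t : tokens) {
            // Assumption: tokens are simple codes; reject anything that would need escaping.
            if (!t.matches("[A-Za-z0-9_]+")) {
                throw new IllegalArgumentException("unsupported token: " + t);
            }
        }
        return field + ":(" + String.join(" OR ", tokens) + ")";
    }

    /** ANDs the permissions clause onto the user's query. */
    static String restrict(String userQuery, String clause) {
        return "(" + userQuery + ") AND " + clause;
    }

    public static void main(String[] args) {
        String clause = build("permissions", Arrays.asList("tier0", "tier1"));
        System.out.println(restrict("title:solr", clause));
    }
}
```

In Solr the same clause could also be sent as a separate filter query (`fq`) instead of being appended to the main query string.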
>>> On Mon, Aug 29, 2011 at 11:46 AM, Jamie Johnson <> wrote:
>>>> Thanks guys, perhaps I am just going about this the wrong way.  So let
>>>> me explain my problem and perhaps there is a more appropriate
>>>> solution.  What I need to do is basically hide certain results based
>>>> on some passed in user parameter (say their service tier for
>>>> instance).  What I'd like to do is have some way to plug in my custom
>>>> logic to basically remove certain documents from the result set using
>>>> this information.  Now that being said I technically don't need to
>>>> remove the documents from the full result set, I really only need to
>>>> remove them from the current page (but still ensuring that a page is
>>>> filled and sorted).  At present I'm trying to see if there is a way
>>>> for me to add this type of logic after the QueryComponent has
>>>> executed, perhaps by going through the DocListAndSet at this point and
>>>> then intersecting the DocIdSet with a DocIdSet which would filter out
>>>> the stuff I don't want seen.  Does this sound reasonable or like a
>>>> fool's errand?
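
The page-level requirement described above (hide documents, but still return a full, sorted page) can be sketched without any Solr classes. This is a minimal illustration, not the component discussed in the thread: `PageFilter`, the ranked array of document ids, and the `allowed` predicate are all hypothetical stand-ins for the sorted result list and the external access check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical post-filter: walks ranked results and fills one page with visible docs. */
class PageFilter {
    /**
     * @param rankedDocIds document ids already in sort order
     * @param allowed      access check for a single document id
     * @param start        number of visible documents to skip
     * @param rows         page size
     */
    static List<Integer> page(int[] rankedDocIds, IntPredicate allowed, int start, int rows) {
        List<Integer> page = new ArrayList<>();
        if (rows <= 0) {
            return page;
        }
        int visibleSeen = 0;
        for (int id : rankedDocIds) {
            if (!allowed.test(id)) {
                continue; // hidden documents do not count toward the offset
            }
            if (visibleSeen++ < start) {
                continue; // still skipping earlier pages
            }
            page.add(id);
            if (page.size() == rows) {
                break; // page is full, stop early
            }
        }
        return page;
    }

    public static void main(String[] args) {
        int[] ranked = {5, 3, 8, 1, 9, 2};
        // Assumption for the demo: only odd ids are visible to this user.
        System.out.println(page(ranked, id -> id % 2 == 1, 2, 2));
    }
}
```

Because the loop stops as soon as the page is full, the access check runs only on the documents needed for the requested page, not on the whole result set.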
>>>> On Mon, Aug 29, 2011 at 10:51 AM, Erik Hatcher <> wrote:
>>>>> I haven't followed the details, but what I'm guessing you want here is
>>>>> Lucene's FieldCache.  Perhaps something along the lines of how faceting uses it (in
>>>>>   FieldCache.DocTermsIndex si = FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(),
>>>>>        Erik
>>>>> On Aug 29, 2011, at 09:58 , Erick Erickson wrote:
>>>>>> If you're asking whether there's a way to find, say,
>>>>>> all the values for the "auth" field associated with
>>>>>> a document... no. The nature of an inverted
>>>>>> index makes this hard (think of finding all
>>>>>> the definitions in a dictionary where the word
>>>>>> "earth" was in the definition).
>>>>>> Best
>>>>>> Erick
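
The dictionary analogy above can be made concrete with a toy inverted index. This is only a sketch of the data structure's shape, not of Lucene's actual implementation; the class and method names are hypothetical. Checking whether a document contains a known token is a single lookup, while listing all tokens of one document means scanning every term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index for one field: token -> ids of the documents containing it. */
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    void add(int docId, String... tokens) {
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
        }
    }

    /** "Does doc X contain token Y?" is one map lookup plus one set lookup. */
    boolean contains(int docId, String token) {
        Set<Integer> docs = postings.get(token);
        return docs != null && docs.contains(docId);
    }

    /** "What are all the tokens of doc X?" has to visit every term in the index. */
    List<String> tokensOf(int docId) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TinyInvertedIndex auth = new TinyInvertedIndex();
        auth.add(1, "tier0", "tier1");
        auth.add(2, "tier1");
        System.out.println(auth.contains(2, "tier1"));
        System.out.println(auth.tokensOf(1));
    }
}
```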
>>>>>> On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson <> wrote:
>>>>>>> Thanks Erick.  If I did not know up front the token that could be in
>>>>>>> the index, is there not an efficient way to get the field for a
>>>>>>> specific document and do some custom processing on it?
>>>>>>> On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson <> wrote:
>>>>>>>> Start here I think:
>>>>>>>> Best
>>>>>>>> Erick
>>>>>>>> On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson <> wrote:
>>>>>>>>> Thanks for the reply.  The fields I want are indexed, but how would I
>>>>>>>>> go directly at the fields I wanted?
>>>>>>>>> In regards to indexing the auth tokens, I've thought about this and am
>>>>>>>>> trying to get confirmation if that is reasonable given
>>>>>>>>> constraints.
>>>>>>>>> On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson <> wrote:
>>>>>>>>>> Yeah, loading the document inside a Collector is a
>>>>>>>>>> definite no-no. Have you tried going directly
>>>>>>>>>> at the fields you want (assuming they're
>>>>>>>>>> indexed)? That *should* be much faster, but
>>>>>>>>>> whether it'll be fast enough is a good question.
>>>>>>>>>> I'm thinking of some of the Terms methods here. You
>>>>>>>>>> *might* get some joy out of making sure lazy
>>>>>>>>>> field loading is enabled (and make sure the
>>>>>>>>>> fields you're accessing for your logic are
>>>>>>>>>> indexed), but I'm not entirely sure about
>>>>>>>>>> that bit.
>>>>>>>>>> This kind of problem is sometimes handled
>>>>>>>>>> by indexing "auth tokens" with the documents
>>>>>>>>>> and including an OR clause on the query
>>>>>>>>>> with the authorizations for a particular
>>>>>>>>>> user, but that works best if there is an upper
>>>>>>>>>> limit (in the 100s) of tokens that a user can possibly
>>>>>>>>>> have; often this works best with some kind of
>>>>>>>>>> grouping. Making this work when a user can
>>>>>>>>>> have tens of thousands of auth tokens is
>>>>>>>>>> contra-indicated...
>>>>>>>>>> Hope this helps a bit...
>>>>>>>>>> Erick
>>>>>>>>>> On Sun, Aug 28, 2011 at 11:59 PM, Jamie Johnson <> wrote:
>>>>>>>>>>> Just a bit more information.  Inside my class which extends
>>>>>>>>>>> FilteredDocIdSet, all of the time seems to be getting spent in
>>>>>>>>>>> retrieving the document from the readerCtx, doing
>>>>>>>>>>>     Document doc = readerCtx.reader.document(docid);
>>>>>>>>>>> If I comment this out and just return true, things fly along as I
>>>>>>>>>>> expect.  My query is also returning a total of 2 million documents.
>>>>>>>>>>> On Sun, Aug 28, 2011 at 11:39 AM, Jamie Johnson <> wrote:
>>>>>>>>>>>> I have a need to post-process Solr results based on some access
>>>>>>>>>>>> controls which are set up outside of Solr.  Currently we've written
>>>>>>>>>>>> something that extends SearchComponent, and in the prepare method I'm
>>>>>>>>>>>> doing something like this:
>>>>>>>>>>>>     QueryWrapperFilter qwf = new QueryWrapperFilter(rb.getQuery());
>>>>>>>>>>>>     Filter filter = new CustomFilter(qwf);
>>>>>>>>>>>>     FilteredQuery fq = new FilteredQuery(rb.getQuery(), filter);
>>>>>>>>>>>>     rb.setQuery(fq);
>>>>>>>>>>>> Inside my CustomFilter I have a FilteredDocIdSet which checks if the
>>>>>>>>>>>> document should be returned.  This works as I expect but for some
>>>>>>>>>>>> reason is very, very slow.  Even if I take out any of the machinery
>>>>>>>>>>>> which does any logic with the document and only return true in the
>>>>>>>>>>>> FilteredDocIdSet's match method, the query still takes an inordinate
>>>>>>>>>>>> amount of time as compared to not including this custom filter.  So my
>>>>>>>>>>>> question: is this the most appropriate way of handling this?  What
>>>>>>>>>>>> should the performance of such a setup be expected to be?  Any
>>>>>>>>>>>> information/pointers would be greatly appreciated.
