lucene-solr-user mailing list archives

From Umesh Prasad <umesh.i...@gmail.com>
Subject Re: Peronalized Search Results or Matching Documents to Users
Date Sat, 01 Aug 2015 04:41:07 GMT
Building on Upayavira's comment: this is a join problem.

There are two more approaches:

1. Solr PostFilter-based approach. A sample filter can be found here
<http://qaware.blogspot.in/2014/11/how-to-write-postfilter-for-solr-49.html>.
At a very basic level, it injects an interceptor collector into Lucene. This
interceptor collector (called a delegating collector) runs before the
final collector (which can be a DocSetCollector or a TopDocsCollector). You
can inject any number of them, and their ordering is controlled by their cost.
    collect(docId) can apply its own logic (such as consulting a data
source: an in-memory map or array, or even an external data source).
You can inject this post filter in the Document Index. It relies on two lookups:
   lucene docid --> documentKey  (resolved using the index)
   documentKey --> <userId1, userId2, userId3>  (in any data structure,
as long as retrieval is extremely efficient)
   The filter, or delegating collector
<http://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/search/DelegatingCollector.html>,
knows which userId(s) it should allow in the results (they come from the
request). If the user has permission for the document, it calls
delegate.collect. Otherwise it swallows the docId, which effectively filters
the document out.
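
The collector chain described above can be sketched roughly as follows. This is a plain-Java illustration only: DocCollector, ListCollector, PermissionFilterCollector and PostFilterSketch are hypothetical stand-ins, not Solr or Lucene classes, and the sample data (documents A, B, C and two users) is taken from the original question further down the thread. A real PostFilter would extend Solr's DelegatingCollector instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a collector contract; NOT the real Lucene/Solr API.
interface DocCollector {
    void collect(int docId);
}

// Plays the role of the final collector: records every doc id that reaches it.
class ListCollector implements DocCollector {
    final List<Integer> hits = new ArrayList<>();

    @Override
    public void collect(int docId) {
        hits.add(docId);
    }
}

// Interceptor: forwards a doc id to the delegate only if the user may see it.
class PermissionFilterCollector implements DocCollector {
    private final DocCollector delegate;
    private final String[] docIdToKey;            // lucene docid --> documentKey
    private final Map<String, Set<String>> acl;   // documentKey --> allowed userIds
    private final String userId;                  // comes from the request

    PermissionFilterCollector(DocCollector delegate, String[] docIdToKey,
                              Map<String, Set<String>> acl, String userId) {
        this.delegate = delegate;
        this.docIdToKey = docIdToKey;
        this.acl = acl;
        this.userId = userId;
    }

    @Override
    public void collect(int docId) {
        Set<String> allowed = acl.get(docIdToKey[docId]);
        if (allowed != null && allowed.contains(userId)) {
            delegate.collect(docId);   // permitted: pass it down the chain
        }
        // otherwise the doc id is dropped, which is the filtering step
    }
}

class PostFilterSketch {
    // Pretends that every document matched the query, then filters by user.
    static List<Integer> search(String userId) {
        String[] docIdToKey = {"A", "B", "C"};
        Map<String, Set<String>> acl = new HashMap<>();
        acl.put("A", new HashSet<>(Arrays.asList("user1")));
        acl.put("B", new HashSet<>(Arrays.asList("user2")));
        acl.put("C", new HashSet<>(Arrays.asList("user1", "user2")));

        ListCollector finalCollector = new ListCollector();
        DocCollector chain =
            new PermissionFilterCollector(finalCollector, docIdToKey, acl, userId);
        for (int docId = 0; docId < docIdToKey.length; docId++) {
            chain.collect(docId);
        }
        return finalCollector.hits;
    }
}
```

Calling PostFilterSketch.search("user1") keeps only the doc ids of A and C. Note that the per-document lookup runs once for every document that matched the query, so this fits best when the query itself is already selective.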

The documentKey --> <userId1, userId2, userId3> mapping can be stored as
DocValues or exposed through a ValueSource. If there are lots of users, it
would be more efficient to use a bitset where each userId is a bit.
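
A minimal sketch of the bitset idea, using java.util.BitSet. The class name UserBitsetAcl and the notion of a dense "user ordinal" (each userId mapped to a small integer) are assumptions made for illustration; how the bits would actually be stored alongside the index is not covered here.

```java
import java.util.BitSet;

// Hypothetical helper: one bitset per document, one bit per user ordinal.
class UserBitsetAcl {
    private final BitSet[] allowedUsersByDoc;   // index: docid

    UserBitsetAcl(int numDocs, int numUsers) {
        allowedUsersByDoc = new BitSet[numDocs];
        for (int i = 0; i < numDocs; i++) {
            allowedUsersByDoc[i] = new BitSet(numUsers);
        }
    }

    // Marks the user (by ordinal) as allowed to see the document.
    void grant(int docId, int userOrdinal) {
        allowedUsersByDoc[docId].set(userOrdinal);
    }

    // Single-user check: one bit lookup per collected document.
    boolean isAllowed(int docId, int userOrdinal) {
        return allowedUsersByDoc[docId].get(userOrdinal);
    }

    // Multi-user check: true if any of the requested users may see the document.
    boolean isAllowedForAny(int docId, BitSet requestedUsers) {
        return allowedUsersByDoc[docId].intersects(requestedUsers);
    }
}
```

A delegating collector could call isAllowed(docId, userOrdinal) inside collect(docId) instead of consulting a map of userId sets.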

If the userId permissions are highly selective (say they exclude 99% of
documents or even more), then it would be much more efficient to apply the
filter while finding the matching set itself.

That brings us to the 2nd approach.

2. Custom Solr FieldType
<http://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/schema/FieldType.html>.
You will need to implement getFieldQuery
<http://lucene.apache.org/solr/5_2_0/solr-core/org/apache/solr/schema/FieldType.html#getFieldQuery(org.apache.solr.search.QParser,
org.apache.solr.schema.SchemaField, java.lang.String)>, which allows you to
inject your custom Query object into Lucene. externalValue will be the
userId, SchemaField will hold the fieldName, and you can safely ignore
QParser.
   This approach requires some expertise with Lucene's APIs and
segments, but it is the most flexible and the more scalable of the two. The
Query itself can be a FilteredQuery
<https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/FilteredQuery.html>,
and the Filter
<https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/Filter.html>
has to be aware of the inverted index. Basically, given a userId, it has to
return the posting list (a DocIdSet in Lucene), and that mapping has to be
managed by your custom code. This is essentially what a search-time join
does: it resolves userId --> <documentKey> using the "Permissions Index" and
then searches the "Document Index" using these documentKeys as explicit
filters.
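
The userId --> posting list resolution can be sketched in plain Java as below. PermissionJoinSketch and its methods are hypothetical names, the two maps merely stand in for the "Permissions Index" and the "Document Index", and doc ids are treated as global. A real Lucene Filter would have to produce a DocIdSet per segment, which this sketch does not attempt.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a search-time join; not Lucene's Filter API.
class PermissionJoinSketch {
    // Stand-in for the "Permissions Index": userId --> documentKeys.
    private final Map<String, Set<String>> permissionsIndex = new HashMap<>();
    // Stand-in for the "Document Index": documentKey --> docid.
    private final Map<String, Integer> documentIndex = new HashMap<>();

    void addDocument(String documentKey, int docId) {
        documentIndex.put(documentKey, docId);
    }

    void grant(String userId, String documentKey) {
        permissionsIndex.computeIfAbsent(userId, k -> new HashSet<>()).add(documentKey);
    }

    // Resolves userId --> documentKeys --> set of doc ids (the "posting list").
    BitSet allowedDocs(String userId) {
        BitSet allowed = new BitSet();
        Set<String> keys =
            permissionsIndex.getOrDefault(userId, Collections.<String>emptySet());
        for (String key : keys) {
            Integer docId = documentIndex.get(key);
            if (docId != null) {
                allowed.set(docId);
            }
        }
        return allowed;
    }

    // Intersects the query's matches with the user's allowed documents.
    BitSet search(BitSet queryMatches, String userId) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(allowedDocs(userId));
        return result;
    }
}
```

Unlike the post filter, the permission set here is computed once per request and intersected with the matches, rather than being checked document by document during collection.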


PS : We have used both of the above approaches.

You can also look at the side_car index approach.




On 1 August 2015 at 02:29, Upayavira <uv@odoko.co.uk> wrote:

> How soon? And will you be able to use them for querying, or just
> faceting/sorting/displaying?
>
> Thx!
>
> Upayavira
>
> On Fri, Jul 31, 2015, at 09:27 PM, Erick Erickson wrote:
> > And coming soon will be docvalues field updates that don't require
> > reindexing the whole doc.
> >
> > Best,
> > Erick
> > On Jul 31, 2015 6:51 AM, "Upayavira" <uv@odoko.co.uk> wrote:
> >
> > > On Thu, Jul 30, 2015, at 07:29 PM, Shawn Heisey wrote:
> > > > On 7/30/2015 10:46 AM, Robert Farrior wrote:
> > > > > We have a requirement to be able to have a master product catalog and to
> > > > > create a sub-catalog of products per user. This means I may have 10,000
> > > > > users who each create their own list of documents. This is a simple mapping
> > > > > of user to documents. The full data about the documents would be in the main
> > > > > catalog.
> > > > >
> > > > > What approaches would allow Solr to only return the results that are in the
> > > > > user's list?  It seems like I would need a couple of steps in the process.
> > > > > In other words, the main catalog has 3 documents: A, B and C. I have 2
> > > > > users. User 1 has access to documents A and C but not B. User 2 has access
> > > > > to documents C and B but not A.
> > > > >
> > > > > When a user searches, I want to only return documents that the user has
> > > > > access to.
> > > >
> > > > A common approach for Solr would be to have a multivalued "user" field
> > > > on each document, which has individual values for each user that can
> > > > access the document.  When you index the document, you include values
> > > > in this field listing all the users that can access that document.
> > > >
> > > > Then you simply filter by user:
> > > >
> > > > fq=user:joe
> > > >
> > > > This is EXTREMELY efficient at query time, especially when the number of
> > > > users is much smaller than the number of documents.  It may complicate
> > > > indexing somewhat, but indexing is an extremely custom operation that
> > > > users have to write themselves, so it probably won't be horrible.
> > >
> > > Things to consider:
> > >
> > >  * How often are documents assigned to new users?
> > >  * How many documents does a user typically have?
> > >  * Do you have a 'trigger' in your app that tells you a user has been
> > >    assigned a new doc?
> > >
> > > You can use a pseudo join to implement this sort of thing - have a
> > > different core that contains the 'permissions', either a document that
> > > says "this document ID is accessible via these users" or "this user is
> > > allowed to see these document IDs". You are keeping your fast moving
> > > (authorization) data separate from your slow moving (the docs
> > > themselves) data.
> > >
> > > You can then say "find me all documents that are accessible via user X"
> > >
> > > Upayavira
> > >
>



-- 
Thanks & Regards
Umesh Prasad
Tech Lead @ flipkart.com

 in.linkedin.com/pub/umesh-prasad/6/5bb/580/
