lucene-solr-user mailing list archives

From Mike Klaas <>
Subject Re: access control list
Date Thu, 01 May 2008 02:04:15 GMT

On 30-Apr-08, at 5:31 PM, Kevin Osborn wrote:

> I have an index of about 3,000,000 products and about 8,500
> customers. Each customer has access to between 50 and 500,000 of
> the products.
> Our current method uses a bitset in the filter: each customer has a
> bitset in the cache, with a bit set for each docId they have access
> to. This is probably the best performance-wise for searches, but it
> consumes a lot of memory, especially because each document a
> customer does *not* have access to also consumes space (a 0). It is
> also probably the cause of our problems when either these customer
> access lists (stored in files) or the index is updated.
> Is there a better way to manage access control? I was thinking of  
> storing the user access list as a specific document type in the  
> index. Basically, a single multi-value field. But I'm not quite sure  
> where to go from here.
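To put rough numbers on the memory cost described above (a back-of-the-envelope sketch using the figures from the post; it ignores JVM object overhead and cache bookkeeping):

```python
# Rough memory estimate for the per-customer bitset approach:
# one bit per document, per customer, whether or not they have access.
NUM_DOCS = 3_000_000       # products in the index
NUM_CUSTOMERS = 8_500

bits_per_customer = NUM_DOCS
bytes_per_customer = bits_per_customer // 8
total_bytes = bytes_per_customer * NUM_CUSTOMERS

print(f"per customer: {bytes_per_customer / 1024:.0f} KiB")   # ~366 KiB
print(f"all customers: {total_bytes / 2**30:.1f} GiB")        # ~3.0 GiB
```

So fully populating the filter cache for all customers runs to roughly 3 GiB of bitsets alone, which matches the memory pressure described.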

The best way to go about this is to refactor the problem into the true
constraints that exist.  It is unlikely that ~2,125,000,000 customer-
product pairs were created by hand.  Surely they resulted from some
less fine-grained grouping of access control.  Could those groups be
the filters you use?
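As a toy illustration of the group idea (all names and numbers here are hypothetical): cache one filter per access *group* rather than per customer, and OR the customer's groups together at query time. Python ints stand in for doc bitsets:

```python
# Hypothetical group-based filtering: cache a bitset per access group
# (shared by many customers) and combine per customer at query time.
group_docs = {                 # group -> toy bitset of accessible docs
    "wholesale": 0b10110,
    "retail":    0b01100,
    "clearance": 0b00011,
}
customer_groups = {            # customer -> groups they belong to
    "acme": ["wholesale", "clearance"],
    "zeta": ["retail"],
}

def customer_filter(customer):
    """Bitwise OR (union) of the customer's group bitsets."""
    f = 0
    for g in customer_groups[customer]:
        f |= group_docs[g]
    return f

print(bin(customer_filter("acme")))   # 0b10111
```

With, say, a few hundred groups instead of 8,500 customers, the number of cached bitsets (and the churn when access lists change) drops by an order of magnitude or two.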

Another option is to look for ways to transform the data based on its
intrinsic characteristics.  Even if there are no longer explicit
access-control categories you can leverage, you can look for groups of
documents that many users share access to, or large groups of docs
that few users have access to, and compose a single query's filter out
of such groups.  This is probably pretty hard.  A simpler application
of the idea is to look for a partitioning of the documents in which
few users with access to one set also have access to the other.  Put
the two sets in separate Solr instances/cores.  Assuming a perfect
partitioning, that halves memory consumption.
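The halving falls out of the arithmetic (a sketch assuming a perfectly even split in which each customer's docs live entirely in one core):

```python
# Partitioning sketch: if each customer only needs access to docs in
# one of two cores, their cached bitset covers that core's docs only.
NUM_DOCS = 3_000_000
NUM_CUSTOMERS = 8_500

single_core = NUM_CUSTOMERS * NUM_DOCS // 8          # bytes, one big index
two_cores = NUM_CUSTOMERS * (NUM_DOCS // 2) // 8     # each customer in one core

print(two_cores / single_core)   # 0.5
```

Real partitions won't be perfect, so the savings land somewhere between none and half, depending on how cleanly customer access splits across the cores.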

Also consider that currently, filters matching fewer than 3000 docs
are stored as hashed sets of doc ids (size proportional to the number
of docs *in the set*) rather than bitsets (size proportional to the
number of docs in the index), and thus consume far less memory for
small access lists.  The threshold is configurable (but don't set it
too high).
