lucene-solr-user mailing list archives

From Roman Chyla <roman.ch...@gmail.com>
Subject Re: How to use BitDocSet within a PostFilter
Date Mon, 03 Aug 2015 14:30:55 GMT
Hi,
inStockSkusBitSet.get(currentChildDocNumber)

Is that child doc number a Lucene doc ID? If yes, does it include the
segment offset? Every index segment starts at a different base, but docs
within a segment are numbered from zero. So to check them against the
full-index bitset, I'd be doing
Bitset.exists(indexBase + docid)
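
To make the offset point concrete, here is a minimal, self-contained sketch. It uses java.util.BitSet as a stand-in for the index-wide bitset, and it assumes a toy index with two segments of five docs each. The class name DocBaseDemo, the helper existsGlobal, and the docBases array are all hypothetical, made up for illustration; in Lucene the per-segment offset is exposed as LeafReaderContext.docBase.

```java
import java.util.BitSet;

public class DocBaseDemo {

    /**
     * Hypothetical helper: tests a segment-local doc ID against a bitset
     * that is indexed by global (index-wide) doc IDs.
     */
    public static boolean existsGlobal(BitSet indexWideBits, int docBase, int segmentDocId) {
        return indexWideBits.get(docBase + segmentDocId);
    }

    public static void main(String[] args) {
        // Assumed layout: segment 0 holds global docs 0..4, segment 1 holds 5..9.
        int[] docBases = {0, 5};

        // Index-wide set of matching docs, keyed by global doc ID.
        BitSet inStock = new BitSet(10);
        inStock.set(2); // segment 0, local doc 2
        inStock.set(7); // segment 1, local doc 2
        inStock.set(9); // segment 1, local doc 4

        int localDoc = 4; // a doc in segment 1, i.e. global doc 9

        // Using the local ID directly tests global doc 4, which is the wrong doc.
        System.out.println(inStock.get(localDoc)); // prints "false"

        // Adding the segment's base tests global doc 9, the intended doc.
        System.out.println(existsGlobal(inStock, docBases[1], localDoc)); // prints "true"
    }
}
```

If the doc number handed to collect is segment-local, a lookup without the base would produce exactly the symptom described below: some matching docs are missed and some non-matching docs are let through.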

Just one thing to check

Roman
On Aug 3, 2015 1:24 AM, "Stephen Weiss" <Steve.Weiss@wgsn.com> wrote:

> Hi everyone,
>
> I'm trying to write a PostFilter for Solr 5.1.0, which is meant to crawl
> through grandchild documents during a search through the parents and filter
> out documents based on statistics gathered from aggregating the
> grandchildren together.  I've been successful in getting the logic correct,
> but it does not perform so well - I'm grabbing too many documents from the
> index along the way.  I'm trying to filter out grandchild documents which
> are not relevant to the statistics I'm collecting, in order to reduce the
> number of document objects pulled from the IndexReader.
>
> I've implemented the following code in my DelegatingCollector.collect:
>
> if (inStockSkusBitSet == null) {
>     // Type cast from IndexSearcher to expose getDocSet.
>     SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS;
>     inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
>     // Type cast from DocSet to expose getBits.
>     inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet;
>     inStockSkusBitSet = inStockSkusBitDocSet.getBits();
> }
>
>
> My BitDocSet reports a size which matches a standard query for the more
> limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) also
> reports this same cardinality.  Based on that fact, it seems that the
> getDocSet call itself must be working properly, and returning the right
> number of documents.  However, when I try to filter out grandchild
> documents using either BitDocSet.exists or BitSet.get (passing over any
> grandchild document which doesn't exist in the BitDocSet or doesn't return
> true from the bitset), I get about a third fewer results than I'm supposed
> to.  It seems many documents that should match the filter are being
> excluded, and documents which should not match the filter are being included.
>
> I'm trying to use it either of these ways:
>
> if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
> if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;
>
> The currentChildDocNumber is simply the docNumber which is passed to
> DelegatingCollector.collect, decremented until I hit a document that
> doesn't belong to the parent document.
>
> I can't seem to figure out a way to actually use the BitDocSet (or its
> derivatives) to quickly eliminate document IDs.  It seems like this is how
> it's supposed to be used.  What am I getting wrong?
>
> Sorry if this is a newbie question, I've never written a PostFilter
> before, and frankly, the documentation out there is a little sketchy
> (mostly for version 4) - so many classes have changed names and so many of
> the more well-documented techniques are deprecated or removed now, it's
> tough to follow what the current best practice actually is.  I'm using the
> block join functionality heavily so I'm trying to keep more current than
> that.  I would be happy to send along the full source privately if it would
> help figure this out, and plan to write up some more elaborate instructions
> (updated for Solr 5) for the next person who decides to write a PostFilter
> and work with block joins, if I ever manage to get this performing well
> enough.
>
> Thanks for any pointers!  Totally open to doing this an entirely different
> way.  I read DocValues might be a more elegant approach but currently that
> would require reindexing, so trying to avoid that.
>
> Also, I've been wondering if the query above would read from the filter
> cache or not.  The query is constructed like this:
>
>
>     private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
>     private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
> ...
>
> inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
> objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
> inStockSkusQuery = new BooleanQuery();
> inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
> inStockSkusQuery.add(objectTypeSkuHistoryTermQuery, BooleanClause.Occur.MUST);
> --
> Steve
