lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikhail Khludnev <m...@apache.org>
Subject Re: some parent documents
Date Wed, 04 Apr 2018 16:16:13 GMT
>
> What's happening under the hood of
> solr in answering query [1] from [2]?

https://github.com/apache/lucene-solr/blob/master/lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java#L178

On Wed, Apr 4, 2018 at 3:39 PM, Arturas Mazeika <mazeika@gmail.com> wrote:

> Hi Mikhail et al,
>
> Thanks a lot for a very thorough answer. This is an impressive piece of
> knowledge you just shared.
>
> Not surprisingly, I was caught unprepared by the 'v=...' part of the
> answer. This brought me to the links you posted (starts with http). From
> those links I went to the more updated link (starts with https), which
> brought me to other very resourceful links. Combined with some meditation
> session, it came into my mind that it is not possible to express block
> queries using mathematical logic only. The format of the input document is
> deeply built into the query expression and answering. Expressing these
> queries mathematically / logically may give an impression that solr is
> capable of answering (NP-?) hard problems. I have a feeling though that
> solr answers to queries in polynomial (or even almost linear) times.
>
> Just to connect the remaining dots.. What's happening under the hood of
> solr in answering query [1] from [2]? Is it really so that inverted index
> is used to identify the vectors of ids, that are scanned linearly in a hope
> to get matches on _root_ and other internal variables?
>
> [1] q=+{!parent which=type_s:product v=$skuq} +{!parent
> which=type_s:product v=$vendorq}&skuq=+COLOR_s:Blue +SIZE_s:XL +{!parent
> which=type_s:sku v='+QTY_i:[10 TO *] +STATE_s:CA'}&vendorq=+NAME_s:Bob
> +PRICE_i:[20 TO 25]
> [2]
> https://blog.griddynamics.com/searching-grandchildren-and-
> siblings-with-solr-block-join/
>
> Thanks!
> Arturas
>
> On Wed, Apr 4, 2018 at 12:36 PM, Mikhail Khludnev <mkhl@apache.org> wrote:
>
> > q=+{!parent which=ntype:p v='+msg:Hello +person:Arturas'} +{!parent
> which=
> > ntype:p v='+msg:ciao +person:Vai'}
> >
> > On Wed, Apr 4, 2018 at 12:19 PM, Arturas Mazeika <mazeika@gmail.com>
> > wrote:
> >
> > > Hi Mikhail et al,
> > >
> > > It seems to me that the nested documents must include nodes that encode
> > the
> > > level of nodes (within the document). Therefore, the minimal example
> must
> > > include the node type. Is the following structure sufficient?
> > >
> > > {
> > >     "id":1,
> > >     "ntype":"p",
> > >     "_childDocuments_":
> > >     [
> > >         {"id":"1_1", "ntype":"c", "person":"Vai",     "time":"3:14",
> > > "msg":"Hello"},
> > >         {"id":"1_2", "ntype":"c", "person":"Arturas", "time":"3:14",
> > > "msg":"Hello"},
> > >         {"id":"1_3", "ntype":"c", "person":"Vai",     "time":"3:15",
> > > "msg":"Coz Mathias is working on another system- different screen."},
> > >         {"id":"1_4", "ntype":"c", "person":"Vai",     "time":"3:15",
> > > "msg":"It can get annoying"},
> > >         {"id":"1_5", "ntype":"c", "person":"Arturas", "time":"3:15",
> > > "msg":"Thank you. this is very nice of you"},
> > >         {"id":"1_6", "ntype":"c", "person":"Vai",     "time":"3:16",
> > > "msg":"ciao"},
> > >         {"id":"1_7", "ntype":"c", "person":"Arturas", "time":"3:16",
> > > "msg":"ciao"}
> > >     ]
> > > },
> > > {
> > >     "id":2,
> > >     "ntype":"p",
> > >     "_childDocuments_":
> > >     [
> > >         {"id":"2_1", "ntype":"c", "person":"Vai",     "time":"4:14",
> > > "msg":"Hi"},
> > >         {"id":"2_2", "ntype":"c", "person":"Arturas", "time":"4:14",
> > > "msg":"IBM Watson"},
> > >         {"id":"2_3", "ntype":"c", "person":"Vai",     "time":"4:15",
> > > "msg":"need to retain content"},
> > >         {"id":"2_4", "ntype":"c", "person":"Vai",     "time":"4:15",
> > > "msg":"It can get annoying"},
> > >         {"id":"2_5", "ntype":"c", "person":"Arturas", "time":"4:15",
> > > "msg":"You can make all your meetings more access"},
> > >         {"id":"2_6", "ntype":"c", "person":"Vai",     "time":"4:16",
> > > "msg":"Make every meeting a Skype meeting"},
> > >         {"id":"2_7", "ntype":"c", "person":"Arturas", "time":"4:16",
> > > "msg":"ciao"}
> > >     ]
> > > }
> > >
> > > How would a query look like that has a Hello from Person Arturas and
> ciao
> > > from Person Vai?
> > >
> > > Cheers,
> > > Arturas
> > >
> > >
> > > On Tue, Apr 3, 2018 at 5:21 PM, Arturas Mazeika <mazeika@gmail.com>
> > wrote:
> > >
> > > > Hi Mikhail,
> > > >
> > > > Thanks a lot for the reply.
> > > >
> > > > You mentioned that
> > > >
> > > > q=+{!parent which.. v='+text:hello +person:A'} +{!parent
> > > > which..v='+text:ciao +person:B'}
> > > >
> > > > is the way to go. How would it look like precisely for the following
> > > > collection?
> > > >
> > > > {
> > > >     "id":1,
> > > >     "_childDocuments_":
> > > >     [
> > > >         {"id":"1_1", "person":"Vai"         , "time":"3:14",
> > > > "msg":"Hello"},
> > > >         {"id":"1_2", "person":"Arturas"     , "time":"3:14",
> > > > "msg":"Hello"},
> > > >         {"id":"1_3", "person":"Vai"         , "time":"3:15",
> "msg":"Coz
> > > > Mathias is working on another system- different screen."},
> > > >         {"id":"1_4", "person":"Vai"         , "time":"3:15",
> "msg":"It
> > > can
> > > > get annoying"},
> > > >         {"id":"1_5", "person":"Arturas"     , "time":"3:15",
> > "msg":"Thank
> > > > you. this is very nice of you"},
> > > >         {"id":"1_6", "person":"Vai"         , "time":"3:16",
> > > "msg":"ciao"},
> > > >         {"id":"1_7", "person":"Arturas"     , "time":"3:16",
> > > "msg":"ciao"}
> > > >     ]
> > > > },
> > > > {
> > > >     "id":2,
> > > >     "_childDocuments_":
> > > >     [
> > > >         {"id":"2_1", "person":"Vai"         , "time":"4:14",
> > > > "msg":"Hello"},
> > > >         {"id":"2_2", "person":"Arturas"     , "time":"4:14",
> "msg":"IBM
> > > > Watson"},
> > > >         {"id":"2_3", "person":"Vai"         , "time":"4:15",
> > "msg":"need
> > > > to retain content"},
> > > >         {"id":"2_4", "person":"Vai"         , "time":"4:15",
> "msg":"It
> > > can
> > > > get annoying"},
> > > >         {"id":"2_5", "person":"Arturas"     , "time":"4:15",
> "msg":"You
> > > > can make all your meetings more access"},
> > > >         {"id":"2_6", "person":"Vai"         , "time":"4:16",
> > "msg":"Make
> > > > every meeting a Skype meeting"},
> > > >         {"id":"2_7", "person":"Arturas"     , "time":"4:16",
> > > "msg":"ciao"}
> > > >     ]
> > > > }
> > > >
> > > > Cheers,
> > > > Arturas
> > > >
> > > >
> > > > On Tue, Apr 3, 2018 at 4:33 PM, Mikhail Khludnev <mkhl@apache.org>
> > > wrote:
> > > >
> > > >> Hello, Arturas.
> > > >>
> > > >> TLDR; Please find inline below.
> > > >>
> > > >> On Tue, Apr 3, 2018 at 5:14 PM, Arturas Mazeika <mazeika@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Solr Fans,
> > > >> >
> > > >> > I am trying to make sense of information retrieval using
> expressions
> > > >> like
> > > >> > "some parent", "*only parent*", " *all parent*". I am also trying
> to
> > > >> > understand the syntax "!parent which" and "!child of". On the
> > > technical
> > > >> > level, I am reading the following documents:
> > > >> >
> > > >> > [1]
> > > >> > https://lucene.apache.org/solr/guide/7_2/other-parsers.
> > > >> > html#block-join-query-parsers
> > > >> > [2]
> > > >> > https://lucene.apache.org/solr/guide/7_2/uploading-data-
> > > >> > with-index-handlers.html#nested-child-documents
> > > >> > [3] http://yonik.com/solr-nested-objects/
> > > >> >
> > > >> > and I am confused to read:
> > > >> >
> > > >> > This parser takes a query that matches some parent documents
and
> > > returns
> > > >> > their children. The syntax for this parser is: q={!child
> > > >> > of=<allParents>}<someParents>. The parameter allParents
is a
> filter
> > > that
> > > >> > matches *only parent documents*; here you would define the field
> and
> > > >> value
> > > >> > that you used to identify *all parent documents*. The parameter
> > > >> someParents
> > > >> > identifies a query that will match some of the parent documents.
> The
> > > >> output
> > > >> > is the children.
> > > >> >
> > > >> > The first sentence talks about "matching" but does not define
what
> > > that
> > > >> > means (and why it is only some parents matching?). The second
> > sentence
> > > >> > introduces a syntax of the parser, but blurs the understanding
as
> > > "some"
> > > >> > and "all" of parents are combined into one sentence. My
> > understanding
> > > is
> > > >> > that all documents are retrieve that satisfy a query. The query
> must
> > > >> > express some constraints on the parent node and some on the child
> > > node.
> > > >> I
> > > >> > have a feeling that "only parent documents" reads "criteria is
> > > >> formulated
> > > >> > over the parent part of {parent document}->{child document}
of
> > entity.
> > > >> > My simplified conceptual world of solr looks in the following
way:
> > > >> >
> > > >> > 1. Every document has an ID.
> > > >> > 2. Every document may have additional attributes
> > > >> > 3. Text attributes is what's at stake in solr. Sure we can search
> > for
> > > >> > products that costs at most X, but this is the added
> functionality.
> > > For
> > > >> > simplicity I am neglecting those here.
> > > >> > 4. The user has an information need. She expresses it with
> > (key)words
> > > >> and
> > > >> > hopes to find matching documents. For simplicity, I am skipping
> all
> > > >> issues
> > > >> > related to the information presentation of the documents
> > > >> > 5. Analysis chain (and inverse index) are the key technologies
> solr
> > is
> > > >> > based upon. Once the chain-processing is applied, mathematical
> logic
> > > >> kicks
> > > >> > in, retrieving the documents (that are a set of processed,
> > normalized,
> > > >> > enriched tokens) matching the query (processed, normalized and
> > > enriched
> > > >> > tokens). Clearly, the logic function can be a fancy one (at least
> > one
> > > of
> > > >> > query token is in the document set of tokens, etc.), ranking
is
> used
> > > to
> > > >> > sort the results.
> > > >> > 6. A nested document concept is introduced in solr. It needs
to be
> > > >> uploaded
> > > >> > into the index structure using a specific handlers [2]. A nested
> > > >> documents
> > > >> > is a tree. A root may contain children documents, which may be
> > parents
> > > >> of
> > > >> > grandchildren documents.
> > > >> > 7. Querying nested documents is supported in the following manner:
> > > >> >     7.1 Child documents are return that satisfies {parent
> > > >> > document}->{document}
> > > >> >     7.2 Parent documents are return that satisfy
> {document}->{child
> > > >> > document}
> > > >> >
> > > >> > Would I be very wrong to have this conceptual picture?
> > > >> >
> > > >> > From this point, the situation is a bit bury in my head. At the
> > core,
> > > I
> > > >> do
> > > >> > not really understand what "a document" is anymore (since the
> > complete
> > > >> json
> > > >> > or xml, so is a sub-json and sub-xml are documents, every document
> > > must
> > > >> > have an ID, does that meant the the subdocuments must have and
ID
> > too,
> > > >> or
> > > >> > sub-ids are also fine?), how to formulate mathematical expressions
> > > over
> > > >> > documents and what it means that the document satisfies my
> (key)word
> > > >> query?
> > > >> > Can we define a document to be the largest entity of information
> > that
> > > >> does
> > > >> > not contain any other nested documents [4]? If this is defined
and
> > > >> > communicated like this already where can I find it? There is
a use
> > of
> > > >> the
> > > >> > clarification, as the concept of the document means different
> things
> > > in
> > > >> > different contexts (e.g., you can update only the "complete
> > document"
> > > in
> > > >> > the index vs. parent document, etc.).
> > > >> >
> > > >> > Is it possible to formulate what's going on using mathematical
> > logic?
> > > >> Can
> > > >> > one express something like
> > > >> >
> > > >> > { give documents d : d is a document, d is parent of document
c, d
> > > >> > satisfies logical criteria C1,....,CN, c satisfies logical
> criteria
> > > >> > C1',...,CM'}
> > > >> > { give documents c : c is a document, d is parent of document
c, d
> > > >> > satisfies logical criteria C1,....,CN, c satisfies logical
> criteria
> > > >> > C1',...,CM'}
> > > >> >
> > > >> > here the meaning of document is as in definition [4] above.
> > > >> >
> > > >> > 1. Is it possible to retrieve all parent documents that have
two
> > > >> children
> > > >> > c1 and c2? Consider a document that is a skype chat, and children
> > are
> > > >> > individual lines of communication in the chat. I would be looking
> > for
> > > >> the
> > > >> > (parent) documents that have "hello" said by person A and "ciao"
> > said
> > > by
> > > >> > person B (as two different sub-documents).
> > > >> >
> > > >>
> > > >> q=+{!parent which.. v='+text:hello +person:A'} +{!parent which..
> > > >> v='+text:ciao +person:B'}
> > > >> The query syntax is really tricky and cumbersome.
> > > >>
> > > >>
> > > >> >
> > > >> > 2. Is it possible to search for documents such that they have
a
> > > >> grandchild
> > > >> > and the grandchild has the word "hello"?
> > > >> >
> > > >>
> > > >> http://blog-archive.griddynamics.com/2013/12/grandchildren-
> > > >> and-siblings-with-block.html
> > > >>
> > > >>
> > > >> >
> > > >> > 3. Is it possible to search for documents that do not have
> children?
> > > >> >
> > > >> q=-{!parent which..}type:child
> > > >> Beware that mixing parents and childfree products is not supported
> and
> > > >> causes pain. as a workaround you need to put empty child placeholder
> > > doc.
> > > >> Sic. Sorry.
> > > >>
> > > >>
> > > >> > Is this the right venue to discuss documentation of solr?
> > > >> >
> > > >> > Thanks!
> > > >> > Arturas
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Sincerely yours
> > > >> Mikhail Khludnev
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>



-- 
Sincerely yours
Mikhail Khludnev

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message