lucene-solr-user mailing list archives

From Alisa Z. <prol...@mail.ru>
Subject Re[4]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
Date Mon, 02 May 2016 22:54:11 GMT
>>You could add a "level2_comment_id" field to the level 2 comments and
>>their children, and then use unique() on that.

OK, I see, I missed the children... Thank you for pointing that out.

I have introduced that "unique sub-branch identifying" field and propagated it down the sub-branch
(the data is here: https://github.com/alisa-ipn/solr_nesting/blob/master/data/example-data-solr-for-faceting.json).
I also changed the corresponding part of the post.
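
Roughly, the propagation step can be sketched as below. This is only an illustration under
assumptions: the nested input uses Solr's "_childDocuments_" key, every "path" value starts
with its nesting depth as in this thread, and the helper name propagate_branch_id, the toy
document, and all ids other than 10735-23004 are hypothetical. It is not the script that
produced the data set linked above.

```python
def propagate_branch_id(doc, level=2, inherited=None):
    """Copy the id of every document at `level` onto itself and all of its
    descendants, in a field named '<path>-id' (e.g. '2.blog-posts.comments-id')."""
    path = doc.get("path", "")
    head = path.split(".", 1)[0]  # leading depth marker, e.g. "2"
    if head.isdigit() and int(head) == level:
        # This document starts a new sub-branch: remember its field name and id.
        inherited = (path + "-id", doc["id"])
    if inherited is not None:
        field, value = inherited
        doc[field] = value
    for child in doc.get("_childDocuments_", []):
        propagate_branch_id(child, level, inherited)
    return doc


# Hypothetical toy post: one comment with a keyword, plus a reply with a keyword.
post = propagate_branch_id({
    "id": "10735", "path": "1.blog-posts",
    "_childDocuments_": [{
        "id": "10735-23004", "path": "2.blog-posts.comments",
        "text": "Great post about Solr",
        "_childDocuments_": [
            {"id": "10735-23004-k1", "path": "3.blog-posts.comments.keywords",
             "text": "Solr"},
            {"id": "10735-23004-r1", "path": "3.blog-posts.comments.replies",
             "_childDocuments_": [
                 {"id": "10735-23004-r1-k1",
                  "path": "4.blog-posts.comments.replies.keywords",
                  "text": "Solr"}]},
        ],
    }],
})
```

After this pass the level 2 comment and everything below it carry the same
"2.blog-posts.comments-id" value, while the level 1 post is left untouched.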

And it actually works. Still, it takes a lot of effort to make the JSON Facet API facet
by intermediate levels.
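
For reference, the request from earlier in this thread can be assembled programmatically, as
sketched below. The assumptions: the propagated field keeps the name
"2.blog-posts.comments-id" used earlier in the thread (the field in the linked data set may
be named differently), and the function name build_facet_params is hypothetical. The sketch
only builds the parameters and does not contact Solr.

```python
import json
from urllib.parse import urlencode

# Assumed field name, taken from the earlier message in this thread.
BRANCH_ID_FIELD = "2.blog-posts.comments-id"


def build_facet_params(branch_id_field=BRANCH_ID_FIELD, limit=10):
    """Build query parameters that count keyword values per level 2 comment."""
    facet = {
        "filter_by_child_type": {
            "type": "query",
            "q": "path:*comments*keywords",
            # Map from the level 2 comments to all of their descendants.
            "domain": {"blockChildren": "path:2.blog-posts.comments"},
            "facet": {
                "top_entity_text": {
                    "type": "terms",
                    "field": "text",
                    "limit": limit,
                    "sort": "counts_by_comments desc",
                    "facet": {
                        # Distinct sub-branch ids per bucket = distinct comments.
                        "counts_by_comments": "unique(%s)" % branch_id_field
                    },
                }
            },
        }
    }
    return {
        "q": "path:2.blog-posts.comments",
        "rows": 0,
        "json.facet": json.dumps(facet),
    }


if __name__ == "__main__":
    # Prints a URL-encoded body that could be passed to curl with -d.
    print(urlencode(build_facet_params()))
```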

Making those "unique sub-branch identifying" fields appear automatically, the same way the
"_root_" field does, would make Solr friendlier to use for nested data such as email chains and
social media data...

Thanks,
Alisa 

>Friday, 22 April 2016, 13:47 -04:00 from Yonik Seeley <yseeley@gmail.com>:
>
>On Fri, Apr 22, 2016 at 12:26 PM, Alisa Z. < proloxx@mail.ru > wrote:
>>  Hi Yonik,
>>
>> Thanks a lot for your response.
>>
>> I have discussed this with Mikhail Khludnev already and tried this suggestion. Here's what I've got:
>>
>>
>>
>> sentiment: positive
>> author: Bob
>> text: Great post about Solr
>> 2.blog-posts.comments-id: 10735-23004                           //this is a new field, field name is different on each level for each type, values are unique
>> date: 2015-04-10T11:30:00Z
>> path: 2.blog-posts.comments
>> id: 10735-23004
>> Query:
>> curl http://localhost:8985/solr/solr_nesting_unique/query -d 'q=path:2.blog-posts.comments&rows=0&
>> json.facet={
>>   filter_by_child_type :{
>>     type:query,
>>     q:"path:*comments*keywords",
>>     domain: { blockChildren : "path:2.blog-posts.comments" },
>>     facet:{
>>       top_entity_text : {
>>         type: terms,
>>         field: text,
>>         limit: 10,
>>         sort: "counts_by_comments desc",
>>         facet: {
>>            counts_by_comments: "unique (2.blog-posts.comments-id )"   // changed
>>          }}}}}'
>
>
>Something is wrong if you are getting 0 counts.
>Let's try taking it piece-by-piece:
>
>Step 1:  q=path:2.blog-posts.comments
>This finds level 2 documents
>
>Step 2:  domain: { blockChildren : "path:2.blog-posts.comments" }
>This first maps to all of the children (level 3 and level 4)
>
>Step 3:  q:"path:*comments*keywords"
>This selects a subset of level3 and level4 documents with keywords
>(Note, in the future this should be doable as an additional filter in
>the domain spec, w/o an additional sub-facet level)
>
>Step 4:
>Facet on the text field of those level3 and level4 keyword docs. For
>each bucket, also find the unique number of values in the
>"2.blog-posts.comments-id" field on those documents.
>
>Without seeing what you indexed, my guess is that the issue is that
>the "2.blog-posts.comments-id" field does not actually exist on those
>level3 and level4 docs being faceted.  The JSON Facet API doesn't
>propagate field values up/down the nested stack yet.  That's what
>https://issues.apache.org/jira/browse/SOLR-8998 is mostly about.
>
>-Yonik
>
>
>>
>> Response:
>>
>> "response":{"numFound":3,"start":0,"docs":[]
>>   },
>>   "facets":{
>>     "count":3,
>>     "filter_by_child_type":{
>>       "count":9,
>>       "top_entity_text":{
>>         "buckets":[{
>>             "val":"Elasticsearch",
>>             "count":2,
>>             "counts_by_comments":0},
>>           {
>>             "val":"Solr",
>>             "count":5,
>>             "counts_by_comments":0},
>>           {
>>             "val":"Solr 5.5",
>>             "count":1,
>>             "counts_by_comments":0},
>>           {
>>             "val":"feature",
>>             "count":1,
>>             "counts_by_comments":0}]}}}}
>>
>> So unless I messed something up... or the field name does not look "canonical" (but it was fast to generate and it is accepted in a normal query
>>  http://localhost:8985/solr/solr_nesting_unique/query?q=2.blog-posts.body-id :* )
>>
>> So I think that it's just a JSON facet API limitation...
>>
>> Best,
>> --Alisa
>>
>>
>>>Friday, 22 April 2016, 9:55 -04:00 from Yonik Seeley < yseeley@gmail.com >:
>>>
>>>Hi Alisa,
>>>This was a bit too hard for me to grok on a first pass... then I saw
>>>your related blog post which includes the actual sample data and makes
>>>it more clear.
>>>
>>> More comments inline:
>>>
>>>On Wed, Apr 20, 2016 at 2:29 PM, Alisa Z. <  proloxx@mail.ru > wrote:
>>>>  Hi all,
>>>>
>>>> I have been stretching some of Solr's capabilities for handling nested documents and I've come up with the following issue...
>>>>
>>>> Let's say I have the following structure:
>>>>
>>>> {
>>>> "blog-posts":{                      //level 1
>>>>     "leaf-fields":[
>>>>         "date",
>>>>         "author"],
>>>>     "title":{                       //level 2
>>>>         "leaf-fields":[ "text"],
>>>>         "keywords":{                //level 3
>>>>             "leaf-fields":[
>>>>                 "text",
>>>>                 "type"]
>>>>             }
>>>>         },
>>>>     "body":{                        //level 2
>>>>         "leaf-fields":[ "text"],
>>>>         "keywords":{                //level 3
>>>>             "leaf-fields":[
>>>>                 "text",
>>>>                 "type"]
>>>>             }
>>>>         },
>>>>     "comments":{                    //level 2
>>>>         "leaf-fields":[
>>>>             "date",
>>>>             "author",
>>>>             "text",
>>>>             "sentiment"
>>>>             ],
>>>>         "keywords":{                //level 3
>>>>             "leaf-fields":[
>>>>                 "text",
>>>>                 "type"]
>>>>             },
>>>>         "replies":{                 //level 3
>>>>             "leaf-fields":[
>>>>                 "date",
>>>>                 "author",
>>>>                 "text",
>>>>                 "sentiment"],
>>>>             "keywords":{            //level 4
>>>>                 "leaf-fields":[
>>>>                     "text",
>>>>                     "type"]
>>>>                 }}}}}
>>>>
>>>>
>>>> And I want to know the distribution of all readers' keywords (levels 3 and 4) by comments (level 2).
>>>> In JSON Facet API I tried this:
>>>>
>>>> curl http://localhost:8983/solr/my_index/query -d 'q=path:2.blog-posts.comments&rows=0&
>>>> json.facet={
>>>>   filter_by_child_type :{
>>>>     type:query,
>>>>     q:"path:*comments*keywords",
>>>>     domain: { blockChildren : "path:2.blog-posts.comments" },
>>>>     facet:{
>>>>       top_keywords : {
>>>>         type: terms,
>>>>         field: text,
>>>>         sort: "counts_by_comments desc",
>>>>         facet: {
>>>>            counts_by_comments: "unique(_root_)"    // I suspect it should be a different field, not _root_, but what would it be for an intermediate document?
>>>>          }}}}}'
>>>>
>>>> Which gives me the wrong results: it aggregates by posts, not by comments (it's a toy data set, so I know that the correct answer for "Solr" is 3 when faceted by comments)
>>>
>>>
>>>Yeah, this type of thing isn't currently directly supported, but
>>>SOLR-8998 should address that.
>>>You can currently hack around it (for simple counts) using unique(),
>>>as you've discovered, but you need a unique ID at the right level to
>>>get the right count.
>>>
>>>_root_ is unique for blog posts, hence that's why you get numbers of
>>>posts (as opposed to numbers of level-2 comments).
>>>You could add a "level2_comment_id" field to the level 2 comments and
>>>their children, and then use unique() on that.
>>>
>>>-Yonik
>>>
>>>
>>>> {
>>>> "response":{"numFound":3,"start":0,"docs":[]
>>>>   },
>>>>   "facets":{
>>>>     "count":3,
>>>>     "filter_by_child_type":{
>>>>       "count":9,
>>>>       "top_keywords":{
>>>>         "buckets":[{
>>>>             "val":"Elasticsearch",
>>>>             "count":2,
>>>>             "counts_by_comments":2},
>>>>           {
>>>>             "val":"Solr",
>>>>             "count":5,
>>>>             "counts_by_comments":2},               //here the count by "comments" should be 3
>>>>           {
>>>>             "val":"Solr 5.5",
>>>>             "count":1,
>>>>             "counts_by_comments":1},
>>>>           {
>>>>             "val":"feature",
>>>>             "count":1,
>>>>             "counts_by_comments":1}]}}}}
>>>>
>>>>
>>>> Am I writing the query wrong?
>>>>
>>>>
>>>> By the way, Block Join Faceting works fine for this:
>>>> bjqfacet?q={!parent%20which=path:2.blog-posts.comments}path:*.comments*keywords&rows=0&facet=true&child.facet.field=text&wt=json&indent=true
>>>>
>>>> {
>>>>   "response":{"numFound":3,"start":0,"docs":[]
>>>>   },
>>>>   "facet_counts":{
>>>>     "facet_queries":{},
>>>>     "facet_fields":{
>>>>       "text":[
>>>>         "Elasticsearch",2,
>>>>         "Solr",3,                                  //correct result
>>>>         "Solr 5.5",1,
>>>>         "feature",1]},
>>>>     "facet_dates":{},
>>>>     "facet_ranges":{},
>>>>     "facet_intervals":{},
>>>>     "facet_heatmaps":{}}}
>>>>
>>>> But we've already discussed that it returns too much stuff: no way to put limits or order by counts :(  That's why I want to see whether it's possible to get this working with the JSON Facet API.
>>>>
>>>> Thank you in advance!
>>>>
>>>> --
>>>> Alisa Zhila
>>
