lucene-solr-user mailing list archives

From Andrea Gazzarini <a.gazzar...@sease.io>
Subject Re: SynonimGraphFilter expands wrong synonims
Date Fri, 07 Sep 2018 12:48:36 GMT
And as you probably already checked, inserting the proper 
*tokenizerFactory* also expands the right synonym line:

q = (body:"Cytosolic 5'-nucleotidase II"  OR body:"EC 3.1.3.5")

parsedQuery = SpanOrQuery(spanOr([body:p49902, spanNear([body:cytosol, 
body:purin, body:5, body:nucleotidas], 0, true), spanNear([body:ec, 
body:3.1.3.5], 0, true), spanNear([body:cytosol, body:5, 
body:nucleotidas, body:ii], 0, true)])) SpanOrQuery(spanOr([body:p49902, 
spanNear([body:cytosol, body:purin, body:5, body:nucleotidas], 0, true), 
spanNear([body:cytosol, body:5, body:nucleotidas, body:ii], 0, true), 
spanNear([body:ec, body:3.1.3.5], 0, true)]))
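
For reference, a minimal sketch of the query-side filter declaration this refers to, assuming the same text_en field type and synonyms.txt from the original message. Only the tokenizerFactory attribute is new; everything else is copied from the posted configuration:

```xml
<!-- Query-time synonym filter from the posted text_en field type.
     tokenizerFactory controls how the entries in synonyms.txt are tokenized;
     without it the filter falls back to whitespace tokenization. -->
<filter class="solr.SynonymGraphFilterFactory"
        synonyms="synonyms.txt"
        ignoreCase="true"
        expand="true"
        tokenizerFactory="solr.StandardTokenizerFactory"/>
```

With this in place the synonym entry "Cytosolic 5'-nucleotidase II" should be split into Cytosolic, 5, nucleotidase, II, the same tokens the query analyzer produces for the phrase.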

Best,
Andrea

On 05/09/18 16:10, Andrea Gazzarini wrote:
>
> You're right, my answer forgot to mention the *tokenizerFactory* 
> parameter that you can add in the filter declaration. However, contrary 
> to what you assume, the default tokenizer used for parsing the 
> synonyms _is not_ the tokenizer of the current analyzer 
> (StandardTokenizer in your example) but WhitespaceTokenizer. See 
> [1] for a complete description of the filter's capabilities.
>
> So instead of switching the analyzer tokenizer you could also add a 
> tokenizerFactory="solr.StandardTokenizerFactory" in the synonym filter 
> declaration.
>
> Best,
> Andrea
>
> [1] 
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-SynonymGraphFilter
>
> On 05/09/2018 15:58, Danilo Tomasoni wrote:
>> Hi Andrea,
>>
>> thank you for your answer.
>>
>> About the second question: the StandardTokenizer should also be 
>> applied to the phrase query, so the ' and - symbols should be removed 
>> there too, and this should allow a match in the synonym file, shouldn't it?
>>
>> With an example:
>>
>>
>> in phrase query:
>>
>> "Cytosolic 5'-nucleotidase II" -> standardTokenizer -> Cytosolic, 5, 
>> nucleotidase, II
>>
>>
>> in synonym parsing:
>>
>> ...,Cytosolic 5'-nucleotidase II,... -> standardTokenizer -> 
>> Cytosolic, 5, nucleotidase, II
>>
>>
>> So the two graphs should match... or am I wrong?
>> Thank you
>> Danilo
>>
>> On 05/09/2018 13:23, Andrea Gazzarini wrote:
>>> Hi Danilo,
>>> let's see if this can help you (I'm sorry for the poor debugging, 
>>> I'm reading & writing from my mobile): the first issue probably has 
>>> something to do with overlapping synonyms and, since I'm very curious 
>>> about what is happening, I will be more precise once I'm in 
>>> front of a laptop.
>>>
>>> The second: I guess the main problem is the StandardTokenizer, which 
>>> removes the ' and - symbols. That should be the reason why you don't 
>>> have any synonym detection. You should replace it with a 
>>> WhitespaceTokenizer but, be aware that if you do that, the 
>>> apostrophe in the document ( ′ ) is not the same symbol ( ' ) you've 
>>> used in the query and in the synonyms file, so you need to replace 
>>> it somewhere (in the document and/or in the query) otherwise you 
>>> won't have any match.
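>>>
>>> A minimal sketch of that replacement on the index side, as an 
>>> untested assumption rather than a verified configuration: a char 
>>> filter placed before the tokenizer maps the typographic characters 
>>> to the ASCII apostrophe. PatternReplaceCharFilterFactory is a 
>>> standard Solr char filter; the character class is only a guess at 
>>> the variants present in the documents.
>>>
>>> ```xml
>>> <analyzer type="index">
>>>   <!-- Assumption: map U+2032 (′) and U+2019 (’) to ' before tokenizing. -->
>>>   <charFilter class="solr.PatternReplaceCharFilterFactory"
>>>               pattern="[′’]" replacement="'"/>
>>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   <!-- remaining filters as in the existing field type -->
>>> </analyzer>
>>> ```
>>>
>>> Keep in mind that WhitespaceTokenizer leaves surrounding punctuation 
>>> attached to the tokens, so a term followed by a comma or wrapped in 
>>> parentheses will not match its bare form.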
>>>
>>> HTH
>>> Gazza
>>>
>>> On 05/09/2018 12:19, Danilo Tomasoni wrote:
>>>> Hello to all,
>>>>
>>>> I have an issue with SynonymGraphFilter expanding the wrong 
>>>> synonyms for a phrase term at query time.
>>>>
>>>> I have a dictionary with the following lines
>>>>
>>>> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
>>>> 5'-nucleotidase II
>>>> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, 
>>>> acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to 
>>>> Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, 
>>>> mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 
>>>> 3,cytosolic,GBA3\, mRNA
>>>>
>>>> and two documents
>>>>
>>>> {"body":"8. The method of claim 6 wherein said method inhibits at 
>>>> least one 5′-nucleotidase chosen from cytosolic 5′-nucleotidase II 
>>>> (cN-II), cytosolic 5′-nucleotidase IA (cN-IA), cytosolic 
>>>> 5′-nucleotidase IB (cN-IB), cytosolic 5′-nucleotidase IMA 
>>>> (cN-IIIA), cytosolic 5′-nucleotidase NIB (cN-IIIB), 
>>>> ecto-5′-nucleotidase (eN, CD73), cytosolic 5′(3′)-deoxynucleotidase 
>>>> (cdN) and mitochondrial 5′(3′)-deoxynucleotidase (mdN)."}
>>>> {"body":"Trichomonosis caused by the flagellate protozoan 
>>>> Trichomonas vaginalis represents the most prevalent nonviral 
>>>> sexually transmitted disease worldwide (WHO-DRHR 2012). In women, 
>>>> the symptoms are cyclic and often worsen around the menstruation 
>>>> period. In men, trichomonosis is largely asymptomatic and these men 
>>>> are considered to be carriers of T. vaginalis (Petrin et al. 1998). 
>>>> This infection has been associated with birth outcomes (Klebanoff 
>>>> et al. 2001), infertility (Grodstein et al. 1993), cervical and 
>>>> prostate cancer (Viikki et al. 2000, Sutcliffe et al. 2012) and 
>>>> pelvic inflammatory disease (Cherpes et al. 2006). Importantly, T. 
>>>> vaginalis is a co-factor in human immunodeficiency virus 
>>>> transmission and acquisition (Sorvillo et al. 2001, Van Der Pol et 
>>>> al. 2008). Therefore, it is important to study the host-parasite 
>>>> relationship to understand T. vaginalis infection and pathogenesis. 
>>>> Colonisation of the mucosa by T. vaginalis is a complex multi-step 
>>>> process that involves distinct mechanisms (Alderete et al. 2004). 
>>>> The parasite interacts with mucin (Lehker & Sweeney 1999), adheres 
>>>> to vaginal epithelial cells (VECs) in a process mediated by 
>>>> adhesion proteins (AP120, AP65, AP51, AP33 and AP23) and undergoes 
>>>> dramatic morphological changes from a pyriform to an amoeboid form 
>>>> (Engbring & Alderete 1998, Kucknoor et al. 2005, Moreno-Brito et 
>>>> al. 2005). After adhesion to VECs, the synthesis and gene 
>>>> expression of adhesins are increased (Kucknoor et al. 2005). These 
>>>> mechanisms must be tightly regulated and iron plays a pivotal role 
>>>> in this regulation. Iron is an essential element for all living 
>>>> organisms, from the most primitive to the most complex, as a 
>>>> component of haeme, iron-sulphur clusters and a variety of 
>>>> proteins. Iron is known to contribute to biological functions such 
>>>> as DNA and RNA synthesis, oxygen transport and metabolic reactions. 
>>>> T. vaginalis has developed multiple iron uptake systems such as 
>>>> receptors for hololactoferrin, haemoglobin (HB), haemin (HM) and 
>>>> haeme binding as well as adhesins to erythrocytes and epithelial 
>>>> cells (Moreno-Brito et al. 2005, Ardalan et al. 2009). Iron plays a 
>>>> crucial role in the pathogenesis of trichomonosis by increasing 
>>>> cytoadherence and modulating resistance to complement lyses, 
>>>> ligation to the extracellular matrix and the expression of 
>>>> proteases (Figueroa-Angulo et al. 2012). In agreement with this 
>>>> role, the symptoms of trichomonosis worsen after menstruation. In 
>>>> addition, iron also influences nucleotide hydrolysis in T. 
>>>> vaginalis (Tasca et al. 2005, de Jesus et al. 2006). The 
>>>> extracellular concentrations of ATP and adenosine can markedly 
>>>> increase under several conditions such as inflammation and hypoxia 
>>>> as well as in the presence of pathogens (Robson et al. 2006, Sansom 
>>>> 2012). In the extracellular medium, these nucleotides can act as 
>>>> immunomodulators by triggering immunological effects. Extracellular 
>>>> ATP acts as a proinflammatory immune-mediator by triggering 
>>>> multiple immunological effects on cell types such as neutrophils, 
>>>> macrophages, dendritic cells and lymphocytes (Bours et al. 2006). 
>>>> In this sense, ATP and adenosine concentrations in the 
>>>> extracellular compartment are controlled by ectoenzymes, including 
>>>> those of the nucleoside triphosphate diphosphohydrolase (NTPDase) 
>>>> (EC: 3.1.4.1) family, which hydrolyze tri and diphosphates and 
>>>> ecto-5’-nucleotidase (EC: 3.1.3.5), which hydrolyses monophosphates 
>>>> (Zimmermann 2001). Considering that de novo nucleotide synthesis is 
>>>> absent in T. vaginalis (Heyworth et al. 1982, 1984), this enzyme 
>>>> cascade is important as a source of the precursor adenosine for 
>>>> purine synthesis in the parasite (Munagala & Wang 2003). 
>>>> Extracellular nucleotide metabolism has been characterised in 
>>>> several parasite species such as Toxoplasma gondii, Schistosoma 
>>>> mansoni, Leishmania spp, Trypanosoma cruzi, Acanthamoeba, Entamoeba 
>>>> histolytica, Giardia lamblia and fungi, Saccharomyces cerevisiae, 
>>>> Cryptococcus neoformans, Candida parapsilosis and Candida albicans 
>>>> (Sansom 2012). In T. vaginalis , NTPDase and ecto-5’-nucleotidase 
>>>> activities have been characterised and they are involved in 
>>>> host-parasite interactions by controlling ATP and adenosine levels 
>>>> (Matos et al. 2001, d, de Jesus et al. 2002, Tasca et al. 2003). 
>>>> Considering that (i) iron plays a crucial role in the pathogenesis 
>>>> of trichomonosis, (ii) ATP exerts a proinflammatory effect in 
>>>> inflammation, (iii) adenosine is important to T. vaginalis growth 
>>>> and acts as an antiinflammatory factor (Frasson et al. 2012) and 
>>>> (iv) ectonucleotidases modulate the nucleotide levels at infection 
>>>> sites (such as those observed in trichomonosis), the aim of this 
>>>> study was to investigate the effect of iron on the extracellular 
>>>> nucleotide hydrolysis and gene expression of T . vaginalis."}
>>>>
>>>> The body field has the type "text_en", configured this way:
>>>>
>>>> <fieldType name="text_en"  class="solr.TextField" 
>>>> positionIncrementGap="100">
>>>>       <analyzer type="index">
>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>         <filter class="solr.StopFilterFactory"
>>>>                 ignoreCase="true"
>>>>                 words="lang/stopwords_en.txt"
>>>>             />
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>>         <filter class="solr.KeywordMarkerFilterFactory" 
>>>> protected="protwords.txt"/>
>>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>>       </analyzer>
>>>>       <analyzer type="query">
>>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>         <filter class="solr.StopFilterFactory"
>>>>                 ignoreCase="true"
>>>>                 words="lang/stopwords_en.txt"
>>>>         />
>>>>         <filter class="solr.SynonymGraphFilterFactory" 
>>>> synonyms="synonyms.txt"
>>>>             ignoreCase="true"  expand="true"/>
>>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>>         <filter class="solr.KeywordMarkerFilterFactory" 
>>>> protected="protwords.txt"/>
>>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>>       </analyzer>
>>>>     </fieldType>
>>>>
>>>> the two dictionary lines are in the file "synonyms.txt".
>>>>
>>>> If, in a Solr instance configured this way with those documents, 
>>>> I run the following query
>>>>
>>>> (body:"Cytosolic 5'-nucleotidase II"  OR body:"EC 3.1.3.5")
>>>>
>>>> both documents are returned.
>>>>
>>>> Surprisingly, if I run the query
>>>>
>>>> (body:"Cytosolic 5'-nucleotidase II")
>>>>
>>>> the second one is not returned.
>>>>
>>>> If I set debugQuery=true I see that the second line is expanded
>>>>
>>>> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, 
>>>> acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to 
>>>> Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, 
>>>> mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 
>>>> 3,cytosolic,GBA3\, mRNA
>>>>
>>>> instead of the first
>>>>
>>>> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
>>>> 5'-nucleotidase II
>>>>
>>>> The parsed query (given by debugquery) is
>>>>
>>>> "parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1, 
>>>> spanNear([body:glucosidase,, body:beta,, body:acid, body:3], 
>>>> 0,true), spanNear([body:cytosolic,, body:isoform, body:cra_b], 
>>>> 0,true), spanNear([body:cdna, body:flj78196,, body:highli, 
>>>> body:similar, body:to, body:homo, body:sapien, body:glucosidase,, 
>>>> body:beta,, body:acid, body:3], 0,true), body:cytosol, 
>>>> spanNear([body:gba3,, body:mrna], 0,true), spanNear([body:cdna,, 
>>>> body:flj93688,, body:homo, body:sapien, body:glucosidase,, 
>>>> body:beta,, body:acid, body:3], 0,true), body:cytosol]), body:5, 
>>>> body:nucleotidas, body:ii], 0,true))
>>>>
>>>> If I remove the second line, no synonym is expanded
>>>>
>>>>     "parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas 
>>>> ii\")",
>>>>
>>>> I think this is related to the word "cytosolic" that appears as a 
>>>> synonym in the second line. If I remove cytosolic as a synonym 
>>>> from the second line, then again no synonym is expanded.
>>>>
>>>> Can you tell me why this happens? I thought that the first line 
>>>> should be expanded, since it contains a multi-word synonym that 
>>>> exactly matches the phrase query.
>>>>
>>>> Thank you
>>>>
>>>
>>
>

