lucene-solr-user mailing list archives

From Andrea Gazzarini <a.gazzar...@sease.io>
Subject Re: SynonimGraphFilter expands wrong synonims
Date Wed, 05 Sep 2018 14:10:19 GMT
You're right, my answer forgot to mention the *tokenizerFactory* 
parameter that you can add to the filter declaration. However, contrary 
to what you assume, the default tokenizer used for parsing the synonyms 
file _is not_ the tokenizer of the current analyzer (StandardTokenizer 
in your example) but WhitespaceTokenizer. See [1] for a complete 
description of the filter's capabilities.

So, instead of switching the analyzer's tokenizer, you could also add 
tokenizerFactory="solr.StandardTokenizerFactory" to the synonym filter 
declaration.
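
As a minimal, untested sketch, the query-time filter from your text_en 
field type would then look roughly like this. Only the tokenizerFactory 
attribute is new; the other attributes are copied from your 
configuration, and I have not run it against your synonyms file:

```xml
<!-- Synonym filter of the query analyzer in the text_en field type. -->
<!-- tokenizerFactory tells the filter how to tokenize the entries of -->
<!-- synonyms.txt; without it the entries are split on whitespace only. -->
<filter class="solr.SynonymGraphFilterFactory"
        synonyms="synonyms.txt"
        ignoreCase="true"
        expand="true"
        tokenizerFactory="solr.StandardTokenizerFactory"/>
```

With this in place, the synonym entries and the query text should go 
through the same tokenizer, so ' and - are handled the same way on both 
sides.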

Best,
Andrea

[1] 
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-SynonymGraphFilter

On 05/09/2018 15:58, Danilo Tomasoni wrote:
> Hi Andrea,
>
> thank you for your answer.
>
> About the second question: the StandardTokenizer should also be 
> applied to the phrase query, so the ' and - symbols should be removed 
> there too, and this should allow a match in the synonym file, 
> shouldn't it?
>
> With an example:
>
>
> in phrase query:
>
> "Cytosolic 5'-nucleotidase II" -> standardTokenizer -> Cytosolic, 5, 
> nucleotidase, II
>
>
> in synonym parsing:
>
> ...,Cytosolic 5'-nucleotidase II,... -> standardTokenizer -> 
> Cytosolic, 5, nucleotidase, II
>
>
> So the two graphs should match... or am I wrong?
> Thank you
> Danilo
>
> On 05/09/2018 13:23, Andrea Gazzarini wrote:
>> Hi Danilo,
>> let's see if this can help you (I'm sorry for the poor debugging, I'm 
>> reading & writing from my mobile): the first issue should have 
>> something to do with overlapping synonyms and, since I'm very curious 
>> about what is happening, I will be more precise when I am in front 
>> of a laptop.
>>
>> The second: I guess the main problem is the StandardTokenizer, which 
>> removes the ' and - symbols. That should be the reason why you don't 
>> have any synonym detection. You should replace it with a 
>> WhitespaceTokenizer but, be aware that if you do that, the apostrophe 
>> in the document ( ′ ) is not the same symbol ( ' ) you've used in the 
>> query and in the synonyms file, so you need to replace it somewhere 
>> (in the document and/or in the query) otherwise you won't have any 
>> match.
>>
>> HTH
>> Gazza
>>
>> On 05/09/2018 12:19, Danilo Tomasoni wrote:
>>> Hello to all,
>>>
>>> I have an issue with SynonymGraphFilter expanding the wrong 
>>> synonyms for a phrase at query time.
>>>
>>> I have a dictionary with the following lines
>>>
>>> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
>>> 5'-nucleotidase II
>>> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, 
>>> acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to 
>>> Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, 
>>> mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 
>>> 3,cytosolic,GBA3\, mRNA
>>>
>>> and two documents
>>>
>>> {"body":"8. The method of claim 6 wherein said method inhibits at 
>>> least one 5′-nucleotidase chosen from cytosolic 5′-nucleotidase II 
>>> (cN-II), cytosolic 5′-nucleotidase IA (cN-IA), cytosolic 
>>> 5′-nucleotidase IB (cN-IB), cytosolic 5′-nucleotidase IMA (cN-IIIA), 
>>> cytosolic 5′-nucleotidase NIB (cN-IIIB), ecto-5′-nucleotidase (eN, 
>>> CD73), cytosolic 5′(3′)-deoxynucleotidase (cdN) and mitochondrial 
>>> 5′(3′)-deoxynucleotidase (mdN)."}
>>> {"body":"Trichomonosis caused by the flagellate protozoan 
>>> Trichomonas vaginalis represents the most prevalent nonviral 
>>> sexually transmitted disease worldwide (WHO-DRHR 2012). In women, 
>>> the symptoms are cyclic and often worsen around the menstruation 
>>> period. In men, trichomonosis is largely asymptomatic and these men 
>>> are considered to be carriers of T. vaginalis (Petrin et al. 1998). 
>>> This infection has been associated with birth outcomes (Klebanoff et 
>>> al. 2001), infertility (Grodstein et al. 1993), cervical and 
>>> prostate cancer (Viikki et al. 2000, Sutcliffe et al. 2012) and 
>>> pelvic inflammatory disease (Cherpes et al. 2006). Importantly, T. 
>>> vaginalis is a co-factor in human immunodeficiency virus 
>>> transmission and acquisition (Sorvillo et al. 2001, Van Der Pol et 
>>> al. 2008). Therefore, it is important to study the host-parasite 
>>> relationship to understand T. vaginalis infection and pathogenesis. 
>>> Colonisation of the mucosa by T. vaginalis is a complex multi-step 
>>> process that involves distinct mechanisms (Alderete et al. 2004). 
>>> The parasite interacts with mucin (Lehker & Sweeney 1999), adheres 
>>> to vaginal epithelial cells (VECs) in a process mediated by adhesion 
>>> proteins (AP120, AP65, AP51, AP33 and AP23) and undergoes dramatic 
>>> morphological changes from a pyriform to an amoeboid form (Engbring 
>>> & Alderete 1998, Kucknoor et al. 2005, Moreno-Brito et al. 2005). 
>>> After adhesion to VECs, the synthesis and gene expression of 
>>> adhesins are increased (Kucknoor et al. 2005). These mechanisms must 
>>> be tightly regulated and iron plays a pivotal role in this 
>>> regulation. Iron is an essential element for all living organisms, 
>>> from the most primitive to the most complex, as a component of 
>>> haeme, iron-sulphur clusters and a variety of proteins. Iron is 
>>> known to contribute to biological functions such as DNA and RNA 
>>> synthesis, oxygen transport and metabolic reactions. T. vaginalis 
>>> has developed multiple iron uptake systems such as receptors for 
>>> hololactoferrin, haemoglobin (HB), haemin (HM) and haeme binding as 
>>> well as adhesins to erythrocytes and epithelial cells (Moreno-Brito 
>>> et al. 2005, Ardalan et al. 2009). Iron plays a crucial role in the 
>>> pathogenesis of trichomonosis by increasing cytoadherence and 
>>> modulating resistance to complement lyses, ligation to the 
>>> extracellular matrix and the expression of proteases 
>>> (Figueroa-Angulo et al. 2012). In agreement with this role, the 
>>> symptoms of trichomonosis worsen after menstruation. In addition, 
>>> iron also influences nucleotide hydrolysis in T. vaginalis (Tasca et 
>>> al. 2005, de Jesus et al. 2006). The extracellular concentrations of 
>>> ATP and adenosine can markedly increase under several conditions 
>>> such as inflammation and hypoxia as well as in the presence of 
>>> pathogens (Robson et al. 2006, Sansom 2012). In the extracellular 
>>> medium, these nucleotides can act as immunomodulators by triggering 
>>> immunological effects. Extracellular ATP acts as a proinflammatory 
>>> immune-mediator by triggering multiple immunological effects on cell 
>>> types such as neutrophils, macrophages, dendritic cells and 
>>> lymphocytes (Bours et al. 2006). In this sense, ATP and adenosine 
>>> concentrations in the extracellular compartment are controlled by 
>>> ectoenzymes, including those of the nucleoside triphosphate 
>>> diphosphohydrolase (NTPDase) (EC: 3.1.4.1) family, which hydrolyze 
>>> tri and diphosphates and ecto-5’-nucleotidase (EC: 3.1.3.5), which 
>>> hydrolyses monophosphates (Zimmermann 2001). Considering that de 
>>> novo nucleotide synthesis is absent in T. vaginalis (Heyworth et al. 
>>> 1982, 1984), this enzyme cascade is important as a source of the 
>>> precursor adenosine for purine synthesis in the parasite (Munagala & 
>>> Wang 2003). Extracellular nucleotide metabolism has been 
>>> characterised in several parasite species such as Toxoplasma gondii, 
>>> Schistosoma mansoni, Leishmania spp, Trypanosoma cruzi, 
>>> Acanthamoeba, Entamoeba histolytica, Giardia lamblia and fungi, 
>>> Saccharomyces cerevisiae, Cryptococcus neoformans, Candida 
>>> parapsilosis and Candida albicans (Sansom 2012). In T. vaginalis , 
>>> NTPDase and ecto-5’-nucleotidase activities have been characterised 
>>> and they are involved in host-parasite interactions by controlling 
>>> ATP and adenosine levels (Matos et al. 2001, d, de Jesus et al. 
>>> 2002, Tasca et al. 2003). Considering that (i) iron plays a crucial 
>>> role in the pathogenesis of trichomonosis, (ii) ATP exerts a 
>>> proinflammatory effect in inflammation, (iii) adenosine is important 
>>> to T. vaginalis growth and acts as an antiinflammatory factor 
>>> (Frasson et al. 2012) and (iv) ectonucleotidases modulate the 
>>> nucleotide levels at infection sites (such as those observed in 
>>> trichomonosis), the aim of this study was to investigate the effect 
>>> of iron on the extracellular nucleotide hydrolysis and gene 
>>> expression of T . vaginalis."}
>>>
>>> Body has the type "text_en" configured in this way
>>>
>>> <fieldType name="text_en"  class="solr.TextField" 
>>> positionIncrementGap="100">
>>>       <analyzer type="index">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.StopFilterFactory"
>>>                 ignoreCase="true"
>>>                 words="lang/stopwords_en.txt"
>>>             />
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>         <filter class="solr.KeywordMarkerFilterFactory" 
>>> protected="protwords.txt"/>
>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>       </analyzer>
>>>       <analyzer type="query">
>>>         <tokenizer class="solr.StandardTokenizerFactory"/>
>>>         <filter class="solr.StopFilterFactory"
>>>                 ignoreCase="true"
>>>                 words="lang/stopwords_en.txt"
>>>         />
>>>         <filter class="solr.SynonymGraphFilterFactory" 
>>> synonyms="synonyms.txt"
>>>             ignoreCase="true"  expand="true"/>
>>>         <filter class="solr.LowerCaseFilterFactory"/>
>>>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>>>         <filter class="solr.KeywordMarkerFilterFactory" 
>>> protected="protwords.txt"/>
>>>         <filter class="solr.PorterStemFilterFactory"/>
>>>       </analyzer>
>>>     </fieldType>
>>>
>>> the two dictionary lines are in the file "synonyms.txt".
>>>
>>> If, in a Solr instance configured this way and containing those 
>>> documents, I run the following query
>>>
>>> (body:"Cytosolic 5'-nucleotidase II"  OR body:"EC 3.1.3.5")
>>>
>>> both documents are returned.
>>>
>>> Surprisingly, if I run the query
>>>
>>> (body:"Cytosolic 5'-nucleotidase II")
>>>
>>> the second one is not returned.
>>>
>>> If I set debugQuery=true I see that the second dictionary line is 
>>> expanded
>>>
>>> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, 
>>> acid 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to 
>>> Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, 
>>> mRNA,cDNA\, FLJ93688\, Homo sapiens glucosidase\, beta\, acid 
>>> 3,cytosolic,GBA3\, mRNA
>>>
>>> instead of the first
>>>
>>> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
>>> 5'-nucleotidase II
>>>
>>> The parsed query (given by debugQuery) is
>>>
>>> "parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1, 
>>> spanNear([body:glucosidase,, body:beta,, body:acid, body:3], 
>>> 0,true), spanNear([body:cytosolic,, body:isoform, body:cra_b], 
>>> 0,true), spanNear([body:cdna, body:flj78196,, body:highli, 
>>> body:similar, body:to, body:homo, body:sapien, body:glucosidase,, 
>>> body:beta,, body:acid, body:3], 0,true), body:cytosol, 
>>> spanNear([body:gba3,, body:mrna], 0,true), spanNear([body:cdna,, 
>>> body:flj93688,, body:homo, body:sapien, body:glucosidase,, 
>>> body:beta,, body:acid, body:3], 0,true), body:cytosol]), body:5, 
>>> body:nucleotidas, body:ii], 0,true))
>>>
>>> If I remove the second line, no synonym is expanded
>>>
>>>     "parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas 
>>> ii\")",
>>>
>>> I think this is related to the word "cytosolic", which appears as a 
>>> synonym in the second line. If I remove cytosolic as a synonym from 
>>> the second line, then again no synonym is expanded.
>>>
>>> Can you tell me why this happens? I thought that the first line 
>>> should be expanded, since it contains a multi-word synonym that 
>>> exactly matches the phrase query.
>>>
>>> Thank you
>>>
>>
>

