lucene-solr-user mailing list archives

From Andrea Gazzarini <a.gazzar...@sease.io>
Subject Re: SynonimGraphFilter expands wrong synonims
Date Wed, 05 Sep 2018 11:23:30 GMT
Hi Danilo,
let's see if this can help you (sorry for the limited debugging, I'm 
reading and writing from my mobile): the first issue probably has 
something to do with overlapping synonyms. I'm very curious about what 
is happening there, so I will be more precise once I'm in front of a laptop.

The second: I guess the main problem is the StandardTokenizer, which 
drops the ' and - symbols (your parsed query shows 5'-nucleotidase 
split into 5 and nucleotidas). That is probably why the multi-word 
synonym is not detected. You should replace it with a 
WhitespaceTokenizer, but be aware that if you do, the apostrophe in 
the document ( ′ ) is not the same symbol ( ' ) you've used in the query 
and in the synonyms file, so you need to replace it somewhere (in the 
document and/or in the query), otherwise you won't get any match.

HTH
Gazza

On 05/09/2018 12:19, Danilo Tomasoni wrote:
> Hello to all,
>
> I have an issue related to SynonymGraphFilter expanding the wrong 
> synonyms for a phrase-term at query time.
>
> I have a dictionary with the following lines
>
> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
> 5'-nucleotidase II
> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 
> 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo 
> sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, 
> FLJ93688\, Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA
>
> and two documents
>
> {"body":"8. The method of claim 6 wherein said method inhibits at 
> least one 5′-nucleotidase chosen from cytosolic 5′-nucleotidase II 
> (cN-II), cytosolic 5′-nucleotidase IA (cN-IA), cytosolic 
> 5′-nucleotidase IB (cN-IB), cytosolic 5′-nucleotidase IMA (cN-IIIA), 
> cytosolic 5′-nucleotidase NIB (cN-IIIB), ecto-5′-nucleotidase (eN, 
> CD73), cytosolic 5′(3′)-deoxynucleotidase (cdN) and mitochondrial 
> 5′(3′)-deoxynucleotidase (mdN)."}
> {"body":"Trichomonosis caused by the flagellate protozoan Trichomonas 
> vaginalis represents the most prevalent nonviral sexually transmitted 
> disease worldwide (WHO-DRHR 2012). In women, the symptoms are cyclic 
> and often worsen around the menstruation period. In men, trichomonosis 
> is largely asymptomatic and these men are considered to be carriers of 
> T. vaginalis (Petrin et al. 1998). This infection has been associated 
> with birth outcomes (Klebanoff et al. 2001), infertility (Grodstein et 
> al. 1993), cervical and prostate cancer (Viikki et al. 2000, Sutcliffe 
> et al. 2012) and pelvic inflammatory disease (Cherpes et al. 2006). 
> Importantly, T. vaginalis is a co-factor in human immunodeficiency 
> virus transmission and acquisition (Sorvillo et al. 2001, Van Der Pol 
> et al. 2008). Therefore, it is important to study the host-parasite 
> relationship to understand T. vaginalis infection and pathogenesis. 
> Colonisation of the mucosa by T. vaginalis is a complex multi-step 
> process that involves distinct mechanisms (Alderete et al. 2004). The 
> parasite interacts with mucin (Lehker & Sweeney 1999), adheres to 
> vaginal epithelial cells (VECs) in a process mediated by adhesion 
> proteins (AP120, AP65, AP51, AP33 and AP23) and undergoes dramatic 
> morphological changes from a pyriform to an amoeboid form (Engbring & 
> Alderete 1998, Kucknoor et al. 2005, Moreno-Brito et al. 2005). After 
> adhesion to VECs, the synthesis and gene expression of adhesins are 
> increased (Kucknoor et al. 2005). These mechanisms must be tightly 
> regulated and iron plays a pivotal role in this regulation. Iron is an 
> essential element for all living organisms, from the most primitive to 
> the most complex, as a component of haeme, iron-sulphur clusters and a 
> variety of proteins. Iron is known to contribute to biological 
> functions such as DNA and RNA synthesis, oxygen transport and 
> metabolic reactions. T. vaginalis has developed multiple iron uptake 
> systems such as receptors for hololactoferrin, haemoglobin (HB), 
> haemin (HM) and haeme binding as well as adhesins to erythrocytes and 
> epithelial cells (Moreno-Brito et al. 2005, Ardalan et al. 2009). Iron 
> plays a crucial role in the pathogenesis of trichomonosis by 
> increasing cytoadherence and modulating resistance to complement 
> lyses, ligation to the extracellular matrix and the expression of 
> proteases (Figueroa-Angulo et al. 2012). In agreement with this role, 
> the symptoms of trichomonosis worsen after menstruation. In addition, 
> iron also influences nucleotide hydrolysis in T. vaginalis (Tasca et 
> al. 2005, de Jesus et al. 2006). The extracellular concentrations of 
> ATP and adenosine can markedly increase under several conditions such 
> as inflammation and hypoxia as well as in the presence of pathogens 
> (Robson et al. 2006, Sansom 2012). In the extracellular medium, these 
> nucleotides can act as immunomodulators by triggering immunological 
> effects. Extracellular ATP acts as a proinflammatory immune-mediator 
> by triggering multiple immunological effects on cell types such as 
> neutrophils, macrophages, dendritic cells and lymphocytes (Bours et 
> al. 2006). In this sense, ATP and adenosine concentrations in the 
> extracellular compartment are controlled by ectoenzymes, including 
> those of the nucleoside triphosphate diphosphohydrolase (NTPDase) (EC: 
> 3.1.4.1) family, which hydrolyze tri and diphosphates and 
> ecto-5’-nucleotidase (EC: 3.1.3.5), which hydrolyses monophosphates 
> (Zimmermann 2001). Considering that de novo nucleotide synthesis is 
> absent in T. vaginalis (Heyworth et al. 1982, 1984), this enzyme 
> cascade is important as a source of the precursor adenosine for purine 
> synthesis in the parasite (Munagala & Wang 2003). Extracellular 
> nucleotide metabolism has been characterised in several parasite 
> species such as Toxoplasma gondii, Schistosoma mansoni, Leishmania 
> spp, Trypanosoma cruzi, Acanthamoeba, Entamoeba histolytica, Giardia 
> lamblia and fungi, Saccharomyces cerevisiae, Cryptococcus neoformans, 
> Candida parapsilosis and Candida albicans (Sansom 2012). In T. 
> vaginalis , NTPDase and ecto-5’-nucleotidase activities have been 
> characterised and they are involved in host-parasite interactions by 
> controlling ATP and adenosine levels (Matos et al. 2001, d, de Jesus 
> et al. 2002, Tasca et al. 2003). Considering that (i) iron plays a 
> crucial role in the pathogenesis of trichomonosis, (ii) ATP exerts a 
> proinflammatory effect in inflammation, (iii) adenosine is important 
> to T. vaginalis growth and acts as an antiinflammatory factor (Frasson 
> et al. 2012) and (iv) ectonucleotidases modulate the nucleotide levels 
> at infection sites (such as those observed in trichomonosis), the aim 
> of this study was to investigate the effect of iron on the 
> extracellular nucleotide hydrolysis and gene expression of T . 
> vaginalis."}
>
> Body has the type "text_en" configured in this way
>
> <fieldType name="text_en"  class="solr.TextField" 
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="lang/stopwords_en.txt"
>             />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory" 
> protected="protwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="lang/stopwords_en.txt"
>         />
>         <filter class="solr.SynonymGraphFilterFactory" 
> synonyms="synonyms.txt"
>             ignoreCase="true"  expand="true"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPossessiveFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory" 
> protected="protwords.txt"/>
>         <filter class="solr.PorterStemFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
> the two dictionary lines are in the file "synonyms.txt".
>
> If, in a Solr instance configured this way with those documents, I 
> run the following query
>
> (body:"Cytosolic 5'-nucleotidase II"  OR body:"EC 3.1.3.5")
>
> both documents are returned.
>
> Surprisingly, if I run the query
>
> (body:"Cytosolic 5'-nucleotidase II")
>
> the second one is not returned.
>
> If I set debugQuery=true I see that the second line is expanded
>
> A8K9N1,Glucosidase\, beta\, acid 3,Cytosolic,Glucosidase\, beta\, acid 
> 3,Cytosolic\, isoform CRA_b,cDNA FLJ78196\, highly similar to Homo 
> sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA,cDNA\, 
> FLJ93688\, Homo sapiens glucosidase\, beta\, acid 3,cytosolic,GBA3\, mRNA
>
> instead of the first
>
> P49902,Cytosolic purine 5'-nucleotidase,EC 3.1.3.5,Cytosolic 
> 5'-nucleotidase II
>
> The parsed query (given by debugquery) is
>
> "parsedquery":"SpanNearQuery(spanNear([spanOr([body:a8k9n1, 
> spanNear([body:glucosidase,, body:beta,, body:acid, body:3], 0,true), 
> spanNear([body:cytosolic,, body:isoform, body:cra_b], 0,true), 
> spanNear([body:cdna, body:flj78196,, body:highli, body:similar, 
> body:to, body:homo, body:sapien, body:glucosidase,, body:beta,, 
> body:acid, body:3], 0,true), body:cytosol, spanNear([body:gba3,, 
> body:mrna], 0,true), spanNear([body:cdna,, body:flj93688,, body:homo, 
> body:sapien, body:glucosidase,, body:beta,, body:acid, body:3], 
> 0,true), body:cytosol]), body:5, body:nucleotidas, body:ii], 0,true))
>
> If I remove the second line, no synonym is expanded
>
>     "parsedquery":"PhraseQuery(body_unnamed:\"cytosol 5 nucleotidas 
> ii\")",
>
> I think this is related to the word "cytosolic" that appears as a 
> synonym in the second line. If I remove cytosolic as a synonym from 
> the second line, then again no synonym is expanded.
>
> Can you tell me why this happens? I thought that the first line should 
> be expanded since it has a multi-word synonym in it that matches the 
> phrase query exactly.
>
> Thank you
>

