lucene-solr-user mailing list archives

From Kai Gülzau <kguel...@novomind.com>
Subject Indexing nouns only with UIMA works - performance issue?
Date Fri, 01 Feb 2013 10:17:08 GMT
I now use the "stupid" way to make UIMA use the German corpus: copy + paste :-)

I modified Tagger-2.3.1.jar/HmmTagger.xml to point to the German model
...
<fileResourceSpecifier>
  <fileUrl>file:german/TuebaModel.dat</fileUrl>
</fileResourceSpecifier>
...
and saved it as Tagger-2.3.1.jar/HmmTaggerDE.xml


Next step is to replace every occurrence of "HmmTagger" in
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceAE.xml
with "HmmTaggerDE" and save it as
lucene-analyzers-uima-4.1.0.jar/uima/AggregateSentenceDEAE.xml
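
The copy + paste steps above could also be scripted. The following is only a rough sketch, assuming the descriptors are UTF-8 text and sit at the entry paths named above; the helper name copy_entry_with_replacements is made up for illustration and is not part of Solr or UIMA.

```python
import zipfile


def copy_entry_with_replacements(jar_path, src_entry, dst_entry, replacements):
    """Read src_entry from the jar, apply plain-text replacements,
    and append the result to the same jar as dst_entry.

    Assumption: the descriptor is UTF-8 encoded text.
    """
    # Read the original descriptor first and close the archive again.
    with zipfile.ZipFile(jar_path, "r") as jar:
        text = jar.read(src_entry).decode("utf-8")

    # Apply every old -> new substitution as a literal string replacement.
    for old, new in replacements.items():
        text = text.replace(old, new)

    # Re-open in append mode so the existing entries stay untouched.
    with zipfile.ZipFile(jar_path, "a", compression=zipfile.ZIP_DEFLATED) as jar:
        jar.writestr(dst_entry, text)
    return text


# Example call for the aggregate descriptor step (paths as in this message):
# copy_entry_with_replacements(
#     "lucene-analyzers-uima-4.1.0.jar",
#     "uima/AggregateSentenceAE.xml",
#     "uima/AggregateSentenceDEAE.xml",
#     {"HmmTagger": "HmmTaggerDE"},
# )
```

The same helper would cover the HmmTagger.xml -> HmmTaggerDE.xml step if the replacement maps the original fileUrl value to file:german/TuebaModel.dat. Appending a new entry leaves the original descriptor in place, so the English setup keeps working.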

This can be used in your schema.xml:
<fieldType name="uima_nouns_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceDEAE.xml" tokenType="org.apache.uima.TokenAnnotation"
      featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" useWhitelist="true" types="/uima/whitelist_de.txt" />
  </analyzer>
</fieldType>
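
To illustrate what the whitelist filter in this field type is meant to do, here is a small stand-alone sketch. It is not Lucene code. The POS tags below are assumed STTS-style tags (NN, NE for nouns), written by hand for illustration; the real tagger output and the real contents of whitelist_de.txt may differ.

```python
# Hypothetical (token, posTag) pairs for the test sentence used in this message.
# The tags are hand-written assumptions, not actual tagger output.
TAGGED = [
    ("Klaus", "NE"), ("geht", "VVFIN"), ("in", "APPR"), ("das", "ART"),
    ("Haus", "NN"), ("und", "KON"), ("sieht", "VVFIN"), ("eine", "ART"),
    ("Maus", "NN"), (".", "$."),
]

# Assumed whitelist: one token type per line in the real file.
WHITELIST = {"NN", "NE"}


def filter_by_type(tagged, types, use_whitelist=True):
    """Keep tokens whose type is in `types` (whitelist mode),
    or drop them (stop-type mode, use_whitelist=False)."""
    return [tok for tok, typ in tagged if (typ in types) == use_whitelist]


nouns = filter_by_type(TAGGED, WHITELIST)
```

With these assumed tags only Klaus, Haus and Maus remain. Passing use_whitelist=False gives the stop-type behaviour that a types file has without useWhitelist="true".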

There should be a way to accomplish this via config though.



Last open issue: Performance!

First run via Admin GUI analysis, index value "Klaus geht in das Haus und sieht eine Maus." / query "": ~ 5 seconds
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize	Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit	Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize	Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit	Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize	Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit	Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer finished processing"

Second run via Admin GUI analysis, index value "Klaus geht in das Haus und sieht eine Maus." / query "": ~ 4 seconds
Feb 01, 2013 11:07:31 AM WhitespaceTokenizer initialize	Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer typeSystemInit	Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:32 AM WhitespaceTokenizer initialize	Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer typeSystemInit	Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:07:33 AM WhitespaceTokenizer initialize	Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer typeSystemInit	Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:07:34 AM WhitespaceTokenizer process		Information: "Whitespace tokenizer finished processing"

The tokenizer is initialized 3 times for a single analysis request, on both runs?
I think some of the components are not reused between analysis passes.

Is this a known issue?
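
To back up the suspicion with numbers, the first-run log can be summarized with a short script. This is only a rough sketch: the records are the first-run lines quoted above with whitespace normalized, the timestamps have one-second resolution, and parsing the month name assumes an English/C locale.

```python
import re
from datetime import datetime

# First-run records from the log above, whitespace normalized.
LOG = """\
Feb 01, 2013 11:01:00 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:02 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"
Feb 01, 2013 11:01:03 AM WhitespaceTokenizer initialize Information: "Whitespace tokenizer successfully initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer typeSystemInit Information: "Whitespace tokenizer typesystem initialized"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information: "Whitespace tokenizer starts processing"
Feb 01, 2013 11:01:05 AM WhitespaceTokenizer process Information: "Whitespace tokenizer finished processing"
"""

# Timestamp, then class name (ignored), then the logging method name.
RECORD = re.compile(r"^(\w{3} \d{2}, \d{4} \d{2}:\d{2}:\d{2} [AP]M)\s+\w+\s+(\w+)\s")


def parse(log_text):
    """Return a list of (timestamp, method) tuples, one per log record."""
    records = []
    for line in log_text.splitlines():
        match = RECORD.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
            records.append((stamp, match.group(2)))
    return records


def summarize(records):
    """Count initializations, measure each initialize -> typeSystemInit gap
    in seconds, and measure the span from first to last record."""
    init_count = sum(1 for _, method in records if method == "initialize")
    init_gaps = [
        (t1 - t0).total_seconds()
        for (t0, m0), (t1, m1) in zip(records, records[1:])
        if m0 == "initialize" and m1 == "typeSystemInit"
    ]
    total = (records[-1][0] - records[0][0]).total_seconds()
    return init_count, init_gaps, total


init_count, init_gaps, total = summarize(parse(LOG))
```

For the first run this gives 3 initializations with gaps of 2, 1 and 2 seconds, which add up to the whole 5 second span. So at this resolution the time goes into setup between initialize and typeSystemInit, while the processing steps themselves take under a second.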


Regards,

Kai Gülzau



-----Original Message-----
From: Kai Gülzau [mailto:kguelzau@novomind.com] 
Sent: Thursday, January 31, 2013 6:48 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing nouns only - UIMA vs. OpenNLP

UIMA:

I just found this issue https://issues.apache.org/jira/browse/SOLR-3013
Now I am able to use this analyzer for English texts and filter (un)wanted token types :-)

<fieldType name="uima_nouns_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UIMATypeAwareAnnotationsTokenizerFactory"
      descriptorPath="/uima/AggregateSentenceAE.xml" tokenType="org.apache.uima.TokenAnnotation"
      featurePath="posTag"/>
    <filter class="solr.TypeTokenFilterFactory" types="/uima/stoptypes.txt" />
  </analyzer>
</fieldType>

Open issue -> How to set the ModelFile for the Tagger to "german/TuebaModel.dat" ???


Kai Gülzau
