lucene-solr-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: Lexical analysis tools for German language data
Date Thu, 12 Apr 2012 12:09:04 GMT
Paul,

Nearly two years ago I requested an evaluation license and tested BASIS Tech
Rosette for Lucene & Solr. It worked excellently, but the price was much, much too high.

Yes, they also have compound analysis for several languages including German.
Just configure your pipeline in Solr, set up the processing pipeline in
Rosette Language Processing (RLP), and that's it.

Example from my very old schema.xml config:

<fieldtype name="text_rlp" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                rlpContext="solr/conf/rlp-index-context.xml"
                postPartOfSpeech="false"
                postLemma="true"
                postStem="true"
                postCompoundComponents="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
                rlpContext="solr/conf/rlp-query-context.xml"
                postPartOfSpeech="false"
                postLemma="true"
                postCompoundComponents="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
</fieldtype>

So you just point the tokenizer to RLP and have two RLP pipelines configured,
one for indexing (rlp-index-context.xml) and one for querying (rlp-query-context.xml).
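
A field then has to reference this type in schema.xml so that the two analyzers
are actually applied. A minimal sketch, assuming a standard Solr field declaration;
the field name "text_de" is only a placeholder and is not taken from the original config:

```xml
<!-- Hypothetical field: "text_de" is an illustrative name, not from the original schema.xml -->
<!-- type="text_rlp" refers to the fieldtype defined above -->
<field name="text_de" type="text_rlp" indexed="true" stored="true"/>
```

Documents indexed into such a field would go through the "index" analyzer, and
queries against it through the "query" analyzer.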

Example from my rlp-index-context.xml config:

<contextconfig>
  <properties>
    <property name="com.basistech.rex.optimize" value="false"/>
    <property name="com.basistech.ela.retokenize_for_rex" value="true"/>
  </properties>
  <languageprocessors>
    <languageprocessor>Unicode Converter</languageprocessor>
    <languageprocessor>Language Identifier</languageprocessor>
    <languageprocessor>Encoding and Character Normalizer</languageprocessor>
    <languageprocessor>European Language Analyzer</languageprocessor>
<!--    <languageprocessor>Script Region Locator</languageprocessor>
    <languageprocessor>Japanese Language Analyzer</languageprocessor>
    <languageprocessor>Chinese Language Analyzer</languageprocessor>
    <languageprocessor>Korean Language Analyzer</languageprocessor>
    <languageprocessor>Sentence Breaker</languageprocessor>
    <languageprocessor>Word Breaker</languageprocessor>
    <languageprocessor>Arabic Language Analyzer</languageprocessor>
    <languageprocessor>Persian Language Analyzer</languageprocessor>
    <languageprocessor>Urdu Language Analyzer</languageprocessor> -->
    <languageprocessor>Stopword Locator</languageprocessor>
    <languageprocessor>Base Noun Phrase Locator</languageprocessor>
<!--    <languageprocessor>Statistical Entity Extractor</languageprocessor> -->
    <languageprocessor>Exact Match Entity Extractor</languageprocessor>
    <languageprocessor>Pattern Match Entity Extractor</languageprocessor>
    <languageprocessor>Entity Redactor</languageprocessor>
    <languageprocessor>REXML Writer</languageprocessor>
  </languageprocessors>
</contextconfig>

As you can see I used the "European Language Analyzer".

Bernd



On 12.04.2012 12:58, Paul Libbrecht wrote:
> Bernd,
> 
> can you please say a little more?
> I think it is OK for this list to contain some description of commercial solutions that satisfy
> a request formulated on the list.
> 
> Is there any product at BASIS Tech that provides a compound analyzer with a big dictionary
> of decomposed compounds in German?
> If yes, for which domain?
> The Google search result (I wonder if it is politically correct not to have yours ;-))
> shows me that there's a fair amount of work done in this direction (e.g. Gärten to match
> Garten), but being precise on this question would be more helpful!
> 
> paul
> 
> 
