lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: German Compound Splitter words.fst causing problems.
Date Wed, 25 Mar 2015 22:14:52 GMT
Hello Chris, I don't know the token filter you mention, but I would recommend Lucene's
HyphenationCompoundWordTokenFilter. It works reasonably well if you provide the hyphenation
rules and a dictionary. It has some flaws, such as decompounding into irrelevant subwords,
overlapping subwords, or subwords that do not add up to the whole compound word (minus
genitives), but these can be fixed.
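
To illustrate the last of those fixes, here is a minimal plain-Java sketch of the idea. It is not Lucene's API and not the filter's actual algorithm; the class name, method names, and the tiny dictionary are hypothetical. It assumes a lowercase dictionary and accepts a split only if the subwords cover the whole compound, optionally skipping a linking "s" between parts.

```java
import java.util.*;

// Hypothetical helper, for illustration only (not part of Lucene).
class CompoundCoverSketch {

    // Returns a split of `word` into dictionary entries that covers the
    // whole word, or an empty list if no such split exists.
    static List<String> split(String word, Set<String> dict, int minSubwordSize) {
        List<String> parts = new ArrayList<>();
        String w = word.toLowerCase(Locale.ROOT);
        return splitFrom(w, 0, dict, minSubwordSize, parts)
                ? parts
                : Collections.<String>emptyList();
    }

    private static boolean splitFrom(String w, int start, Set<String> dict,
                                     int min, List<String> parts) {
        if (start == w.length()) {
            return parts.size() > 1; // a compound needs at least two parts
        }
        // Try the longest candidate subword first.
        for (int end = w.length(); end - start >= min; end--) {
            String part = w.substring(start, end);
            if (!dict.contains(part)) {
                continue;
            }
            parts.add(part);
            if (splitFrom(w, end, dict, min, parts)) {
                return true;
            }
            // Allow a linking "s" (genitive) between two parts.
            if (end < w.length() && w.charAt(end) == 's'
                    && splitFrom(w, end + 1, dict, min, parts)) {
                return true;
            }
            parts.remove(parts.size() - 1); // backtrack
        }
        return false;
    }
}
```

With a dictionary containing only "arbeit" and "amt", split("Arbeitsamt", dict, 3) yields the two parts "arbeit" and "amt", while a word such as "Dampfer" with only "dampf" in the dictionary is left unsplit because the remainder is not covered.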

Markus
 
-----Original message-----
> From:Chris Morley <chris@depahelix.com>
> Sent: Wednesday 25th March 2015 17:59
> To: solr-user@lucene.apache.org
> Subject: German Compound Splitter words.fst causing problems.
> 
> Hello, Chris Morley here, of Wayfair.com. I am working on the German compound-splitter
> by Dawid Weiss.
>   
>   I tried to "upgrade" the words.fst file that comes with the German compound-splitter
>   using Solr 3.5, but it doesn't work. Below is the IndexNotFoundException that I get.
>   
>  cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader wordsFst
>  Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
>                  at org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
>                  at org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)
>   
>  The reason I'm attempting this at all is the answer here,
>  http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7,
>  which says to do the upgrade as a two-step process, first using Solr 3.5 and then the
>  latest Solr version (4.10.3).  When I run the unit tests for my modified German
>  compound-splitter, I get this same type of error.  The confusing part is that this is an
>  FST, not an index.  I'm following that answer anyway because I get the exact same message
>  when building the (modified) project with Maven, at the point where it tries to load
>  words.fst. See below.
>   
>  [main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter - Format version is not supported (resource: com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 (needs to be between 3 and 4). This version of Lucene only supports indexes created with release 3.0 and later.  Failed to initialize static data structures for German compound splitter.
>   
>  Thanks,
>  -Chris.
> 
> 
> 
