lucene-java-user mailing list archives

From Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
Subject Re: tokenize into sentences/sentence splitter
Date Wed, 23 Sep 2015 21:15:12 GMT
Further to this problem, I have created a custom tokenizer, but I cannot
get it loaded properly by Solr. The error stacktrace is:
----------------------------
Exception in thread "main" org.apache.solr.common.SolrException: SolrCore 'myproject' is not available due to init failure: Could not load conf for core myproject: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'. Schema file is D:\Work\myproject_github\myproject\solr-5.3.0\server\solr\myproject\conf\schema.xml
    at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:978)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:147)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:896)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:859)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:874)
    at my.app.Indexing.main(Indexing.java:31)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: org.apache.solr.common.SolrException: Could not load conf for core myproject: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'. Schema file is D:\Work\myproject_github\myproject\solr-5.3.0\server\solr\myproject\conf\schema.xml
    at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:80)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:725)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'. Schema file is D:\Work\myproject_github\myproject\solr-5.3.0\server\solr\myproject\conf\schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:596)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:175)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:104)
    at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:75)
    ... 8 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "myproject_text_2_sentences": Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:178)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
    ... 13 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:178)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:361)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:104)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:52)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
    ... 14 more
Caused by: org.apache.solr.common.SolrException: Error instantiating class: 'my.lucene.tokenizer.WholeSentenceTokenizerFactory'
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:578)
    at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:341)
    at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:334)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
    ... 18 more
Caused by: java.lang.NoSuchMethodException: my.lucene.tokenizer.WholeSentenceTokenizerFactory.<init>(java.util.Map)
    at java.lang.Class.getConstructor0(Class.java:3074)
    at java.lang.Class.getConstructor(Class.java:1817)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:569)
    ... 21 more
-----------------------------------



'WholeSentenceTokenizerFactory' looks like:
---------------------
package my.lucene.tokenizer;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

import java.text.BreakIterator;
import java.util.Map;

public class WholeSentenceTokenizerFactory extends TokenizerFactory {
     /**
      * Initialize this factory via a set of key-value pairs.
      *
      * @param args
      */
     protected WholeSentenceTokenizerFactory(Map<String, String> args) {
         super(args);
     }

     @Override
     public Tokenizer create(AttributeFactory factory) {
        return new WholeSentenceTokenizer(factory, BreakIterator.getSentenceInstance());
     }
}
-------------------------------------------------
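
Digging into the trace, the deepest "Caused by" is a plain JDK reflection failure: Class.getConstructor(...) only ever returns public constructors. The same NoSuchMethodException can be reproduced with a minimal standalone sketch (no Solr involved; the class names below are just stand-ins):
---------------------
import java.util.Map;

public class ConstructorLookupSketch {

    // Stand-in for a factory whose Map constructor is not public.
    static class ProtectedCtorFactory {
        protected ProtectedCtorFactory(Map<String, String> args) {
        }
    }

    public static void main(String[] args) throws Exception {
        // getConstructor(...) sees only public constructors, so this call throws
        // java.lang.NoSuchMethodException: ...ProtectedCtorFactory.<init>(java.util.Map),
        // the same exception type and signature as in the last "Caused by" above.
        ProtectedCtorFactory.class.getConstructor(Map.class);
    }
}
---------------------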


'WholeSentenceTokenizer':
-------------------------------------------------
package my.lucene.tokenizer;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;
import org.apache.lucene.util.AttributeFactory;

import java.text.BreakIterator;

public class WholeSentenceTokenizer extends SegmentingTokenizerBase {
    protected int sentenceStart, sentenceEnd;
    protected boolean hasSentence;

    private CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    public WholeSentenceTokenizer() {
        super(BreakIterator.getSentenceInstance());
    }

    public WholeSentenceTokenizer(BreakIterator iterator) {
        super(iterator);
    }

    public WholeSentenceTokenizer(AttributeFactory factory, BreakIterator iterator) {
        super(factory, iterator);
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;
    }

    @Override
    protected boolean incrementWord() {
        if (hasSentence) {
            hasSentence = false;
            clearAttributes();
            // buffer and offset are protected fields inherited from SegmentingTokenizerBase
            termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
            offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                    correctOffset(offset + sentenceEnd));
            return true;
        } else {
            return false;
        }
    }
}
-------------------------------------
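
For reference, the tokenizer can be exercised on its own, outside Solr, with the standard consume loop, roughly as below (the sample text is arbitrary):
---------------------
import my.lucene.tokenizer.WholeSentenceTokenizer;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import java.io.StringReader;

public class TokenizerSmokeTest {
    public static void main(String[] args) throws Exception {
        Tokenizer tok = new WholeSentenceTokenizer();
        // Both attributes were registered by the tokenizer itself via addAttribute().
        CharTermAttribute term = tok.getAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tok.getAttribute(OffsetAttribute.class);

        tok.setReader(new StringReader("First sentence here. Second sentence here."));
        tok.reset();
        while (tok.incrementToken()) {
            System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        tok.end();
        tok.close();
    }
}
---------------------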


Both classes are compiled into a jar, placed inside:
/myproject_github/myproject/solr-5.3.0/contrib/myproject

And solrconfig.xml points to the jar by defining a "lib" as:
<lib dir="${solr.install.dir:../../..}/contrib/myproject" regex=".*\.jar" />
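
As a quick sanity check on the packaging (the jar file name below is only a placeholder for whatever file actually sits in contrib/myproject), the presence of the class entry in the jar can be verified like this:
---------------------
import java.io.File;
import java.util.jar.JarFile;

public class JarEntryCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder jar name; substitute the real file in contrib/myproject.
        File jar = new File("D:/Work/myproject_github/myproject/solr-5.3.0/contrib/myproject/myproject-tokenizer.jar");

        try (JarFile jf = new JarFile(jar)) {
            // The factory has to live under exactly this path inside the jar for the
            // fully-qualified class name used in schema.xml to resolve.
            String entry = "my/lucene/tokenizer/WholeSentenceTokenizerFactory.class";
            System.out.println(entry + " present: " + (jf.getJarEntry(entry) != null));
        }
    }
}
---------------------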



Any suggestions as to what might be wrong?

Many thanks
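
PS: my understanding of Steve's suggestion below (splitting sentences outside the analysis chain and storing them) is roughly the following sketch; the field names "content" and "sentences" are only examples:
---------------------
import org.apache.solr.common.SolrInputDocument;

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitExample {

    /** Split text into sentences with the JDK BreakIterator. */
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is the first sentence. And here is the second one.";

        SolrInputDocument doc = new SolrInputDocument();
        // "content" would be the normally tokenized, indexed field;
        // "sentences" a stored-only multivalued field with one sentence per value.
        doc.addField("content", text);
        for (String sentence : sentences(text)) {
            doc.addField("sentences", sentence);
        }
        System.out.println(doc);
    }
}
---------------------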

On 23/09/2015 19:08, Steve Rowe wrote:
> Hi Ziqi,
>
> Lucene has support for sentence chunking - see SegmentingTokenizerBase, implemented
> in ThaiTokenizer and HMMChineseTokenizer.  There is an example in that class’s tests that
> creates tokens out of individual sentences: TestSegmentingTokenizerBase.WholeSentenceTokenizer.
>
> However, it sounds like you only need to store the sentences, not search against them,
> so I don’t think you need sentence *tokenization*.
>
> Why not simply use the JDK’s BreakIterator (or, as you say, OpenNLP) to do sentence splitting
> and add the sentences to the doc as stored fields?
>
> Steve
> www.lucidworks.com
>
>> On Sep 23, 2015, at 11:39 AM, Ziqi Zhang <ziqi.zhang@sheffield.ac.uk> wrote:
>>
>> Thanks that is understood.
>>
>> My application is a bit special in that I need both an indexed field with
>> standard tokenization and an unindexed but stored field of sentences. Both must be present
>> for each document.
>>
>> I could possibly make do with PatternTokenizer, but that is, of course, less accurate than,
>> e.g., wrapping the OpenNLP sentence splitter in a Lucene Tokenizer.
>>
>>
>>
>> On 23/09/2015 16:23, Doug Turnbull wrote:
>>> Sentence recognition is usually an NLP problem. Probably best handled
>>> outside of Solr. For example, you probably want to train and run a sentence
>>> recognition algorithm, inject a sentence delimiter, then use that delimiter
>>> as the basis for tokenization.
>>>
>>> More info on sentence recognition
>>> http://opennlp.apache.org/documentation/manual/opennlp.html
>>>
>>> On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zhang@sheffield.ac.uk>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I need a special kind of 'token' which is a sentence, so I need a
>>>> tokenizer that splits texts into sentences.
>>>>
>>>> I wonder if there is already such or similar implementations?
>>>>
>>>> If I have to implement it myself, I suppose I need to implement a subclass
>>>> of Tokenizer. Having looked at a few existing implementations, it does not
>>>> look very straightforward how to do it. A few pointers would be highly
>>>> appreciated.
>>>>
>>>> Many thanks
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>
>> -- 
>> Ziqi Zhang
>> Research Associate
>> Department of Computer Science
>> University of Sheffield
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


-- 
Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

