lucene-solr-user mailing list archives

From Robin Wojciki <robin.wojc...@gmail.com>
Subject HTML Stripping slower in Solr 1.4?
Date Tue, 01 Dec 2009 04:18:39 GMT
Hello,

Our schema in Solr 1.3 looked like:

<tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>

It takes 30s to index 1500 docs. When we run the same in Solr 1.4, it takes 70s.

I noticed that HTMLStripStandardTokenizerFactory was deprecated, so I
changed the schema to:
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>

It still takes 70s.

If I instead use a schema without HTML stripping:
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>

It takes 30s in both 1.3 and 1.4.
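
For context, a minimal sketch of where the charFilter variant sits in
schema.xml. The field type name "text_html" and the positionIncrementGap
value are placeholders for illustration, not our actual settings; the
other two variants differ only in the lines inside <analyzer>.

```xml
<!-- Hypothetical field type; name and positionIncrementGap are placeholders -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw character stream, before the tokenizer -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>
```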

I am not sure whether HTMLStrip itself has become slower in 1.4 or
whether HTML stripping slows down something downstream in the analysis
chain in 1.4. Before I start writing a unit test with a TokenizerChain,
I wanted to check whether I am doing something fundamentally wrong.

Robin
