lucene-solr-user mailing list archives

From James Brady <james.colin.br...@gmail.com>
Subject IOException: Mark invalid while analyzing HTML
Date Sun, 04 May 2008 22:35:41 GMT
Hi,
I'm seeing the problem described in SOLR-42, "Highlighting problems with
HTMLStripWhitespaceTokenizerFactory":
https://issues.apache.org/jira/browse/SOLR-42

I'm indexing HTML documents and getting reams of "Mark invalid" IOExceptions:
SEVERE: java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(Unknown Source)
	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
	at java.io.Reader.read(Unknown Source)
	at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
	at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
	at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
	at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
	at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
	at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
	at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
	at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
	at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)
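
For context, the top frame is java.io.BufferedReader.reset(), which throws this
exception when more characters were read after mark() than the read-ahead limit
passed to mark() allowed. The sketch below reproduces the message in plain Java,
independent of Solr. The class and method names are made up for illustration, and
whether HTMLStripReader.restoreState() fails for exactly this reason is an
assumption, not something confirmed against the Solr source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class; not part of Solr or Lucene.
public class MarkInvalidDemo {

    /** Returns the message of the IOException thrown by reset(), or null if none is thrown. */
    static String resetAfterOverrun() throws IOException {
        // A tiny 4-char buffer makes the behaviour easy to trigger.
        BufferedReader reader = new BufferedReader(new StringReader("0123456789"), 4);

        // Promise to read at most 2 chars before calling reset().
        reader.mark(2);

        // Break the promise: reading 6 chars forces a buffer refill,
        // at which point the reader discards the mark.
        for (int i = 0; i < 6; i++) {
            reader.read();
        }

        try {
            reader.reset();
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(resetAfterOverrun());
    }
}
```

If the same mechanism is at work in the trace above, documents where the reader
has to look far ahead after marking (for example a very long tag or entity-like
run) would be the ones most likely to trigger it.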


This is with a roughly week-old build of Solr 1.3 from SVN.

One workaround mentioned in that Jira issue was to move HTML stripping  
outside of Solr; can anyone suggest a better approach than that?
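
To make that workaround concrete, here is a minimal sketch of stripping markup on
the client before the document is sent to Solr. It is a crude regex-based
approach, not a real HTML parser: the class name HtmlPreStripper and its strip()
method are hypothetical, character entities such as &amp; are left undecoded,
and malformed markup may slip through.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a rough sketch, not a full HTML parser.
public class HtmlPreStripper {

    // Whole <script>...</script> and <style>...</style> elements, contents included.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // HTML comments, which may span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Replaces markup with spaces, then collapses runs of whitespace. */
    static String strip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```

The stripped string would then go into a field analyzed with an ordinary
tokenizer, so HTMLStripReader is never involved. One consequence is that
highlighting offsets refer to the stripped text rather than the original HTML.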

Thanks
James

