lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Resolved] (SOLR-7058) Data-driven schema needs to index large text fields as text and not as string
Date Thu, 29 Jan 2015 23:49:35 GMT

     [ https://issues.apache.org/jira/browse/SOLR-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl resolved SOLR-7058.
-------------------------------
    Resolution: Duplicate

Resolving as duplicate of SOLR-6966.

Again, I think this is a bad idea. It's hopeless to detect the difference; we need to define
a sane default and fix the out-of-the-box (OOTB) ability to also search all text. Once users
get past the basics, they'll start customizing the schema through the API.

> Data-driven schema needs to index large text fields as text and not as string
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-7058
>                 URL: https://issues.apache.org/jira/browse/SOLR-7058
>             Project: Solr
>          Issue Type: Improvement
>          Components: Data-driven Schema
>            Reporter: Timothy Potter
>
> While using the SimplePostTool to index some freebase articles into a core that uses
> our data-driven configs, I ran into the following gem:
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="xml_data" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34]...', original message: bytes can be at most 32766 in length; got 173684
> 	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
> 	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
> 	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
> 	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
> 	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
> 	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1415)
> 	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
> 	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
> {code}
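
The byte prefix quoted in the exception can be decoded to see what kind of value triggered it. The following is a minimal Python sketch that uses only the numbers quoted in the trace above; `MAX_TERM_BYTES` and `fits_in_one_term` are names made up for illustration and are not Lucene or Solr API.

```python
# Byte prefix copied from the exception message quoted above.
prefix = [60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110,
          61, 34, 49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110,
          103, 61, 34]

# Limit quoted in the exception message (assumed here as a plain constant).
MAX_TERM_BYTES = 32766


def fits_in_one_term(value: str) -> bool:
    """Hypothetical helper: True if the UTF-8 encoding fits within the limit."""
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES


# Decode the prefix to see how the offending value begins.
decoded = bytes(prefix).decode("utf-8")
print(decoded)  # prints: <?xml version="1.0" encoding="

# A value of the size reported in the trace does not fit in a single term.
print(fits_in_one_term("x" * 173684))  # prints: False
```

The prefix decodes to the start of an XML declaration, which suggests that `xml_data` held an entire XML document that was being indexed as one string term.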
> Ideally, the data-driven configs would index large text fields containing multiple tokens
> (whitespace delimited) as text and not as string. However, this obviously poses an issue if
> the first doc has a short text value that looks like a string and then the next doc has a
> large one. Not sure what the right solution looks like yet, but wanted to capture the issue
> so we can discuss options.
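
The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how Solr's data-driven schema is implemented: `guess_type` and `add_docs` are hypothetical helpers, the "more than one whitespace-delimited token" rule is an assumed heuristic, and the field type is assumed to be fixed by the first value seen.

```python
# Limit quoted in the exception message above.
MAX_TERM_BYTES = 32766


def guess_type(value: str) -> str:
    """Hypothetical heuristic: multi-token or oversized values are text."""
    too_long = len(value.encode("utf-8")) > MAX_TERM_BYTES
    multi_token = len(value.split()) > 1
    return "text" if too_long or multi_token else "string"


def add_docs(values):
    """Fix the field type from the first value, then check every value.

    Returns the chosen type and, per value, whether it could be indexed:
    a string field holds the whole value as one term, so it must fit
    within MAX_TERM_BYTES; a text field is tokenized, so it is accepted.
    """
    field_type = None
    outcomes = []
    for value in values:
        if field_type is None:
            field_type = guess_type(value)  # decided once, never revisited
        ok = (field_type == "text"
              or len(value.encode("utf-8")) <= MAX_TERM_BYTES)
        outcomes.append(ok)
    return field_type, outcomes


large = "word " * 40000  # 200000 bytes, well over the limit

# Short value first: the field is guessed as string, the large doc fails.
print(add_docs(["Solr", large]))   # prints: ('string', [True, False])

# Large value first: the field is guessed as text, both docs succeed.
print(add_docs([large, "Solr"]))   # prints: ('text', [True, True])
```

In this model the outcome depends entirely on which document arrives first, which is the difficulty the description raises.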



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


