lucene-solr-user mailing list archives

From "Nick Snels" <>
Subject How to best index user-generated content
Date Wed, 20 Sep 2006 09:44:36 GMT

I want users to add content to my site using tinyMCE, which generates HTML.
When I tried adding the data to Solr, Solr refused to add it (or at least
generated an error):

SEVERE: org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG
or TEXT to read text (position: START_TAG seen ...<field name="text"><p>...
    at org.xmlpull.mxp1.MXParser.nextText(
    at org.apache.solr.core.SolrCore.readDoc(
    at org.apache.solr.core.SolrCore.update(
    at org.apache.solr.servlet.SolrUpdateServlet.doPost(
    at javax.servlet.http.HttpServlet.service(
    at javax.servlet.http.HttpServlet.service(
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(
    at org.apache.catalina.core.StandardWrapperValve.invoke(
    at org.apache.catalina.core.StandardContextValve.invoke(
    at org.apache.catalina.core.StandardHostValve.invoke(
    at org.apache.catalina.valves.ErrorReportValve.invoke(
    at org.apache.catalina.valves.RequestFilterValve.process(
    at org.apache.catalina.valves.RemoteAddrValve.invoke(
    at org.apache.catalina.core.StandardEngineValve.invoke(
    at org.apache.catalina.connector.CoyoteAdapter.service(
    at org.apache.coyote.http11.Http11Processor.process(
    at org.apache.tomcat.util.threads.ThreadPool$

So I searched the archives to resolve this issue, since I didn't want to
strip out the HTML entirely. The solution proved to be wrapping the HTML
text in a <![CDATA[ ... ]]> section, like so:

   <field name="text"><![CDATA[#{field.text}]]></field>
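
One caveat with this approach: a CDATA section ends at the first "]]>", so
user input that happens to contain that sequence would break the document
again. Below is a minimal sketch of a wrapper that guards against this,
assuming the update XML is built as a string in Python. The function name
cdata_field is hypothetical and not part of Solr or any client library.

```python
import xml.etree.ElementTree as ET


def cdata_field(name: str, text: str) -> str:
    """Build a <field> element whose body is wrapped in CDATA.

    Hypothetical helper for illustration. `name` is assumed to be a plain
    field name that needs no attribute escaping.
    """
    # "]]>" would terminate the CDATA section early, so split it across
    # two adjacent CDATA sections: "]]" closes the first, ">" opens the next.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return f'<field name="{name}"><![CDATA[{safe}]]></field>'


html = "<p>Tom & Jerry ]]> end</p>"
xml = cdata_field("text", html)

# Parsing the result back should recover the original string unchanged.
roundtrip = ET.fromstring(xml).text
```

An XML parser joins the two adjacent CDATA sections back into one run of
text, so the field value it reads is the string that was passed in.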

This also drew my attention to another problem: characters like < > & are
all 'invalid' characters between XML tags. Does that mean I have to put
<![CDATA[ ... ]]> around every field I want to index? I don't know and
can't control what my users will input. Is this the only solution, or is
there a way for Solr to handle these 'invalid' characters in the indexed
text by itself, without generating errors?
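
For comparison, the standard XML alternative to CDATA is to escape the
special characters on the client side before sending the document: & becomes
&amp;, < becomes &lt; and > becomes &gt;. Here is a minimal sketch using
Python's standard library, again assuming the update XML is built as a
string. The function name escaped_field is hypothetical, and the sketch says
nothing about whether Solr can do this on its own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def escaped_field(name: str, text: str) -> str:
    """Build a <field> element with entity-escaped text (hypothetical helper)."""
    # escape() replaces &, < and > with &amp;, &lt; and &gt;.
    # quoteattr() escapes the attribute value and adds the surrounding quotes.
    return f"<field name={quoteattr(name)}>{escape(text)}</field>"


user_input = "<p>5 < 6 & 7 > 2</p>"
xml = escaped_field("text", user_input)

# Any XML parser decodes the entities again, so the receiving side
# sees the original text.
parsed = ET.fromstring(xml).text
```

Either way the escaping applies only to the XML transport: once parsed, the
field value is the user's original text, HTML tags included.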

Kind regards,

