lucene-dev mailing list archives

From "Kingston Duffie (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-6097) Posting JSON with < > results in lost information
Date Tue, 20 May 2014 16:10:40 GMT
Kingston Duffie created SOLR-6097:
-------------------------------------

             Summary: Posting JSON with < > results in lost information
                 Key: SOLR-6097
                 URL: https://issues.apache.org/jira/browse/SOLR-6097
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.7.2
            Reporter: Kingston Duffie


Post the following JSON to add a document:

{ 
    "add" : 
       { 
           "commitWithin" : 5000,
           "doc" : 
               {  
                   "id" : "12345",
                   "body" : "a < b > c"
               }
        }
}
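
For anyone reproducing this, the request above could be built roughly as in the following Python sketch. The endpoint URL, port, and core name (collection1) are assumptions and do not come from this report; the helper names are made up for illustration. The sketch also shows that a standard JSON encoder leaves < and > untouched, so the payload reaches the server unescaped.

```python
import json
import urllib.request

# Assumed endpoint: host, port, and core name are placeholders, not from the report.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_add_payload(doc_id, body, commit_within_ms=5000):
    """Return the JSON text for an 'add' command shaped like the one above."""
    command = {
        "add": {
            "commitWithin": commit_within_ms,
            "doc": {"id": doc_id, "body": body},
        }
    }
    return json.dumps(command)


def build_update_request(payload, url=SOLR_UPDATE_URL):
    """Build (but do not send) the POST request that carries the payload."""
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_add_payload("12345", "a < b > c")
request = build_update_request(payload)

# The angle brackets appear literally in the serialized JSON.
print(payload)

# To send it to a running Solr instance (not done here):
#     urllib.request.urlopen(request)
```

The request is only constructed, not sent, so the snippet can be run without a Solr server.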

The body field is configured in the schema as:

   <field name="body" type="text_hive" indexed="true" stored="true" required="false" multiValued="false"/>

and

    <fieldType name="text_hive" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


The problem is this: after submitting this post, if you go to the Solr console and find this
document, the stored body is missing the content between the less-than and greater-than
symbols, i.e., it reads "a c".

If you encode the body (i.e., "a &lt; b &gt; c"), it shows up with the < and >
symbols intact. That is, Solr appears to be stripping HTML tags even though we are not
asking it to.
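
To make the symptom concrete, here is a small Python sketch that imitates it. This is not Solr code and makes no claim about where in Solr the stripping happens; the function strip_tag_like_spans is a hypothetical stand-in that simply drops anything between "<" and ">". It shows why the entity-encoded form survives such a step while the raw form does not.

```python
import html
import re


def strip_tag_like_spans(text):
    """Imitate the observed symptom: drop '<...>' spans, then tidy whitespace.

    Hypothetical stand-in for illustration only, NOT Solr's implementation.
    """
    without_spans = re.sub(r"<[^>]*>", "", text)
    return " ".join(without_spans.split())


raw_body = "a < b > c"
# Entity-encode only the markup characters, as in the workaround above.
encoded_body = html.escape(raw_body, quote=False)

print(strip_tag_like_spans(raw_body))      # the "b" is lost
print(strip_tag_like_spans(encoded_body))  # nothing looks like a tag, so nothing is lost
print(html.unescape(encoded_body))         # decoding gives back the original text
```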

Note that indexing is affected as well as storage: we originally found the issue because
searching for "b" did not match this document.
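
The search-side check could be scripted along these lines. As before, this is a sketch under assumptions: the select URL, port, and core name are placeholders that do not appear in this report, and the helper names are invented for illustration. It builds the query URL for body:b and reads numFound from a JSON response.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: host, port, and core name are placeholders, not from the report.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_select_url(field, term, base_url=SOLR_SELECT_URL):
    """Return a select URL that queries one field for one term, JSON output."""
    params = {"q": f"{field}:{term}", "wt": "json"}
    return base_url + "?" + urllib.parse.urlencode(params)


def num_found(response_text):
    """Extract the hit count from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


url = build_select_url("body", "b")
print(url)

# Against a running Solr instance (not done here):
#     with urllib.request.urlopen(url) as resp:
#         hits = num_found(resp.read().decode("utf-8"))
# A hit count of 0 for the document above would match the behaviour reported.
```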

I'm willing to believe that I'm doing something wrong, but I can't see anywhere in any spec
that suggests that strings inside JSON need to be 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
