lucene-solr-user mailing list archives

From "Feroze Daud" <fero...@zillow.com>
Subject RE: full-text indexing XML files
Date Fri, 11 Dec 2009 17:42:24 GMT
Yes, XML tags as well. Essentially, we want to full-text index the file
without stemming the tokens.

Will the Solr analyzer be able to tokenize the document correctly if it
contains no whitespace (besides what XML syntax requires)?

-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Thursday, December 10, 2009 8:00 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

What kind of searches do you want to do? Do you want to do searches that
match the XML tags?

wunder

On Dec 10, 2009, at 7:43 PM, Lance Norskog wrote:

> Or CDATA (much easier to work with).
> 
> On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar
> <shalinmangar@gmail.com> wrote:
>> On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud <ferozed@zillow.com> wrote:
>> 
>>> Hi!
>>> 
>>> 
>>> 
>>> I am trying to full-text index an XML file. For various reasons, I
>>> cannot use Tika or other technology to parse the XML file. The
>>> requirement is to full-text index the XML file, including tags and
>>> everything.
>>> 
>>> 
>>> 
>>> So, I created an input index spec like this:
>>> 
>>> 
>>> 
>>> <add>
>>> 
>>> <doc>
>>> 
>>> <field name="id">1001</field>
>>> 
>>> <field name="name">NASA Advanced Research Labs</field>
>>> 
>>> <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
>>> 
>>> <field name="content"><listing><id>1001</id><name>NASA Advanced
>>> Research Labs</name><address>1010 main street, chattanooga, FL
>>> 32212</address></listing></field>
>>> 
>>> </doc>
>>> 
>>> </add>
>>> 
>>> 
>>> 
>> You need to XML encode the value of the "content" field.
>> 
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com
> 
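
For reference, the two suggestions in this thread (XML-encode the "content"
value, or wrap it in CDATA) can be sketched in Python roughly as follows.
This is only a minimal illustration, not anything posted by the participants:
the field names and sample values come from the original message, while the
helper names build_add_doc and cdata_field are hypothetical. It only builds
the update message as a string and does not send anything to Solr.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Raw listing XML that should be indexed verbatim, tags included
# (sample values taken from the original post).
raw_listing = (
    "<listing><id>1001</id><name>NASA Advanced Research Labs</name>"
    "<address>1010 main street, chattanooga, FL 32212</address></listing>"
)


def build_add_doc(fields):
    """Hypothetical helper: build an <add><doc> message as a string.

    ElementTree escapes &, < and > in element text, so the raw XML in
    "content" becomes character data instead of nested elements.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


def cdata_field(name, value):
    """Hypothetical helper: the CDATA alternative for a single field.

    A CDATA section cannot contain the terminator "]]>", so reject it.
    """
    if "]]>" in value:
        raise ValueError('value contains "]]>" and cannot go in one CDATA section')
    return "<field name=%s><![CDATA[%s]]></field>" % (quoteattr(name), value)


update_xml = build_add_doc({
    "id": "1001",
    "name": "NASA Advanced Research Labs",
    "address": "1010 Main Street, Chattanooga, FL 32212",
    "content": raw_listing,
})
print(update_xml)
print(cdata_field("content", raw_listing))
```

Either way, an XML parser reading the message sees the "content" field as
plain text equal to the original listing markup, rather than as child
elements of the field.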

