lucene-solr-user mailing list archives

From "Thung, Peter C CIV SPAWARSYSCEN-PACIFIC, 56340" <peter.th...@navy.mil>
Subject Question on modifying solr behavior on indexing xml files..
Date Thu, 01 Oct 2009 09:40:55 GMT
1.  While experimenting with
sending in an XML document inside an XML CDATA section,
with termVectors="true",
 
I noticed the following behavior:
<person>peter</person>
collapses to the term
personpeterperson
instead of
person
and 
peter separately.
 
I realize I could search and replace characters like
<>"= with spaces so that the default parser/indexer preserves element
names.
However, I'm wondering if someone could point me to where one might do
this within
the Solr or Apache Lucene code as a proper plug-in, with maybe an example
that I could use
as a template.  Also, where in the solrconfig.xml file would I need to
make a change to reference the new parser?
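
For reference, the search-and-replace workaround mentioned above can be sketched in plain Java. This is only an illustration of the character replacement itself: the class and method names (XmlMarkupSpacer, spaceOutMarkup, terms) are made up for this example, it does not use any Solr or Lucene API, and including "/" in the replaced set is my own assumption so that closing tags split cleanly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
class XmlMarkupSpacer {

    // Replace XML markup characters with spaces so that a
    // whitespace-based tokenizer sees element names as separate terms.
    // The character set < > " = comes from the message; "/" is an added assumption.
    static String spaceOutMarkup(String xml) {
        return xml.replaceAll("[<>\"=/]", " ");
    }

    // Split the spaced-out text on runs of whitespace, dropping empty pieces.
    static List<String> terms(String xml) {
        List<String> out = new ArrayList<>();
        for (String t : spaceOutMarkup(xml).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [person, peter, person]
        System.out.println(terms("<person>peter</person>"));
    }
}
```

With this, the element name and its text come out as separate terms, so "person" is counted twice (opening and closing tag) and "peter" once. Doing the same thing inside the analysis chain rather than before sending the document is the part I am asking about.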
 
2.  My other question is whether this technique would also work for XML-type
messages embedded
in Microsoft Excel or PowerPoint presentations, where I would like to
preserve knowledge of the XML element term frequencies
and would try to leverage the component that automatically indexes
Microsoft documents.
Would I need to modify that component and customize it?
 
-Peter
 
 

