lucene-solr-user mailing list archives

From David Neubert <devmecr...@yahoo.com>
Subject Re: AW: What is the best way to index xml data preserving the mark up?
Date Thu, 08 Nov 2007 19:32:02 GMT
Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but I know what I
need -- and basically it's to search by the main granules in an XML document, which usually
turn out to be, for books: book (rarely), chapter (more often), paragraph (often), sentence
(often).  Then there are niceties like chapter titles, headings, etc., but I can live without
those -- it seems like if you can exploit the text nodes of arbitrary XML you are looking
good; if not, you've got a lot of machination in front of you.

Seems like Lucene/SOLR is geared to take record-oriented and non-XML content and put it into
XML format for ingest -- but really can't digest XML content itself at all without significant
setup and constraints.  I am surprised -- but I could really use it for my project big time.
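
To make the "put it into XML format for ingest" step concrete, here is a minimal sketch (plain Python, not anything shipped with Solr) that flattens book-style XML into a Solr <add> update message with one <doc> per paragraph. The element names (book, chapter, para), the field names (id, text, sentence) and the naive sentence splitter are all assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical book markup; element names are assumptions for illustration.
BOOK_XML = """<book>
  <chapter title="One">
    <para>Foo lives here. Bar lives elsewhere.</para>
    <para>Foo and bar meet at last.</para>
  </chapter>
</book>"""


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def book_to_solr_add(book_xml):
    """Build a Solr <add> message with one <doc> per paragraph."""
    root = ET.fromstring(book_xml)
    add = ET.Element("add")
    for c, chapter in enumerate(root.iter("chapter"), start=1):
        for p, para in enumerate(chapter.iter("para"), start=1):
            # Collect all text nodes under the paragraph, ignoring markup.
            text = "".join(para.itertext()).strip()
            doc = ET.SubElement(add, "doc")
            fields = [("id", f"ch{c}-p{p}"), ("text", text)]
            # One "sentence" field value per sentence (multi-valued field).
            fields += [("sentence", s) for s in split_sentences(text)]
            for name, value in fields:
                ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(book_to_solr_add(BOOK_XML))
```

The schema would still need matching field definitions, with sentence declared multi-valued.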

Another related problem I am having (which I will probably repost separately) is boolean searches
across fields with multiple values.  At this point, because of my workarounds for Lucene,
I am indexing paragraphs as single documents with multiple fields, thinking
I could copy the sentences to text.  That way, I can search field text (for the paragraph)
-- and search field sentence -- for sentence granularity.  The problem is that a search for
sentence:foo AND sentence:bar matches if foo matches in any sentence of the paragraph
and bar also matches in any sentence of the paragraph.  I need it to match only if foo and
bar are found in the same sentence.  If this can't be done, it looks like I will have to index paragraphs
as documents, and redundantly index sentences as unique documents.  Again, I will post this
question separately immediately.
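
To illustrate the difference, here is a toy model in plain Python (it does not call Lucene or Solr; the function names are made up for this sketch). It assumes, as observed above, that all values of a multi-valued field end up in one pool of terms for the document, and contrasts that with treating each sentence as its own unit, which is what indexing sentences as separate documents would give.

```python
import re


def tokens(text):
    """Lowercased word set for a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def paragraph_matches(sentences, terms):
    """Multi-valued field behaviour: all sentences share one pool of terms."""
    pool = set().union(*(tokens(s) for s in sentences))
    return all(t in pool for t in terms)


def sentence_matches(sentences, terms):
    """Wanted behaviour: all terms must occur in one and the same sentence."""
    return [s for s in sentences if all(t in tokens(s) for t in terms)]


paragraph = ["Foo lives here.", "Bar lives elsewhere."]
print(paragraph_matches(paragraph, ["foo", "bar"]))  # True: terms are in different sentences
print(sentence_matches(paragraph, ["foo", "bar"]))   # []: no single sentence has both
```

With sentences indexed as separate documents, sentence:foo AND sentence:bar would behave like sentence_matches, at the cost of storing each sentence twice.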

Thanks,

Dave


----- Original Message ----
From: Chris Hostetter <hossman_lucene@fucit.org>
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 1:19:40 PM
Subject: Re: AW: What is the best way to index xml data preserving the mark up?



: Thanks -- C-Data might be useful -- and I was looking into dynamic 
: fields as solution as well -- I think a combination of the two might 
: work.

I must admit i haven't been following this thread that closely, so i'm not
sure how much of the "structure" of the XML you want to preserve for the
purposes of querying, or if it's just an issue of wanting to store the raw
XML, but on the broader topic of indexing/searching arbitrary XML, i'd
like to throw out a few misc ideas i've had in the past that you might
want to run with...

1) there's a Jira issue i opened a while back with a rough patch for
applying user specific XSLTs on the server to transform arbitrary XML
into the Solr XML update format (i don't have the issue number handy, and
my browser is in the throes of death at the moment).  this might solve the
"i want to send solr XML in my own schema, and i want to be able to tell
it how to pull out various pieces to use as field values" problem.

2) I was once toying with the idea of an XPathTokenizer.  it would parse
the fieldValues as XML, then apply arbitrary configured XPath expressions
against the DOM and use the resulting NodeList to produce the TokenStream.
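
The XPathTokenizer idea in (2) could be prototyped outside Lucene roughly as follows. This is only a sketch of the concept in plain Python, not Lucene's Tokenizer/TokenStream API: xpath_tokens is a hypothetical helper, the sample markup is invented, and the standard library's ElementTree supports only a limited subset of XPath.

```python
import re
import xml.etree.ElementTree as ET


def xpath_tokens(field_value, paths):
    """Parse field_value as XML and yield (path, token) for each matched node."""
    root = ET.fromstring(field_value)
    for path in paths:
        for node in root.findall(path):
            # Use all text beneath the matched node, ignoring nested markup.
            text = "".join(node.itertext())
            for word in re.findall(r"\w+", text.lower()):
                yield path, word


if __name__ == "__main__":
    xml = "<book><chapter><title>Intro</title><para>Foo bar.</para></chapter></book>"
    for path, word in xpath_tokens(xml, [".//title", ".//para"]):
        print(path, word)
```

Tagging each token with the expression that produced it is one way to keep the granule (title, paragraph, and so on) available at query time; a real tokenizer would also have to track positions and offsets.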





-Hoss





