lucene-solr-user mailing list archives

From karsten-s...@gmx.de
Subject Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1
Date Mon, 11 Apr 2011 18:40:23 GMT
Hi Lance,

I used XPathEntityProcessor with the attribute "xsl" and generated an XML file "in the form of the
standard Solr update schema".
I lost a lot of performance; it is a pity that XPathEntityProcessor uses only one thread.
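
A minimal sketch of the kind of data-config.xml this describes, assuming a file-based source. The entity name, the input path, and the stylesheet path are hypothetical placeholders, not values from my setup. The "xsl" and "useSolrAddSchema" attributes are the ones described on the DataImportHandler wiki page.

```xml
<dataConfig>
  <!-- Assumption: the documents are read from the local file system -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- xsl: stylesheet applied to each input file before the XPath reader runs.
         useSolrAddSchema: the transformed output is already in <add><doc> form,
         so no <field> mappings are listed here. -->
    <entity name="page"
            processor="XPathEntityProcessor"
            url="/path/to/input/page.xml"
            xsl="xslt/to-solr-add.xsl"
            useSolrAddSchema="true"/>
  </document>
</dataConfig>
```

With this layout the nested-xpath limitation moves out of XPathRecordReader and into the stylesheet, at the cost of the extra transformation time measured below.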

My tests with a collection of 350,000 documents:
1. use of XPathRecordReader without xslt: 28min
2. use of XPathEntityProcessor with xslt (standard solr-war / Xalan): 44min
3. use of XPathEntityProcessor with saxon-xslt: 36min


Best regards
  Karsten



-------- Lance 
> There is an option somewhere to use the full XML DOM implementation
> for using xpaths. The purpose of the XPathEP is to be as simple and
> dumb as possible and handle most cases: RSS feeds and other open
> standards.
> 
> Search for xsl(optional)
> 
> http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
> 
----------karsten
> > Hi Folks,
> >
> > has anyone improved the DIH XPathRecordReader to deal with nested xpaths?
> > e.g.
> > data-config.xml with
> >  <entity .. processor="XPathEntityProcessor" ..
> >  <field column="title" xpath="//body/h1"/>
> >  <field column="alltext" xpath="//body" flatten="true"/>
> > and the XML stream contains
> >  /html/body/h1...
> > will fill only the field “alltext”; the field “title” will be empty.
> >
> > This is a known issue from 2009:
> > https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
> >
> > So three questions:
> > 1. How can a “search over all” field be filled without nested xpaths?
> >   (schema.xml <copyField source="*" dest="alltext"/> will not help,
> >   because we lose the original token order)
> > 2. Has anyone tried to improve XPathRecordReader to deal with nested xpaths?
> > 3. Does anyone else need this feature?
> >
> >
> > Best regards
> >  Karsten
> >
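
For question 1 in the quoted message, a stylesheet along the following lines could fill both fields in one pass. This is only a sketch under assumptions: the input is taken to be un-namespaced /html/body markup (XHTML in the http://www.w3.org/1999/xhtml namespace would need a prefix binding), and the field names "title" and "alltext" are the ones from the quoted data-config. It relies on the XPath 1.0 rule that the string value of an element is the concatenation of its descendant text nodes in document order, so the original token order is kept.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Emit one Solr <doc> per input file, in the standard update format -->
  <xsl:template match="/">
    <add>
      <doc>
        <!-- First h1 under body; normalize-space trims and collapses whitespace -->
        <field name="title">
          <xsl:value-of select="normalize-space(/html/body/h1)"/>
        </field>
        <!-- All text below body, in document order (includes the h1 text) -->
        <field name="alltext">
          <xsl:value-of select="normalize-space(/html/body)"/>
        </field>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>
```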

http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html
