lucene-solr-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1
Date Tue, 12 Apr 2011 03:54:40 GMT
The DIH supports multi-threading. You can have one thread fetch the
files and hand them off to other threads for processing.
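
A rough sketch of such a setup, assuming the "threads" entity attribute
available in Solr 3.1 and a directory of XML files: the outer entity lists
the files, and the inner entity parses each one. The entity names, baseDir,
fileName pattern, thread count, and forEach value are placeholders, not
taken from this thread.

```xml
<dataConfig>
  <!-- Default data source, used by the inner entity to read each file -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: lists the files. threads="4" is an assumed value;
         the attribute must sit on or above the root entity. -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/xml" fileName=".*\.xml"
            rootEntity="false" dataSource="null" threads="4">
      <!-- Inner entity: parses one file per row handed over by "files" -->
      <entity name="page" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/html">
        <field column="title" xpath="/html/body/h1"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The thread count would need tuning against your own data; I have not
measured it on a collection of this size.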

On Mon, Apr 11, 2011 at 11:40 AM,  <karsten-solr@gmx.de> wrote:
> Hi Lance,
>
> I used XPathEntityProcessor with the "xsl" attribute to generate an XML
> file "in the form of the standard Solr update schema".
> I lost a lot of performance; it is a pity that XPathEntityProcessor uses
> only one thread.
>
> My tests with a collection of 350T documents:
> 1. use of XPathRecordReader without xslt:  28min
> 2. use of XPathEntityProcessor with xslt (Standard solr-war / Xalan): 44min
> 3. use of XPathEntityProcessor with saxon-xslt: 36min
>
>
> Best regards
>  Karsten
>
>
>
> -------- Lance
>> There is an option somewhere to use the full XML DOM implementation
>> for using xpaths. The purpose of the XPathEP is to be as simple and
>> dumb as possible and handle most cases: RSS feeds and other open
>> standards.
>>
>> Search for xsl(optional)
>>
>> http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
>>
> ----------karsten
>> > Hi Folks,
>> >
>> > Has anyone improved the DIH XPathRecordReader to deal with nested xpaths?
>> > e.g.
>> > data-config.xml with
>> >  <entity .. processor="XPathEntityProcessor" ..
>> >  <field column="title" xpath="//body/h1"/>
>> >  <field column="alltext" xpath="//body" flatten="true"/>
>> > and an XML stream containing
>> >  /html/body/h1...
>> > will only fill the field “alltext”; the field “title” will be empty.
>> >
>> > This is a known issue from 2009
>> >
>> > https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
>> >
>> > So three questions:
>> > 1. How to fill a “search over all” field without nested xpaths?
>> >   (schema.xml  <copyField source="*" dest="alltext"/> will not help,
>> >   because we lose the original token order)
>> > 2. Has anyone tried to improve XPathRecordReader to deal with nested
>> >   xpaths?
>> > 3. Does anyone else need this feature?
>> >
>> >
>> > Best regards
>> >  Karsten
>> >
>
> http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html
>



-- 
Lance Norskog
goksron@gmail.com
