lucene-solr-user mailing list archives

From Adam Foltzer <acfolt...@gmail.com>
Subject Re: Problems with DIH XPath flatten
Date Wed, 07 Oct 2009 13:24:56 GMT
Here's a sample:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document [
<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
<!ENTITY reg "&#174;">
]>
<document>
  <kbml version="-//Indiana University//DTD KBML 0.9//EN">
    <kbq>In Mac OS X, how do I enable or disable the firewall?</kbq>
    <body>
<p><kbh docid="aghe" access="allowed">Mac OS
X<domain>all</domain><visibility>visible</visibility></kbh>
includes
an easy-to-use <kbh docid="aoru"
access="allowed">firewall<domain>all</domain><visibility>visible</visibility></kbh>
that
can prevent potentially harmful incoming connections from other
computers. To turn it on or off:</p>


<h3>Mac OS X 10.6 (Snow Leopard)</h3>

<ol><li>From the Apple menu, select <mi>System Preferences...†</mi>.
When the <code>System Preferences</code> window appears, from the
<mi>View</mi> menu, select <mi>Security</mi>.

<br clear="none"/><br clear="none"/>
</li><li>Click the <mi>Firewall</mi> tab.

...

</li></ol>
</body>
    <xtra>
      <term weight="0">macos</term>
      <term weight="0">macintosh</term>
      <term weight="0">apple</term>
      <term weight="0">macosx</term>

...

    </xtra>
  </kbml>
  <metadata>
    <docid>aozg</docid>
    <owner firstname="" lastname="Macintosh Support">scmac</owner>

...

  </metadata>
</document>

The /document/kbml/kbq XPath works fine, but as you can see, that element has
no children. The actual content of the document is inside the body element,
which has nested markup and therefore needs flattening.
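
To make the expected result concrete, here is a minimal Python sketch of what
"flattening" the body would produce. It is not DIH code and says nothing about
how XPathEntityProcessor works internally; it only concatenates the text of an
element and all of its descendants. The SAMPLE string is a trimmed-down,
hypothetical stand-in for the document above, and flatten_text is a name made
up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the sample document in this message.
SAMPLE = """<document>
  <kbml>
    <kbq>In Mac OS X, how do I enable or disable the firewall?</kbq>
    <body>
<p><kbh docid="aghe" access="allowed">Mac OS
X<domain>all</domain><visibility>visible</visibility></kbh>
includes an easy-to-use firewall.</p>
<h3>Mac OS X 10.6 (Snow Leopard)</h3>
<ol><li>Click the <mi>Firewall</mi> tab.</li></ol>
</body>
  </kbml>
</document>"""


def flatten_text(element):
    """Return the text of an element and all its descendants as one string."""
    # itertext() walks the subtree in document order and yields every text
    # node, including the text that follows each nested child element.
    joined = "".join(element.itertext())
    # Collapse runs of whitespace (including line breaks) into single spaces.
    return " ".join(joined.split())


root = ET.fromstring(SAMPLE)
title = flatten_text(root.find("./kbml/kbq"))
content = flatten_text(root.find("./kbml/body"))

print(title)
print(content)
```

Note that a plain concatenation like this also pulls in the text of the domain
and visibility elements nested inside kbh, so a real import would probably
need to filter those out separately.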

Thanks for your time,
Adam

2009/10/6 Noble Paul നോബിള്‍  नोब्ळ् <noble.paul@corp.aol.com>:
> send a small sample xml snippet you are trying to index and it may help
>
> On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer <acfoltzer@gmail.com> wrote:
>> Hi all,
>>
>> I'm trying to set up DataImportHandler to index some XML documents available
>> over web services. The XML includes both content and metadata, so for the
>> indexable content, I'm trying to just index everything under the content
>> tag:
>>
>> <entity dataSource="kbws" name="kbxml" pk="title"
>>         url="resturl" processor="XPathEntityProcessor"
>>         forEach="/document" transformer="HTMLStripTransformer"
>>         flatten="true">
>>   <field column="content" name="content" xpath="/document/kbml/body"
>>          flatten="true" stripHTML="true" />
>>   <field column="title" name="title" xpath="/document/kbml/kbq" />
>> </entity>
>>
>> The result of this is that the title field gets populated and indexed (there
>> are no child nodes of /document/kbml/kbq), but content does not get indexed
>> at all. Since /document/kbml/body has many children, I expected that
>> flatten="true" would store all of the body text in the field. Instead, it
>> stores nothing at all. I've tried this with many combinations of
>> transformers and flatten options, and the result is the same each time.
>>
>> Here are the relevant field declarations from the schema (the type="text" is
>> just the one from the example's schema.xml). I have tried combinations here
>> as well of stored= and multiValued=, with the same result each time.
>>
>> <field name="title" type="text" indexed="true" stored="true"
>>        multiValued="true" />
>> <field name="content" type="text" indexed="true" stored="true"
>>        multiValued="true" />
>>
>> If it would help troubleshooting, I could send along some sample XML. I
>> don't want to spam the list with an attachment unless it's necessary, though
>> :)
>>
>> Thanks in advance for your help,
>>
>> Adam Foltzer
>>
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>
