lucene-solr-user mailing list archives

From Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>
Subject Re: Specifying multiple documents in DataImportHandler dataConfig
Date Sun, 08 Nov 2009 07:57:20 GMT
On Sun, Nov 8, 2009 at 8:25 AM, Bertie Shen <bertie.shen@gmail.com> wrote:
> I have figured out a way to solve this problem: specify a single
> <document> ... </document> element and, under it, multiple top-level
> entities, each of which corresponds to one table.
>
> Each top-level entity then maps one row of its table to one document in the
> Lucene index. A <document> in DIH is *NOT* mapped to a document in the Lucene
> index, while a top-level entity is. I feel the <document> tag is redundant and
> misleading in the data config and thus should be removed.

Some common attributes are specified at the <document> level.
It still acts as a container tag.
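
As a minimal sketch of the layout discussed in this thread (one <document>
container holding one top-level <entity> per table), the configuration quoted
below could be restructured roughly as follows. The table, column, and field
names are taken from Rupert's message; the JDBC URL is elided there and is
left elided here, and the field mappings are assumed to match the target
schema.

```xml
<dataConfig>
  <!-- JDBC URL is elided in the original message and left as-is -->
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />

  <!-- A single <document> acts only as a container.
       Each top-level <entity> turns one row of its table into one Solr document. -->
  <document>
    <!-- BLOG ENTRIES -->
    <entity name="blog_entries"
            query="select id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type from blog_entries">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="data" name="text_t" />
    </entity>

    <!-- MESH CATEGORIES -->
    <entity name="mesh_categories"
            query="select id,name,node_key,name as name_fc,'MeshCategory' as type from mesh_categories">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="name" name="text_t" />
      <field column="node_key" name="string" />
      <field column="name_fc" name="facet_value" />
      <field column="type" name="type_t" />
    </entity>
  </document>
</dataConfig>
```

With this layout, /dataimport?command=full-import runs both entities, as
Rupert observed. DIH also accepts an entity request parameter (for example,
command=full-import&entity=mesh_categories), which should restrict the run to
a single top-level entity.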
>
> Cheers.
>
> On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <bertie.shen@gmail.com> wrote:
>
>> I have the same problem. I had thought we could specify multiple
>> <document> ... </document> elements, each mapping one table in the RDBMS.
>> But I found that was not the case: DIH only picks the first <document> to
>> index.
>>
>> I think Rupert's request and mine are pretty common. Basically, there are
>> multiple tables in the RDBMS, and we want each row in each table to become a
>> document in the Lucene index. How can we write one data-config.xml file to
>> let DataImportHandler import multiple tables at the same time?
>>
>> Rupert, have you figured out a way to do it?
>>
>> Thanks.
>>
>>
>>
>> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <rufiasco@gmail.com> wrote:
>>
>>> Maybe I should be clearer: I have multiple tables in my DB that I
>>> need to save to my Solr index. In my app code I have logic to persist
>>> each table, which maps to an application model, to Solr. This is fine.
>>> I am just trying to speed up indexing by using DIH instead of
>>> going through my application. From what I understand of DIH, I can
>>> specify one dataSource element and then a series of document/entity
>>> sets, one for each of my models. But like I said before, DIH only appears
>>> to want to index the first document declared under the dataSource tag.
>>>
>>> -Rupert
>>>
>>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco<rufiasco@gmail.com> wrote:
>>> > I am using the DataImportHandler with a JDBC datasource. From my
>>> > understanding of DIH, for each of my "content types" (e.g. blog posts,
>>> > Mesh categories, etc.) I would construct a series of document/entity
>>> > sets, like
>>> >
>>> > <dataConfig>
>>> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
>>> >
>>> >    <!-- BLOG ENTRIES -->
>>> >    <document name="blog_entries">
>>> >      <entity name="blog_entries" query="select
>>> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
>>> > from blog_entries">
>>> >        <field column="id" name="pk_i" />
>>> >        <field column="id" name="id" />
>>> >        <field column="title" name="text_t" />
>>> >        <field column="data" name="text_t" />
>>> >      </entity>
>>> >    </document>
>>> >
>>> >    <!-- MESH CATEGORIES -->
>>> >    <document name="mesh_category">
>>> >      <entity name="mesh_categories" query="select
>>> > id,name,node_key,name as name_fc,'MeshCategory' as type from
>>> > mesh_categories">
>>> >        <field column="id" name="pk_i" />
>>> >        <field column="id" name="id" />
>>> >        <field column="name" name="text_t" />
>>> >        <field column="node_key" name="string" />
>>> >        <field column="name_fc" name="facet_value" />
>>> >        <field column="type" name="type_t" />
>>> >      </entity>
>>> >    </document>
>>> > </dataConfig>
>>> >
>>> >
>>> > Solr parses this just fine and allows me to issue a
>>> > /dataimport?command=full-import and it runs, but it only runs against
>>> > the "first" document (blog_entries). It doesn't run against the 2nd
>>> > document (mesh_categories).
>>> >
>>> > If I remove the 2 document elements and wrap both entity sets in just
>>> > one document tag, then both sets get indexed, which seemingly achieves
>>> > my goal. This just doesn't make sense from my understanding of how DIH
>>> > works. My 2 content types are indeed separate so they logically
>>> > represent two document types, not one.
>>> >
>>> > Is this correct? What am I missing here?
>>> >
>>> > Thanks
>>> > -Rupert
>>> >
>>>
>>
>>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com
