lucene-solr-user mailing list archives

From Sascha Szott <>
Subject Re: Building documents using content residing both in database tables and text files
Date Tue, 11 Aug 2009 17:35:30 GMT
Hi Noble,

Noble Paul wrote:
> isn't it possible to do this by having two datasources (one Jdbc and
> another File) and two entities. The outer entity can read from a DB
> and the inner entity can read from a file.
Yes, it is. Here's my db-data-config.xml file:

<!-- definition of data sources -->
<dataSource name="ds.database"
             password="..." />
<dataSource name="ds.filesystem"
             type="FileDataSource" />

<!-- building the document using both db and file content
      (files are stored in /tmp/<recordId>) -->
<document name="doc">
   <entity name="t" query="select * from t" dataSource="ds.database">
     <field column="id" name="id" />
     <field column="title" name="title" />
     <entity name="dir"
             rootEntity="false" >
       <entity name="file"
               stream="false" >
         <field column="text" xpath="/root" />
       </entity>
     </entity>
   </entity>
</document>
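
Several attributes in the listing above did not survive in the archived copy of this message. As a rough sketch of what a complete configuration of this shape might look like, the following fills the gaps with assumed values: the driver, url, and user of the JDBC data source, the choice of FileListEntityProcessor and XPathEntityProcessor, and the baseDir, fileName, url, and forEach attributes are all guesses for illustration and are not taken from the original message.

```xml
<dataConfig>
  <!-- definition of data sources
       (driver, url and user are placeholders, not from the original message) -->
  <dataSource name="ds.database"
              type="JdbcDataSource"
              driver="org.example.Driver"
              url="jdbc:example://localhost/mydb"
              user="..."
              password="..." />
  <dataSource name="ds.filesystem"
              type="FileDataSource" />

  <document name="doc">
    <!-- outer entity: one row per record in table t -->
    <entity name="t" query="select * from t" dataSource="ds.database">
      <field column="id" name="id" />
      <field column="title" name="title" />
      <!-- assumed: list every file in /tmp/<id of the current row> -->
      <entity name="dir"
              processor="FileListEntityProcessor"
              baseDir="/tmp/${t.id}"
              fileName=".*"
              dataSource="null"
              rootEntity="false">
        <!-- assumed: read each listed file as XML and take the text of /root -->
        <entity name="file"
                processor="XPathEntityProcessor"
                dataSource="ds.filesystem"
                url="${dir.fileAbsolutePath}"
                forEach="/root"
                stream="false">
          <field column="text" xpath="/root" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

If a record has more than one file, the text field would presumably need to be multiValued in schema.xml. Depending on the database, the column name in ${t.id} may also have to match the case the JDBC driver returns.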

Only one additional adjustment has to be made: since I'm using Solr 1.3,
which comes without PlainTextEntityProcessor, I have to transform my
plain text files into XML files by surrounding the content with a root
element. That's all!
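
For illustration, a wrapped file could look like the sketch below. The element name root matches the xpath="/root" used in the configuration. The XML declaration and the CDATA section are assumptions of this example (the message does not say how the content was escaped). CDATA is used here only so that characters such as & and < in the extracted text do not have to be escaped.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical wrapped file, e.g. one of the text files in /tmp/101 -->
<root><![CDATA[
Plain text extracted from the PDF goes here.
]]></root>
```

Text that itself contains the sequence ]]> would need different handling, for example regular entity escaping instead of CDATA.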

> On Tue, Aug 11, 2009 at 8:05 PM, Sascha Szott<> wrote:
>> Hello,
>> is it possible (and if it is, how can I accomplish it) to configure DIH to
>> build up index documents by using content that resides in different data
>> sources?
>> Here is an example scenario:
>> Let's assume we have a table T with two columns, ID (which is the primary
>> key of T) and TITLE. Furthermore, each record in T is assigned a directory
>> containing text files that were generated out of pdf documents by using
>> Tika. A directory name is built by using the ID of the record in T
>> associated with that directory, e.g. all text files associated with a record
>> with id = 101 are stored in directory 101.
>> Is there a way to configure DIH such that it uses ID, TITLE and the content
>> of all related text files when building a document (the documents should
>> have three fields: id, title, and text)?
>> Furthermore, as you may have noticed, a second question arises naturally:
>> Will there be any integration of Solr Cell and DIH in an upcoming release,
>> so that it would be possible to directly use the pdf documents instead of
>> the extracted text files that were generated outside of Solr?
> This is something I wish to see. But there has been no user request
> yet. You can raise an issue and it can be looked into.
I've raised issue SOLR-1358.

