lucene-solr-user mailing list archives

From Gora Mohanty <g...@mimirtech.com>
Subject Re: Data Import Handler and Extract Handler
Date Thu, 27 Jun 2013 13:58:20 GMT
On 27 June 2013 13:42, Venter, Scott <Scott.Venter@rmb.co.za> wrote:
> Hi all,
>
> I am new to SOLR. I have been working through the SOLR 4 Cookbook
> and my experiences so far have been great.
>
> I have worked through the extraction of PDF data recipe, and the
> Data import recipe. I would now like to join these two things, i.e.
> I would like to do a data import from a Database table of users, and
> then somehow associate indexed PDF data with rows that were imported.
>
> I have a conceptual link between rows in the database and pdf
> documents, but I don't know how to make a physical link between the
> two in SOLR. For example, I know that user x has pdf documents a, b
> and c.
>
> If I have imported my users into SOLR using Data Import Handler, how
> would I
>
> 1) import and associate the pdf documents using the extract
> mechanism, in such a way that there is a link between user x and the
> 3 pdf documents as described above?
[...]

Where are your PDF documents? Presumably on the filesystem
or available from a web service. What you can do is define
two datasources in your DIH configuration file, and then use
a nested entity to tie them together:
* The first datasource is a JdbcDataSource that extracts data
  from the database. Presumably, you already have this working.
* The second is a BinFileDataSource, assuming that your
  PDF files are on the filesystem.
* In the top-level entity, select the user and the names of the
  associated PDF files.
* Inside it, use a nested entity with the "dataSource" attribute
  set to the BinFileDataSource, and use the TikaEntityProcessor
  to index the PDF files. The documentation on this is a little
  scattered, but see:
  http://wiki.apache.org/solr/TikaEntityProcessor
  http://lucene.472066.n3.nabble.com/problem-to-indexing-pdf-directory-td3749554.html
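
As a rough, untested sketch of how these pieces might fit together
in the DIH configuration file: the JDBC driver, connection URL,
credentials, table and column names (users, user_pdfs, pdf_path),
and the Solr field names below are all made up for illustration,
and the CONCAT call assumes MySQL, so adjust everything to your
own database and schema.

```xml
<dataConfig>
  <!-- First datasource: the database holding the users (hypothetical connection details) -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"/>

  <!-- Second datasource: reads the PDF files from the filesystem as binary streams -->
  <dataSource name="pdfs" type="BinFileDataSource"/>

  <document>
    <!-- Top-level entity: one row per user/PDF pair (hypothetical tables and columns) -->
    <entity name="user_pdf" dataSource="db"
            query="SELECT CONCAT(u.id, '-', p.id) AS doc_id,
                          u.id AS user_id, u.name AS user_name, p.pdf_path
                   FROM users u JOIN user_pdfs p ON p.user_id = u.id">
      <field column="doc_id" name="id"/>
      <field column="user_id" name="user_id"/>
      <field column="user_name" name="user_name"/>

      <!-- Nested entity: Tika extracts the text of the file named in the current row -->
      <entity name="pdf" dataSource="pdfs"
              processor="TikaEntityProcessor"
              url="${user_pdf.pdf_path}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With a layout like this, each PDF would end up as its own Solr
document that also carries the ID and name of the user it belongs
to, so querying on user_id should return that user's PDFs. The
pdf_path values are assumed to be absolute paths readable by the
Solr process, and the target fields (id, user_id, user_name,
content) would need to exist in your schema.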

Regards,
Gora
