lucene-solr-user mailing list archives

From Arunas Spurga <arunas2...@gmail.com>
Subject Re: Indexing PDF files in SqlBase database
Date Wed, 03 Apr 2019 18:22:27 GMT
Yes, I understand the reasons for putting this work on a client rather than
using Solr directly, and that may well be my next task.
But first I need to finish my current task: indexing PDF files stored in a
SqlBase database. The PDF files are pretty simple, sometimes only dozens of
lines of text.

Regards,

Aruna

On Wed, Apr 3, 2019 at 5:03 PM Erick Erickson <erickerickson@gmail.com>
wrote:

> For a lot of reasons, I greatly prefer to put this work on a client rather
> than use Solr directly. Here’s a place to get started: it connects to a DB
> and also scans a local file directory for docs to push through (local) Tika
> and index. So you should be able to modify it relatively easily to get the
> data from SqlBase, read the associated PDF, combine the two and send them
> to Solr.
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> The code itself is a bit old, but illustrates the process.
>
> Best,
> Erick
>
> > On Apr 2, 2019, at 11:46 PM, Arunas Spurga <arunas2801@gmail.com> wrote:
> >
> > Hello,
> >
> > I have a task to index PDF files stored in a SqlBase database in
> > Solr 7.71. I have done half the job: I can index all the table fields
> > and search them, except for the field that holds the PDF file content.
> > As I am totally new to Solr, I have spent a lot of time unsuccessfully
> > trying to understand how to get Solr to extract and index the field with
> > the PDF content. I need some help.
> >
> > Regards,
> >
> > Aruna
> >
> > In solrconfig.xml I have:
> >
> >   <lib dir="${solr.install.dir:../../../..}/contrib/dataimporthandler/lib" regex=".*\.jar" />
> >   <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
> >   <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
> >   <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
> >
> >   <requestHandler name="/update/extract"
> >                   startup="lazy"
> >                   class="solr.extraction.ExtractingRequestHandler">
> >     <lst name="defaults">
> >       <str name="lowernames">true</str>
> >       <str name="fmap.meta">ignored_</str>
> >       <str name="fmap.content">_text_</str>
> >     </lst>
> >   </requestHandler>
> >
> >   <requestHandler name="/dataimport"
> >                   class="org.apache.solr.handler.dataimport.DataImportHandler">
> >     <lst name="defaults">
> >       <str name="config">db-data-config.xml</str>
> >     </lst>
> >   </requestHandler>
> >
> > db-data-config.xml:
> >
> >   <dataConfig>
> >     <dataSource type="JdbcDataSource"
> >                 driver="jdbc.unify.sqlbase.SqlbaseDriver"
> >                 url="jdbc:sqlbase://localhost:2155/PDFDOCS"
> >                 user="sysadm"
> >                 password="sysadm" />
> >     <document>
> >       <entity name="PDFDOCUMENTS" query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
> >         <field column="ID" name="idx" />
> >         <field column="PDOCUMENT" name="PDF" />
> >         <field column="UNIT" name="division" />
> >       </entity>
> >     </document>
> >   </dataConfig>
>
>
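
For reference, the in-Solr route asked about in the original message is
usually attempted with the DataImportHandler's FieldStreamDataSource and
TikaEntityProcessor: a nested entity reads the binary column of the parent
row and passes it through Tika. The following is only an untested sketch of
db-data-config.xml under several assumptions: PDOCUMENT is a BLOB (or byte
array) column as returned by the SqlBase JDBC driver, the
solr-dataimporthandler-extras jar is loaded (the dist regex in the quoted
solrconfig.xml should match it), and the schema defines the idx, division and
PDF fields. The data source names "db" and "pdfStream" and the inner entity
name "pdf" are made up for illustration.

```xml
<dataConfig>
  <!-- Same JDBC source as in the quoted config; the name "db" is hypothetical -->
  <dataSource name="db"
              type="JdbcDataSource"
              driver="jdbc.unify.sqlbase.SqlbaseDriver"
              url="jdbc:sqlbase://localhost:2155/PDFDOCS"
              user="sysadm"
              password="sysadm" />
  <!-- Exposes a column of the parent row as a stream; the name "pdfStream" is hypothetical -->
  <dataSource name="pdfStream" type="FieldStreamDataSource" />
  <document>
    <entity name="PDFDOCUMENTS" dataSource="db"
            query="select ID, PDOCUMENT, UNIT from SYSADM.DOCS">
      <field column="ID" name="idx" />
      <field column="UNIT" name="division" />
      <!-- Assumption: PDOCUMENT holds the raw PDF bytes.
           dataField uses the form parentEntityName.COLUMN -->
      <entity name="pdf" dataSource="pdfStream"
              processor="TikaEntityProcessor"
              dataField="PDFDOCUMENTS.PDOCUMENT"
              format="text"
              onError="skip">
        <!-- "text" is the column in which TikaEntityProcessor returns the extracted body -->
        <field column="text" name="PDF" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

In this sketch the PDOCUMENT column is no longer mapped directly to the PDF
field; the inner entity fills PDF with the extracted text instead. Whether the
SqlBase driver hands the column over in a form FieldStreamDataSource accepts
would need to be checked against a real database.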
