lucene-solr-user mailing list archives

From Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>
Subject Re: TikaEntityProcessor not working?
Date Mon, 31 May 2010 09:59:10 GMT
BinFileDataSource will only work with files. Try FieldStreamDataSource.
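
A minimal sketch of what a FieldStreamDataSource setup might look like. It assumes the PDF bytes themselves are stored in a BLOB column, here given the hypothetical name `content`, rather than only a filename. The `ds-field` name and the `dataField` attribute are also assumptions: some DIH versions document `url` as the attribute naming the field, so check the javadoc for your build.

```xml
<dataConfig>
  <dataSource name="ds-db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/document_db" user="user" password="password"
              readOnly="true"/>
  <!-- Reads binary content from a field of the parent entity instead of from a file -->
  <dataSource name="ds-field" type="FieldStreamDataSource"/>
  <document name="documents">
    <!-- "content" is a hypothetical BLOB column holding the PDF bytes -->
    <entity name="document" dataSource="ds-db" query="select id, content from documents">
      <!-- dataField (assumed attribute name) points at parentEntity.column -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataSource="ds-field" dataField="document.content" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

If the table only holds a filename, as in the configuration quoted below, this sketch does not apply as written, because FieldStreamDataSource needs the document content itself to be available in a field.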

On Mon, May 31, 2010 at 3:30 AM, Brad Greenlee <brad@footle.org> wrote:

> Hi. I'm trying to get Solr to index a database in which one column is a
> filename of a PDF document I'd like to index. My configuration looks like
> this:
>
> <dataConfig>
>   <dataSource name="ds-db" driver="com.mysql.jdbc.Driver"
>               url="jdbc:mysql://localhost/document_db" user="user" password="password"
>               readOnly="true"/>
>   <dataSource name="ds-file" type="BinFileDataSource"/>
>   <document name="documents">
>     <entity name="document" dataSource="ds-db" query="select * from documents">
>       <entity processor="TikaEntityProcessor"
>               url="/some/path/${document.filename}" dataSource="ds-file" format="text">
>         <field column="text" />
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> I'm using Solr from trunk (as of two days ago). The import process
> completes without errors, and it picks up the columns from the database, but
> not the content from the PDF file. It is definitely trying to access the PDF
> file, for if I give it an incorrect path name, it complains. It doesn't seem
> to be attempting to index the PDF, though, as it completes in about 40ms,
> whereas if I import the PDF via the ExtractingRequestHandler, it takes about
> 11 seconds to index it.
>
> I've also tried the tika example in example-DIH and that doesn't seem to
> index anything, either. Am I doing something wrong, or is this just not
> working yet?
>
> Cheers,
>
> Brad
>
>


-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com
