lucene-solr-user mailing list archives

From Tod <listac...@gmail.com>
Subject Re: Data Import Handler Rich Format Documents
Date Tue, 22 Jun 2010 16:29:15 GMT
On 6/18/2010 2:42 PM, Chris Hostetter wrote:
> : > I don't think DIH can do that, but who knows, let's see what others say.
> 
> : Looks like the ExtractingRequestHandler uses Tika as well.  I might just use
> : this but I'm wondering if there will be a large performance difference between
> : using it to batch content in over rolling my own Transformer?
> 
> I'm confused ... You're using DIH, and some of your fields are URLs to 
> documents that you want to parse with Tika?
> 
> Why would you need a custom Transformer?
> 
> http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
> http://wiki.apache.org/solr/TikaEntityProcessor
> 
> -Hoss

Ok, I'm trying to integrate the TikaEntityProcessor as suggested.  I'm 
using Solr 1.4.0 and getting the following error:

java.lang.ClassNotFoundException: Unable to load BinURLDataSource or 
org.apache.solr.handler.dataimport.BinURLDataSource

curl -s http://test.html | curl \
  'http://localhost:9080/solr/update/extract?extractOnly=true' \
  --data-binary @- -H 'Content-type:text/html'

... works fine, so presumably Tika extraction itself is working.


My data-config.xml looks like this:

<dataConfig>
   <dataSource type="JdbcDataSource"
     driver="oracle.jdbc.driver.OracleDriver"
     url="jdbc:oracle:thin:@whatever:12345:whatever"
     user="me"
     name="ds-db"
     password="secret"/>

   <dataSource type="BinURLDataSource"
     name="ds-url"/>

   <document>
     <entity name="my_database"
      dataSource="ds-db"
      query="select * from my_database where rownum &lt;=2">
       <field column="CONTENT_ID"                name="content_id"/>
       <field column="CMS_TITLE"                 name="cms_title"/>
       <field column="FORM_TITLE"                name="form_title"/>
       <field column="FILE_SIZE"                 name="file_size"/>
       <field column="KEYWORDS"                  name="keywords"/>
       <field column="DESCRIPTION"               name="description"/>
       <field column="CONTENT_URL"               name="content_url"/>
     </entity>

     <entity name="my_database_url"
      dataSource="ds-url"
      query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
      <entity processor="TikaEntityProcessor"
       dataSource="ds-url"
       format="text">
       url="http://www.mysite.com/${my_database.content_url}"
       <field column="text"/>
      </entity>
     </entity>

   </document>
</dataConfig>

I added the my_database_url entity to an existing (working) database 
config so that Tika can index the content that content_url points to.

Is there anything obviously wrong with what I've tried so far?
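
Looking at the config again, two things seem suspect to me, though I have not confirmed either. The url attribute on the TikaEntityProcessor entity sits after the closing > of its tag, so it would be read as text rather than as an attribute. The my_database_url entity also runs a SQL query against ds-url, which is not a JDBC source. Separately, the wiki pages seem to describe TikaEntityProcessor and BinURLDataSource as newer than 1.4.0, which could explain the ClassNotFoundException. Below is a minimal sketch of what I think the nesting should look like. It assumes a Solr build that actually ships both classes, that Oracle returns the column as CONTENT_URL, and that the schema has a field named text (all three are assumptions on my part):

```xml
<dataConfig>
  <!-- Database source, unchanged from the working config -->
  <dataSource type="JdbcDataSource"
    name="ds-db"
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@whatever:12345:whatever"
    user="me"
    password="secret"/>

  <!-- Binary URL source; assumed to exist only in builds newer than 1.4.0 -->
  <dataSource type="BinURLDataSource"
    name="ds-url"/>

  <document>
    <entity name="my_database"
      dataSource="ds-db"
      query="select * from my_database where rownum &lt;=2">
      <field column="CONTENT_ID"  name="content_id"/>
      <field column="CMS_TITLE"   name="cms_title"/>
      <field column="CONTENT_URL" name="content_url"/>
      <!-- remaining database fields omitted for brevity -->

      <!-- Tika entity nested inside the row entity, with url as a real
           attribute. The CONTENT_URL casing and the "text" schema field
           are assumptions. -->
      <entity name="my_database_url"
        processor="TikaEntityProcessor"
        dataSource="ds-url"
        format="text"
        url="http://www.mysite.com/${my_database.CONTENT_URL}">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The Tika entity is nested inside my_database so that ${my_database.CONTENT_URL} can be resolved once per database row, which removes the need for the second SQL query.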


Thanks - Tod
