lucene-solr-user mailing list archives

From Alexey Serba
Subject Re: Data Import Handler Rich Format Documents
Date Fri, 18 Jun 2010 21:55:26 GMT
I think you can use the existing ExtractingRequestHandler (ERH) to do the job,
i.e. add a child entity to your DIH metadata entity:

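<dataSource type="JdbcDataSource" name="db" ... />
<dataSource type="URLDataSource" name="solr" />
<entity name="metadata" dataSource="db" query="select id, title, url from metadata">
    <!-- the child entity also needs a url attribute that calls ERH with ${metadata.url} -->
    <entity processor="PlainTextEntityProcessor" name="content" dataSource="solr">
        <field column="plainText" name="content"/>
    </entity>
</entity>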

That's not a working example, just the basic idea. You still need to
URI-escape the ${metadata.url} reference, probably using some transformer
(regexp, javascript?), extract the file content from the ERH XML response
using XPath, and probably do some HTML stripping.
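
To make the three remaining steps concrete (escaping the URL, pulling the
content out of the ERH response, stripping markup), here is a rough Python
sketch done outside of DIH. It is only an illustration under assumptions: the
ERH address, the helper names build_extract_url and extract_text, and the
response layout (extracted XHTML in the first top-level <str> element) are my
guesses, so check them against what your Solr instance actually returns.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape
from urllib.parse import quote

# Assumption: ERH is mapped at this address; adjust host, port and path.
ERH_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_url):
    """URI-escape the document URL before passing it as stream.url."""
    # safe="" also escapes ':', '/', '?' and '&' so the document URL
    # cannot be confused with ERH's own query parameters.
    return ERH_URL + "?extractOnly=true&wt=xml&stream.url=" + quote(doc_url, safe="")


def extract_text(erh_xml):
    """Pull the extracted body out of an ERH-style XML response and strip tags."""
    root = ET.fromstring(erh_xml)
    # Assumption: the extracted XHTML is the text of the first top-level <str>.
    node = root.find("./str")
    if node is None or node.text is None:
        return ""
    # Crude HTML stripping: drop tags, decode entities, collapse whitespace.
    no_tags = re.sub(r"<[^>]+>", " ", node.text)
    return " ".join(unescape(no_tags).split())
```

Inside DIH the same steps would map onto a transformer for the escaping and
an XPath-based processor plus an HTML-stripping transformer for the rest, but
I have not tested that combination.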


On Fri, Jun 18, 2010 at 4:51 PM, Tod wrote:
> I have a database containing Metadata from a content management system.
>  Part of that data includes a URL pointing to the actual published document
> which can be an HTML file or a PDF, MS Word/Excel/Powerpoint, etc.
> I'm already indexing the Metadata and that provides a lot of value.  The
> customer however would like that the content pointed to by the URL also be
> indexed for more discrete searching.
> This article at Lucid:
> describes the process of coding a custom transformer.  A separate article
> I've read implies Nutch could be used to provide this functionality too.
> What would be the best and most efficient way to accomplish what I'm trying
> to do?  I have a feeling the Lucid article might be dated and there might
> be ways to do this now without any coding and maybe without even needing to
> use Nutch.  I'm using the current release version of Solr.
> Thanks in advance.
> - Tod
