manifoldcf-user mailing list archives

From Matthew Parker <mpar...@apogeeintegration.com>
Subject Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR
Date Mon, 27 Feb 2012 18:47:24 GMT
Thanks for the insights Karl. I'll have to give this a little more thought.

On Mon, Feb 27, 2012 at 1:22 PM, Karl Wright <daddywri@gmail.com> wrote:

> If you've got a mix of data and only some of it comes through
> ManifoldCF, you can still use the ManifoldCF-generated URL for those
> that originate with ManifoldCF.  This should even work for documents
> from the JCIFS connector - even though the default urls from this
> connector are "file:" style, there's a mapping you can set up for
> documents from that connector that maps to a URL format of your
> choice.  Similarly, most JDBC document urls can readily be constructed
> as part of the database queries that you provide for the job.  So it
> does not sound like your servlet would have to do anything custom for
> any of the data that comes from ManifoldCF at this time, as long as
> you define your connections and jobs with some care as to the URLs
> they will produce.
>
> Thanks,
> Karl
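
The kind of rewrite described above, from a "file:" style URL to a URL format of your choice, can be sketched as follows. This is only an illustration under stated assumptions: in ManifoldCF the mapping is set up on the connection or job rather than written as code, and the servlet base URL, the sample share path, and the function name here are hypothetical.

```python
import re
from urllib.parse import quote

# Matches "file:" followed by any number of slashes, then server and path.
FILE_URL = re.compile(r"^file:/+(?P<server>[^/]+)/(?P<path>.+)$")

# Hypothetical base URL of the custom retrieval servlet.
SERVLET_BASE = "http://search.example.com/docs"


def to_servlet_url(file_url, base=SERVLET_BASE):
    """Rewrite a "file:" style URL into a servlet URL; pass others through."""
    match = FILE_URL.match(file_url)
    if match is None:
        # Not a "file:" URL (for example a SharePoint http URL): leave as-is.
        return file_url
    server = quote(match.group("server"))
    # quote() keeps "/" by default, so the folder structure is preserved.
    path = quote(match.group("path"))
    return f"{base}/{server}/{path}"


if __name__ == "__main__":
    print(to_servlet_url("file://///fileserver/share/reports/Q1 summary.doc"))
```

If the URLs are shaped this way at crawl time, the value stored in Solr is already the one the web client needs, and the servlet does not have to translate paths itself.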
>
>
> On Mon, Feb 27, 2012 at 11:25 AM, Matthew Parker
> <mparker@apogeeintegration.com> wrote:
> > Karl,
> >
> > I'm importing data from a number of sources, including SharePoint, file
> > shares, and an ORACLE database. The files/records are indexed by SOLR.
> >
> > Right now, some of the import is done through custom SOLR Data Import
> > Handler facilities. I'm hoping to move away from that in the future.
> >
> > We are also aggregating some of the file share data into custom views on
> > the web client. Lots of preprocessing.
> >
> > All of this is stored in the SOLR index with metadata describing how to
> > display it within our custom web client. If the result is a certain type,
> > we have custom templates that are displayed as a result of that.
> >
> > Manifold is a good solution for the SharePoint data. We don't really do
> > any custom processing on it other than strip HTML from the text.
> > It's the database and file share information that adds some challenges.
> > I'm hoping to get SOLR out of the text processing pipeline, and just
> > let it index data. We are moving to Pentaho at some point, and we'll
> > probably handle most of the custom metadata processing there.
> > At some point, we'll possibly integrate Pentaho as an output connection
> > in Manifold.
> >
> > Thanks,
> >
> > Matt
> >
> > On Mon, Feb 27, 2012 at 10:04 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> Please see my response interleaved below.
> >>
> >> On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
> >> <mparker@apogeeintegration.com> wrote:
> >> > I'm trying to push data into SOLR.
> >> >
> >> > Is there a way to transform the metadata coming in from different data
> >> > sources like SharePoint, and the File Share, prior to posting it into
> >> > SOLR?
> >> >
> >>
> >> In general, ManifoldCF does not have data transformation abilities.
> >> With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
> >> extract content from documents and to perform transformations to
> >> document metadata etc.  At some point ManifoldCF may gain more
> >> transformation abilities in order to support search engines that
> >> don't have a pipeline, but that is not currently available.
> >>
> >> > For instance, documents have metadata specifying their file path. I
> >> > need to transform that to a URL I can use within SOLR to retrieve that
> >> > document through a servlet that I wrote.
> >> >
> >>
> >> The ManifoldCF model is that a connector creates a URL for each
> >> document that it indexes, using whatever makes sense for that
> >> particular repository to get you back to the document in question.
> >> So, for instance, Documentum documents will use URLs that point at
> >> Documentum's Webtop web application.
> >>
> >> It would be helpful to understand more precisely what you are trying
> >> to do.  You could, for instance, modify your servlet to redirect to
> >> the ManifoldCF-generated URL.  It gets indexed into Solr as the "id"
> >> field.
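
The redirect idea above can be sketched as a small decision function: read the "id" field from a Solr result and redirect only if it holds a web URL. This is a sketch, not ManifoldCF or Solr code; the function name, the fallback path, and the sample document are hypothetical, and the result is assumed to be available as a plain dictionary.

```python
from urllib.parse import urlsplit

# Only redirect to schemes a browser can follow directly.
ALLOWED_SCHEMES = {"http", "https"}


def redirect_target(solr_doc, fallback="/not-found"):
    """Return the URL a servlet could redirect to for one Solr result."""
    # Per the thread, the connector-generated URL is indexed as "id".
    doc_id = solr_doc.get("id", "")
    if urlsplit(doc_id).scheme.lower() in ALLOWED_SCHEMES:
        return doc_id
    # "file:" style or missing ids need a mapping before they are usable.
    return fallback


if __name__ == "__main__":
    doc = {"id": "http://sharepoint.example.com/sites/a/doc.docx"}
    print(redirect_target(doc))
```

Checking the scheme keeps the servlet from redirecting to "file:" style ids that have not yet been mapped to a web URL.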
> >>
> >> > Also, based on specific metadata that I'm seeing in the documents, I
> >> > might want to conditionally populate other fields in the SOLR index.
> >> >
> >>
> >> That sounds like a job for the Tika pipeline to me.
> >>
> >> Thanks,
> >> Karl
> >>
> >> > ------------------------------
> >> > This e-mail and any files transmitted with it may be proprietary. Please
> >> > note that any views or opinions presented in this e-mail are solely
> >> > those of the author and do not necessarily represent those of Apogee
> >> > Integration.
> >> >
> >
> >
>
