manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: How to determine the set of all possible fields in MCF output?
Date Sat, 14 Oct 2017 23:17:49 GMT
Hi Phil,

You are correct that in MCF a document's attribute set is determined by the
sum total of all the connections the document passes through.  That
includes transformation connections as well as the repository connection.
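
As a rough illustration of that accumulation, here is a toy model in
Python.  It is not MCF code: the stage functions, field names, and values
are all hypothetical, and it only shows that the final attribute set is
whatever every stage in the pipeline has contributed.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]
Stage = Callable[[Document], Document]


def repository_stage(doc: Document) -> Document:
    # Hypothetical repository connection: contributes its own attributes.
    return {**doc, "repo_path": "/docs/a.pdf"}


def extractor_stage(doc: Document) -> Document:
    # Hypothetical transformation connection: adds attributes on top of
    # whatever the earlier stages already supplied.
    return {**doc, "extracted_title": "Example"}


def run_pipeline(stages: List[Stage], doc: Optional[Document] = None) -> Document:
    # Pass the document through each stage in order; attributes accumulate.
    current = dict(doc or {})
    for stage in stages:
        current = stage(current)
    return current


if __name__ == "__main__":
    print(sorted(run_pipeline([repository_stage])))
    print(sorted(run_pipeline([repository_stage, extractor_stage])))
```

Dropping a stage from the list removes only the fields that stage would
have added, which is the behavior to expect from a pipeline of this shape.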

Tika is one connection that does add a lot of fields, and these depend not
only on the configuration of the Tika connection but also on the kind of
document being extracted.  If you want to figure out the sum total of
what's possible, you will need to consult the Tika documentation.  And yes,
Tika generates its field names from the metadata it finds in each document.

Alternatively, you can configure your job to send output to a null output
connection.  This connection records all attribute information for each
document in the simple history, so you can get an idea of what to expect.
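
Once you have a sample of per-document attributes, a few lines of Python
can turn it into a candidate field list for the Solr schema.  This is only
a sketch: it assumes you have already copied the attributes into one dict
per document, which is a hypothetical format and not something MCF exports
for you, and the sample field names below are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List


def collect_field_names(records: Iterable[Dict[str, str]]) -> Counter:
    # Count how many documents carried each field name.
    counts: Counter = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


def schema_candidates(records: Iterable[Dict[str, str]]) -> List[str]:
    # Union of every field name seen in the sample, sorted for readability.
    return sorted(collect_field_names(records))


if __name__ == "__main__":
    # Hypothetical sample; real names depend on your connections and documents.
    sample = [
        {"title": "Report", "author": "A. Writer"},
        {"title": "Notes", "page_count": "3"},
    ]
    for name, seen in sorted(collect_field_names(sample).items()):
        print(f"{name}: seen in {seen} document(s)")
```

The list is only as complete as the sample, so crawl a representative mix
of document types before relying on it.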

I'm a little confused about your statement that Tika runs even when it's
not in a job's pipeline.  That's not actually true, so I'm wondering what
you are seeing.

Thanks,
Karl


On Sat, Oct 14, 2017 at 6:39 PM, Phillip Rhodes <motley.crue.fan@gmail.com>
wrote:

> Hi all, I've been working with MCF the past few days and am very happy
> with what it lets me do, and I have a pipeline going from my
> repository to Solr which works fine.  But there is one point I clearly
> don't understand, which is:
>
> How do you know exactly what fields are going to be output in a given
> configuration?  I found that I had to resort to trial and error to
> tweak my Solr schema to avoid "undefined field xxxxx" errors from
> Manifold when trying to write to Solr.  Now to be fair, clearly I
> could just ignore any fields I don't specifically know I want, but I'd
> like to understand how this works.
>
> Is it the case that the initial set of fields depends on the
> repository connector?  I found that I seemed to get some Alfresco
> specific stuff when reading from Alfresco, as opposed to what I got
> from a simple dummy file-system repo I was initially experimenting
> with.
>
> It also seems that Tika adds some fields (actually a lot of fields)
> even when you don't have a Tika transform wired in explicitly?  Is it
> the case that you need to put in an explicit Tika transform if you
> want to control which fields are contributed by Tika?
>
> And on that point, is there a master list of possible fields that Tika
> will emit, or is Tika just transforming the names of metadata fields
> in the documents it encounters, and programmatically generating a
> field name?
>
>
> Any and all help on understanding how this works is greatly appreciated...
>
>
> Phil
> ~~~~
> This message optimized for indexing by NSA PRISM
>
