manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Mapping Webcrawler metadata
Date Mon, 14 Jul 2014 13:02:16 GMT
Hi Andrea,

The web crawler connector sends along all HTTP header values EXCEPT for
certain explicitly excluded ones as metadata.  The excluded headers are
those which are involved in authorization or which would change on every
fetch.
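
As a rough illustration of the behaviour described above (not the
connector's actual code), the filtering could be sketched like this in
Python.  The exclusion list and the function name are assumptions made
up for the example; the connector's real list may well differ.

```python
# Hypothetical exclusion list, for illustration only: headers tied to
# authorization, plus headers whose values change on every fetch.
EXCLUDED_HEADERS = {"authorization", "www-authenticate", "set-cookie", "date", "age"}


def headers_to_metadata(headers):
    """Turn HTTP response headers into metadata, skipping excluded ones.

    Keys are lower-cased header names; values are lists, since a header
    may legitimately appear more than once.
    """
    metadata = {}
    for name, value in headers.items():
        key = name.lower()
        if key in EXCLUDED_HEADERS:
            continue  # never forwarded to the output connector
        metadata.setdefault(key, []).append(value)
    return metadata


example = headers_to_metadata({
    "Content-Type": "text/html",
    "Set-Cookie": "session=abc",
    "Last-Modified": "Mon, 14 Jul 2014 12:00:00 GMT",
})
```

Here `example` keeps only the `content-type` and `last-modified`
entries, which are the kind of names that would then be candidates for
the Solr Field Mapping tab.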

The metadata fields you list do not seem to come from the web connector,
but rather from Solr Cell (Tika), which is the extracting update handler
in Solr.  I don't know the full set of fields Tika can generate.  The
Tika-generated metadata fields cannot be mapped using the Solr Field
Mapping tab because that extraction takes place in Solr, not in ManifoldCF.
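
For fields produced on the Solr side, any renaming has to be done through
the extracting update handler's own parameters (`fmap.<source>`,
`uprefix`, `literal.<field>`).  The sketch below only shows what such a
request URL looks like; the core URL, document id, target field name, and
the helper function are assumptions for the example, and in practice
these parameters may instead be set as handler defaults in
solrconfig.xml.

```python
from urllib.parse import urlencode


def extract_url(solr_core_url, doc_id, field_map, unknown_prefix="attr_"):
    """Build a URL for Solr's /update/extract handler (illustrative helper).

    field_map maps a Tika-generated field name to the Solr field that
    should receive it; fields unknown to the schema get unknown_prefix.
    """
    params = [("literal.id", doc_id), ("uprefix", unknown_prefix)]
    for tika_field, solr_field in sorted(field_map.items()):
        params.append(("fmap." + tika_field, solr_field))
    return solr_core_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Hypothetical core URL and mapping: send Tika's "author" to "creator".
url = extract_url(
    "http://localhost:8983/solr/collection1", "doc1", {"author": "creator"}
)
```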

MCF 1.7 will have the option of running Tika locally in MCF, as a
transformation connector, and not using Solr's extracting update handler,
so you should have better control when 1.7 is released.

Thanks,
Karl



On Mon, Jul 14, 2014 at 7:16 AM, Andrea Piemontese <zerologiko@gmail.com>
wrote:

> Hi All,
>
> I'm trying to work out which information/metadata is extracted by the
> WebcrawlerConnector so that it can be imported and indexed by the
> SolrConnector.
>
> Executing a Job with WebcrawlerConnector as input and SolrConnector as
> output, the metadata I get in Solr are the following:
>
> - links
> - id
> - author
> - authors
> - title
> - content_type
> - resourcename
> - content
> - _version_
>
> Is there a way to know which metadata are extracted by the
> WebcrawlerConnector?
> In other words, which metadata can I use in the "Solr Field Mapping"
> tab of the job configuration?
>
> Thanks a lot in advance.
>
