manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: MCF not indexing documents due to mime-type
Date Fri, 22 Dec 2017 01:35:22 GMT
Well, there are some differences; "Solr Cell" (as they used to call it)
generates a couple of fields that the standard Tika extractor in MCF
won't.  But other than that it should work.

Note that you can still use the extracting update handler in the Solr
connector; since the input will always be text/plain, Tika shouldn't do
anything to the document on the Solr side.  If it turns out that Tika does
alter the document, you can use the standard Solr input handler instead, but
bear in mind that this handler requires memory buffering on the MCF side, so
we insist you give a limit on the size of the content sent to Solr for
indexing in that mode.
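
To make the two modes concrete, here is a rough Python sketch of what the
request might look like in each case.  This is not MCF's actual code: the
function name, the "content" field name, the max_bytes parameter, and the
choice to reject oversized content are all assumptions made for illustration.
The handler paths are Solr's conventional defaults and may differ in your
solrconfig.xml.

```python
import json
from urllib.parse import urlencode

# Conventional Solr handler paths (assumed; check your solrconfig.xml).
EXTRACT_HANDLER = "/update/extract"
STANDARD_HANDLER = "/update"


def build_solr_request(doc_id, text, use_extract_handler,
                       max_bytes=None, content_field="content"):
    """Return (path, headers, body) for indexing already-extracted text.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    raw = text.encode("utf-8")

    if use_extract_handler:
        # The body goes out as text/plain, so Tika on the Solr side
        # should have nothing left to extract.
        query = urlencode({"literal.id": doc_id})
        headers = {"Content-Type": "text/plain; charset=utf-8"}
        return f"{EXTRACT_HANDLER}?{query}", headers, raw

    # Standard handler: the whole document is held in memory to build the
    # update message, so a size limit is required in this mode.
    if max_bytes is None:
        raise ValueError("a content size limit is required in this mode")
    if len(raw) > max_bytes:
        # Assumption: reject oversized content rather than truncate it.
        raise ValueError("content exceeds the configured size limit")

    payload = json.dumps([{"id": doc_id, content_field: text}])
    headers = {"Content-Type": "application/json"}
    return STANDARD_HANDLER, headers, payload.encode("utf-8")
```

Calling build_solr_request("doc1", "hello", True) yields a request for the
extracting handler, while passing False together with max_bytes yields a
JSON update for the standard handler.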

Thanks,
Karl


On Thu, Dec 21, 2017 at 7:21 PM, Phillip Rhodes <motley.crue.fan@gmail.com>
wrote:

> OK, it looks like the root of the problem I was seeing, metadata
> winding up mixed in with the content, is ultimately a bug in Solr.
> <https://issues.apache.org/jira/browse/SOLR-9178>
>
> It seems that if you use the "Tika built into Solr" approach this is
> just what you get.  The answer seems to be "do the Tika processing
> outside of Solr".
>
> So now my question vis-a-vis ManifoldCF is this: can I achieve the
> scenario of having MCF index everything, and send it all to Solr,
> while *not* using the ExtractingRequestHandler if I run Tika in MCF
> directly?  My naive understanding is that the "Tika Content Extractor"
> should let me accomplish this.  Can anyone confirm if that is correct?
>
>
> Thanks
>
>
> Phil
>
> This message optimized for indexing by NSA PRISM
>
>
> On Wed, Dec 20, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com> wrote:
> > Hi Phil,
> >
> > Some output connectors accept *only* text documents.  That's why
> > you need to run your documents through Tika first.  So your original
> > setup was right.
> >
> > If you are still using ElasticSearch, it will accept non-text
> > documents only if you specify the mapper attachment in the output
> > connection configuration.
> >
> >
> >
> > Karl
> >
> >
> > On Wed, Dec 20, 2017 at 4:25 AM, Phillip Rhodes <motley.crue.fan@gmail.com> wrote:
> >>
> >> MCF folks:
> >>
> >> I'm about to tear my hair out over this one... I just realized that
> >> I've been running MCF with the "Use the Extract Update Handler:"
> >> option checked.  Suspecting this might be related to another issue I
> >> was having (content was not being stored in the field named in the
> >> "Content field name:" option in MCF), I turned this option off.
> >>
> >> Now, MCF happily rejects nearly every document in my repository with
> >> this:
> >>
> >> Result Code: EXCLUDEDMIMETYPE
> >> Result Description: Excluding document because of mime type
> >> (application/pdf)
> >> (and so on for many other mime types)
> >>
> >> So... this is *not* what I would expect to happen as I have nothing at
> >> all listed in the "excluded mime types" setting for this output
> >> connector.  With nothing explicitly excluded, I would (perhaps
> >> naively) expect all mime types to be sent to Solr.
> >>
> >> But what makes it even worse is this: even when I explicitly add types
> >> (for example, application/pdf) to the "included mime types" setting
> >> and re-index, I *still* get the same message and no PDF files are
> >> indexed.
> >>
> >> Any ideas?  Is this a bug, or is there something else I need to do?
> >>
> >>
> >>
> >> Thanks,
> >>
> >>
> >> Phil
> >> ~~~
> >> This message optimized for indexing by NSA PRISM
> >
> >
>
