manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: How to set Tika with ManifoldCF and Solr
Date Thu, 11 Oct 2018 12:16:28 GMT
When you uncheck the "use extracting update handler" checkbox, the Solr
connection accepts only text/plain, and no binary formats.  The Tika
extractor, though, should always set the mime type to "text/plain".  Since
the Simple History says otherwise, I wonder if there's a problem with the
external Tika extractor.  Perhaps you can try the internal one to get your
pipeline working first?  If the external one does not send the right mime
type, then we need to correct that, so you should open a ticket.
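
To check what the external Tika server itself returns, independently of
ManifoldCF, a minimal Python sketch along these lines may help.  It assumes
the Tika server listens on localhost at its default port 9998 and exposes the
standard PUT /tika endpoint; the function names are hypothetical, and the
sketch does not exercise the ManifoldCF connector code at all.

```python
import urllib.request

# Assumption: Tika server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks Tika for plain-text extraction."""
    return urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def is_plain_text(content_type: str) -> bool:
    """True if the media type is text/plain, ignoring parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower() == "text/plain"


def extracted_content_type(path: str, url: str = TIKA_URL) -> str:
    """Send a file to the Tika server and return the response Content-Type."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type", "")
```

Calling is_plain_text(extracted_content_type(path)) on one of the rejected
spreadsheets shows whether the server hands back plain text; if it does, the
wrong mime type is more likely introduced somewhere between the transformation
connection and the Solr output connection.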

Thanks,
Karl


On Thu, Oct 11, 2018 at 8:10 AM Bisonti Mario <Mario.Bisonti@vimar.com>
wrote:

> Now the document isn’t ingested by Solr because I get:
>
> Solr connector rejected document due to mime type restrictions:
> (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
>
> But the mime type is listed on the tab.
>
> And the settings worked well when I used Tika inside Solr.
>
> Could you help me?
>
> Thanks
>
> *From:* Bisonti Mario <Mario.Bisonti@vimar.com>
> *Sent:* Thursday, 11 October 2018 14:03
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
> My mistake.
>
> As you wrote, I had to uncheck “use extracting update handler”.
>
> Now I need to understand the fields mentioned in the schema, etc.
>
> *From:* Bisonti Mario <Mario.Bisonti@vimar.com>
> *Sent:* Thursday, 11 October 2018 13:45
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
> I see that the job was processed, but without the document in it.
>
> 10-11-2018 13:32:25.649  job end    1539153700219(G_IT_Area_condivisa_Mario_XLSM)  0  1
> 10-11-2018 13:32:14.211  job start  1539153700219(G_IT_Area_condivisa_Mario_XLSM)  0  1
>
> Do I have to uncheck “Use the Extract Update Handler” on my Solr output
> connection?
>
> *From:* Karl Wright <daddywri@gmail.com>
> *Sent:* Thursday, 11 October 2018 13:36
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
>
>
> Please have a look at your "Simple History" report to see why the
> documents aren't getting indexed.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
> Thanks Karl.
>
> I tried, but it doesn’t index documents.
>
> It seems that it doesn’t see them?
>
> Perhaps the problem is the “Ignore Tika exception” setting, which I don’t
> know where to set in ManifoldCF?
>
> *From:* Karl Wright <daddywri@gmail.com>
> *Sent:* Thursday, 11 October 2018 12:24
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
>
>
> Hi Mario,
>
>
>
> (1) When you use the Tika server externally, you do not get the boilerpipe
> HTML extractor available for configuration and use.  That is because it's
> external now.
>
> (2) In your Solr connection, you want to uncheck the box that says "use
> extracting update handler", and you want to change the output handler from
> "/update/extract" to just "/update".
>
>
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 4:45 AM Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
> Hello.
>
> I would like to use a Tika server started from the command line with
> ManifoldCF, so that ManifoldCF, through a transformation connector,
> processes documents with Tika and indexes them to the Solr output connector.
>
>
>
> I started Tika server:
> java -jar /opt/tika/tika-server-1.19.1.jar
>
>
>
> Then I created a transformation connection with Tika server: localhost
> and Tika port 998, and the connection works.
>
> Then I created a job and, in the Connection tab, I inserted the
> transformation I had already created before the Solr output.
>
> Note that I don’t see the “Exception” and “Boilerplate” tabs.
>
> Why is this?
>
> Furthermore, if I start the job, I see that Solr hangs with an exception:
>
> 2018-10-11 10:03:47.268 WARN  (qtp1223240796-17) [   x:core_share]
> o.e.j.s.HttpChannel /solr/core_share/update/extract
> java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
>         at java.lang.Class.forName0(Native Method) ~[?:?]
>         at java.lang.Class.forName(Class.java:374) ~[?:?]
>
>
>
> In fact, I renamed the Tika .jar in the folder
> solr/contrib/extraction/lib to be sure that Solr doesn’t use Tika, because
> I would like ManifoldCF to use Tika, but it doesn’t work.
>
> I suppose I have to configure Solr not to use Tika.
>
> How do I do this?
>
>
>
> I found
> https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/107708451/Data+Extraction+Tika+Embedded+in+Solr+Deactivation+Configuration
> but I don’t have Datafari, so in a standard Solr configuration, how can I
> deactivate Tika?
>
>
>
> Thanks a lot
>
>
>
> Mario
>
>
>
>
