manifoldcf-user mailing list archives

From Bisonti Mario <Mario.Biso...@vimar.com>
Subject R: How to set Tika with ManifoldCF and Solr
Date Fri, 12 Oct 2018 13:58:59 GMT
Hello Karl.
I found the problem.

In my Solr output connection, the “Included mime types:” field was populated, as I said.
I saw that you don’t have that field populated.

I tried emptying it, since you told me it isn’t used when “Use the Extract Update Handler:”
is unchecked, and it works!

I no longer get the EXCLUDEDMIMETYPE error. So it seems that, even though “Included
mime types:” is not supposed to be used, populating it generates the EXCLUDEDMIMETYPE error
(even if the specific mime type was in that field).

Thanks a lot

Mario



From: Karl Wright <daddywri@gmail.com>
Sent: Friday, 12 October 2018 15:23
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

Just tried again here.

Compiled trunk from scratch, configured as instructed, and received no errors.  Example chunk
of Simple History output:

>>>>>>

Start Time | Activity | Identifier | Result Code | Bytes | Time
10-12-2018 09:16:45.618 | read document | C:\Users\kawright\Desktop\BIG Froggie _ Flickr - Photo Sharin...g!_files\flickr-yahoo-logo.png | OK | 1558 | 4667
10-12-2018 09:16:45.209 | document ingest (solr) | file:/C:/Users/kawright/Desktop/BIG%20Froggie%20_%20Flickr%20...-%20Photo%20Sharing!_files/3_004.js | OK | 117855 | 23
10-12-2018 09:16:42.215 | document ingest (solr) | file:/C:/Users/kawright/Desktop/HMM%20State%20Diagram.vsdx | OK | 37 | 409
10-12-2018 09:16:41.769 | extract [tika] | file:/C:/Users/kawright/Desktop/BIG%20Froggie%20_%20Flickr%20...-%20Photo%20Sharing!_files/3_004.js | OK | 117855 | 430
10-12-2018 09:16:41.156 | read document | C:\Users\kawright\Desktop\BIG Froggie _ Flickr - Photo Sharin...g!_files\3_004.js | OK | 117854 | 4474
10-12-2018 09:16:39.255 | extract [tika] | file:/C:/Users/kawright/Desktop/HMM%20State%20Diagram.vsdx | OK | 37 | 1193
10-12-2018 09:16:37.930 | read document | C:\Users\kawright\Desktop\HMM State Diagram.vsdx | OK | 42427 | 5945
10-12-2018 09:16:35.816 | document ingest (solr) | file:/C:/Users/kawright/Desktop/BIG%20Froggie%20_%20Flickr%20...-%20Photo%20Sharing!_files/3.js | OK | 54643 | 57
10-12-2018 09:16:34.572 | document ingest (solr) | file:/C:/Users/kawright/Desktop/Ruth%20Wright/BlumShapiro2016....docx | OK | 1916 | 23
10-12-2018 09:16:34.284 | extract [tika] | file:/C:/Users/kawright/Desktop/BIG%20Froggie%20_%20Flickr%20...-%20Photo%20Sharing!_files/3.js | OK | 54643 | 21
10-12-2018 09:16:33.384 | read document | C:\Users\kawright\Desktop\BIG Froggie _ Flickr - Photo Sharin...g!_files\3.js | OK | 54642 | 4545
10-12-2018 09:16:31.042 | extract [tika] | file:/C:/Users/kawright/Desktop/Ruth%20Wright/BlumShapiro2016....docx | OK | 1916 | 28

<<<<<<

Here's the solr output configuration:

>>>>>>
Parameters:

User ID=admin
ZooKeeper znode path=
Socket timeout=900
Server remove handler=/update
Included mime types=
Use extract update handler=false
Solr created date field name=
ZooKeeper client timeout=60
Solr modified date field name=
Solr core name=collection1
Server protocol=http
Realm=
Server name=localhost
Server status handler=/admin/ping
Password=********
Excluded mime types=
Commits=true
Maximum document length=1000000
Server port=8984
Connection timeout=60
Solr type=standard
Solr filename field name=
Commit within=
Solr id field name=id
Solr mime type field name=
ZooKeeper connect timeout=60
Collection=collection1
Server update handler=/update
Server web application=solr
Solr original size field name=
Solr indexed date field name=
Solr content field name=data
<<<<<<


On Fri, Oct 12, 2018 at 5:03 AM Bisonti Mario <Mario.Bisonti@vimar.com> wrote:
Hello.
I downloaded and compiled ManifoldCF 2.11 from scratch and used the internal Tika, but I get
the same problem.


From: Karl Wright <daddywri@gmail.com>
Sent: Thursday, 11 October 2018 19:29
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

I cannot reproduce your problem.  Perhaps you can download a new instance and configure it
from scratch using the embedded tika?  If that works it should be possible to figure out what
the difference is.

Karl
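
One way to find such a difference is to compare the parameter listings of a working and a failing output connection. The sketch below is only an illustration: the class name ConnectionDiff and the sample value application/pdf are assumptions, and it assumes each parameter appears as name=value on its own line, as in the connection view.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for comparing two parameter listings; not part of ManifoldCF. */
final class ConnectionDiff {
    /** Parses "name=value" lines into a sorted map; lines without '=' are skipped. */
    static Map<String, String> parse(String dump) {
        Map<String, String> params = new TreeMap<>();
        for (String line : dump.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                params.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return params;
    }

    /** Returns every parameter whose value differs, formatted as "left | right". */
    static Map<String, String> diff(Map<String, String> left, Map<String, String> right) {
        Map<String, String> changed = new TreeMap<>();
        TreeSet<String> names = new TreeSet<>(left.keySet());
        names.addAll(right.keySet());
        for (String name : names) {
            String l = left.get(name);
            String r = right.get(name);
            if (!Objects.equals(l, r)) {
                changed.put(name, l + " | " + r);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Sample values are assumptions; the failing connection's list is not shown in the thread.
        String working = "Included mime types=\nUse extract update handler=false";
        String failing = "Included mime types=application/pdf\nUse extract update handler=false";
        diff(parse(working), parse(failing))
            .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Only the parameters that differ are printed, which keeps the comparison readable even for a long parameter list.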

On Thu, Oct 11, 2018, 12:23 PM Bisonti Mario <Mario.Bisonti@vimar.com> wrote:
I tried updating Solr, the Tika server, and ManifoldCF to the latest versions.

I tried adding another transformation before the Tika transformation to filter the allowed
documents, as you suggested in another discussion, but nothing changed.
I always get the same Result Code: EXCLUDEDMIMETYPE


I read another discussion ( https://lists.apache.org/thread.html/66a3f9780bbcc98e404e25f5a0e56a8a6c007448642c3bc15a366ed2@%3Cuser.manifoldcf.apache.org%3E )
but I can’t tell whether they solved the issue.

☹

Thanks a lot.
Mario






From: Karl Wright <daddywri@gmail.com>
Sent: Thursday, 11 October 2018 14:57
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

When the "use extracting update handler" field is UNCHECKED, the mime types
you list are IGNORED.  Only "text" mime types are accepted by the Solr connection in that
case.  But that is exactly what the Tika extractor sends along, and many other people do this,
and I can make it work fine here, so I don't know what you are doing wrong.

Karl
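
The rule described in this message can be modeled roughly as follows. This is a hypothetical sketch, not ManifoldCF's actual Solr connector code: the class and method names are made up, "text" is narrowed to text/plain for simplicity, and the behavior when the box is checked (an empty list accepting everything) is an assumption added only to make the example complete.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical model of the described rule; not taken from ManifoldCF. */
final class SolrMimeGate {
    /** Strips parameters such as ";charset=utf-8" and lowercases the base type. */
    static String baseType(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * With the extracting update handler unchecked, the configured list is ignored
     * and only plain text is accepted. With it checked, the list decides (assumed).
     */
    static boolean accepts(String mimeType, boolean useExtractHandler, Set<String> included) {
        String base = baseType(mimeType);
        if (!useExtractHandler) {
            return base.equals("text/plain");
        }
        return included.isEmpty() || included.contains(base);
    }

    public static void main(String[] args) {
        Set<String> included = Set.of("application/pdf");
        // What the Tika transformer asks about downstream:
        System.out.println(accepts("text/plain;charset=utf-8", false, included));
        // A binary type is refused in this mode even though it is listed:
        System.out.println(accepts("application/pdf", false, included));
    }
}
```

Under this model, a pipeline with a Tika transformer in front of Solr should pass, because the transformer only ever asks about plain text.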


On Thu, Oct 11, 2018 at 8:37 AM Bisonti Mario <Mario.Bisonti@vimar.com> wrote:
This is my Solr output connection:

I tried setting content_type as the “Mime type field name:” but the result is always the same.

Could it be that, with the flag unchecked, ManifoldCF doesn’t use the specified mime types?

I am using a snapshot version of ManifoldCF from three months ago.




From: Karl Wright <daddywri@gmail.com>
Sent: Thursday, 11 October 2018 14:20
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

I confirmed that both the Tika Service transformer and the Tika transformer check the same
exact mime type:

>>>>>>
  @Override
  public boolean checkMimeTypeIndexable(VersionContext pipelineDescription, String mimeType,
    IOutputCheckActivity checkActivity)
    throws ManifoldCFException, ServiceInterruption
  {
    // We should see what Tika will transform
    // MHL
    // Do a downstream check
    return checkActivity.checkMimeTypeIndexable("text/plain;charset=utf-8");
  }
<<<<<<

So: please verify that your Solr connection is set up correctly and the "use extracting update
handler" box is UNCHECKED.

Thanks,
Karl


On Thu, Oct 11, 2018 at 8:16 AM Karl Wright <daddywri@gmail.com> wrote:
When you uncheck the "use extracting update handler" checkbox, the Solr connection only accepts
text/plain, and no binary formats.  The Tika extractor, though, should set the mime type always
to "text/plain".  Since the Simple History says otherwise, I wonder if there's a problem with
the external Tika extractor.  Perhaps you can try the internal one to get your pipeline working
first?  If the external one does not send the right mime type, then we need to correct that
so you should open a ticket.

Thanks,
Karl


On Thu, Oct 11, 2018 at 8:10 AM Bisonti Mario <Mario.Bisonti@vimar.com> wrote:
Now the document isn’t ingested by Solr, because I get:

Solr connector rejected document due to mime type restrictions: (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)

But that mime type is listed on the tab.

And the settings worked well when I used Tika inside Solr.

Could you help me?
Thanks

From: Bisonti Mario <Mario.Bisonti@vimar.com>
Sent: Thursday, 11 October 2018 14:03
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr


My mistake…
As you wrote, I had to uncheck “use extracting update handler”.

Now I have to understand the fields mentioned in the schema, etc.

From: Bisonti Mario <Mario.Bisonti@vimar.com>
Sent: Thursday, 11 October 2018 13:45
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

I see that the job ran, but no documents were processed.
10-11-2018 13:32:25.649 | job end | 1539153700219(G_IT_Area_condivisa_Mario_XLSM) | 0 | 1
10-11-2018 13:32:14.211 | job start | 1539153700219(G_IT_Area_condivisa_Mario_XLSM) | 0 | 1





Do I have to uncheck “Use the Extract Update Handler” on my Solr output connection?






From: Karl Wright <daddywri@gmail.com>
Sent: Thursday, 11 October 2018 13:36
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

Please have a look at your "Simple History" report to see why the documents aren't getting
indexed.

Thanks,
Karl


On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario <Mario.Bisonti@vimar.com> wrote:
Thanks Karl.
I tried, but it doesn’t index any documents.
It seems that it doesn’t see them?

Perhaps the problem is the “Ignore Tika exception” setting, which I don’t know where to set
in ManifoldCF?





From: Karl Wright <daddywri@gmail.com>
Sent: Thursday, 11 October 2018 12:24
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

Hi Mario,

(1) When you use the Tika server externally, you do not get the boilerpipe HTML extractor
available for configuration and use.  That is because it's external now.
(2) In your Solr connection, you want to uncheck the box that says "use extracting update
handler", and you want to change the output handler from "/update/extract" to just "/update".

Karl
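
To illustrate the difference between the two handlers: "/update/extract" expects Solr to run Tika on an uploaded file, while "/update" takes documents whose text has already been extracted. Below is a minimal sketch of a request of the second kind, assuming Java 11+, Solr's JSON update format, and illustrative values for the host, port, core, and the id and data field names. It is not meant to mirror what ManifoldCF does internally.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch of a plain "/update" request; all names and values are illustrative. */
final class PlainUpdateSketch {
    /** Builds the update URL for a core. */
    static URI updateUri(String host, int port, String core) {
        return URI.create("http://" + host + ":" + port + "/solr/" + core + "/update?commit=true");
    }

    /** Escapes the characters that must not appear raw inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** One document with an id and the already-extracted text in a "data" field. */
    static String body(String id, String text) {
        return "[{\"id\":\"" + jsonEscape(id) + "\",\"data\":\"" + jsonEscape(text) + "\"}]";
    }

    static HttpRequest buildRequest(String host, int port, String core, String id, String text) {
        return HttpRequest.newBuilder(updateUri(host, port, core))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body(id, text)))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request =
            buildRequest("localhost", 8983, "collection1", "doc-1", "text already extracted by Tika");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("doc-1", "text already extracted by Tika"));
        // Sending requires a running Solr:
        // java.net.http.HttpClient.newHttpClient()
        //     .send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```

Nothing in this request needs Tika on the Solr side, which is the point of switching the handler.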


On Thu, Oct 11, 2018 at 4:45 AM Bisonti Mario <Mario.Bisonti@vimar.com> wrote:
Hello.
I would like to use a Tika server, started from the command line, with ManifoldCF, so that
ManifoldCF, through a transformation connection, processes documents with Tika and indexes
them to the Solr output connection.

I started the Tika server:
java -jar /opt/tika/tika-server-1.19.1.jar

Then I created a transformation connection with Tika server: localhost and Tika port 998,
and the connection works.
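
Outside ManifoldCF, the same server can be exercised directly, which helps separate Tika problems from connector problems. The sketch below assumes Java 11+ and the Tika server's PUT /tika endpoint returning plain text. The class name, host, and port argument are illustrative (Tika server listens on 9998 unless it is started with a different port).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical probe for a standalone Tika server; not part of ManifoldCF. */
final class TikaServerProbe {
    /** Builds a PUT /tika request asking for the extracted text as text/plain. */
    static HttpRequest extractRequest(String host, int port, byte[] content) {
        return HttpRequest.newBuilder(URI.create("http://" + host + ":" + port + "/tika"))
            .header("Accept", "text/plain")
            .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
            .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java TikaServerProbe.java <file>");
            return;
        }
        byte[] content = Files.readAllBytes(Path.of(args[0]));
        // Host and port are assumptions; use the values the server was started with.
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(extractRequest("localhost", 9998, content), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If this returns the document text, the Tika side works and any remaining failure is likely in the job or output connection settings.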

Then I created a job and, in the Connection tab, inserted the transformation I had just
created before the Solr output.



Note that I don’t see the “Exception” and “Boilerplate” tabs.
Why is that?

Furthermore, when I start the job, I see that Solr hangs with this exception:
2018-10-11 10:03:47.268 WARN  (qtp1223240796-17) [   x:core_share] o.e.j.s.HttpChannel /solr/core_share/update/extract
java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
        at java.lang.Class.forName0(Native Method) ~[?:?]
        at java.lang.Class.forName(Class.java:374) ~[?:?]

In fact, I renamed the Tika .jar in the folder solr/contrib/extraction/lib to be sure that
Solr doesn’t use Tika, because I would like ManifoldCF to use Tika, but it doesn’t work.

I suppose I have to configure Solr not to use Tika.

How do I do this?

I saw https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/107708451/Data+Extraction+Tika+Embedded+in+Solr+Deactivation+Configuration
but I don’t have Datafari. So, in a standard Solr configuration, how can I deactivate
Tika?

Thanks a lot

Mario
