manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 15:29:55 GMT
Did you put the dom4j jar in the options.env classpath?

Karl
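
When a java.lang.NoClassDefFoundError names a class, as in the org/apache/poi/POIXMLTypeLoader error quoted later in this thread, it can help to check which jar files on disk actually contain that class before moving anything between directories. The following is a minimal Python sketch of such a check. It is not part of ManifoldCF; the function names are made up for illustration, and it relies only on the fact that a jar is a zip archive. It reports where a class lives on disk, not whether the classpath in options.env includes that jar.

```python
import zipfile
from pathlib import Path


def class_entry_name(class_name):
    """Turn a class name (dotted, or slash-separated as printed in a
    NoClassDefFoundError) into the entry path used inside a jar."""
    return class_name.replace(".", "/") + ".class"


def jars_containing(class_name, directories):
    """Return the jar files in the given directories that contain the class.

    Hypothetical helper for illustration: it only inspects files on disk
    and knows nothing about how the JVM classpath is configured.
    """
    entry = class_entry_name(class_name)
    hits = []
    for directory in directories:
        for jar in sorted(Path(directory).glob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        hits.append(jar)
            except zipfile.BadZipFile:
                # Skip files that end in .jar but are not valid archives.
                continue
    return hits
```

For example, jars_containing("org/apache/poi/POIXMLTypeLoader", ["lib", "connector-common-lib"]) would list the matching jars in those two directories, assuming the script is run from the ManifoldCF installation directory.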


On Thu, Aug 31, 2017 at 11:23 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:

> I moved back both the jars you mentioned and a different error is showing. You
> will find the stack trace attached.
>
> Thanks,
> Othman
>
> On Thu, 31 Aug 2017 at 17:09, Karl Wright <daddywri@gmail.com> wrote:
>
>> I've looked at the dependencies; you should not have moved poi-3.15.jar.
>> Please move that back, and commons-collections4-4.1.jar too.
>>
>> You *will* need to move curvesapi-1.04.jar though.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> If you include poi.jar, then all dependencies of poi.jar must also be
>>> included.  This would mean that curvesapi-1.04.jar and
>>> commons-collections4-4.1.jar should also be included.
>>>
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> I added the two jars that you mentioned and another one:
>>>> poi-3.15.jar. Unfortunately, there is another error showing. This time, it
>>>> concerns Excel files. You will find the stack trace attached.
>>>>
>>>> Othman.
>>>>
>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Hi Othman,
>>>>>
>>>>> Yes, this shows that the jar we moved calls back into another jar,
>>>>> which will also need to be moved.  *That* jar has yet another dependency
>>>>> too.
>>>>>
>>>>> The list of jars is thus extended to include:
>>>>>
>>>>> poi-ooxml-3.15.jar
>>>>> dom4j-1.6.1.jar
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You will find attached the stack trace. My apologies for the bad
>>>>>> quality of the image, I'm doing my best to send you the stack trace as I
>>>>>> don't have the right to send documents outside the company.
>>>>>>
>>>>>> Thank you for your time,
>>>>>>
>>>>>> Othman
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Oh, actually it didn't solve the problem. I looked into the log
>>>>>>>> file and saw the following error:
>>>>>>>>
>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>>
>>>>>>>> Maybe another jar is missing ?
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I have tried what you told me to do, and as you expected the crawling
>>>>>>>>> resumed. How about the regular expressions? How can I write complex
>>>>>>>>> regular expressions in the job's Paths tab?
>>>>>>>>>
>>>>>>>>> Thank you very much for your help.
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>>
>>>>>>>>>> Othman.
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you are amenable, there is another workaround you could
>>>>>>>>>>>> try.  Specifically:
>>>>>>>>>>>>
>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>> (2) Move the following two files from connector-common-lib to
>>>>>>>>>>>> lib:
>>>>>>>>>>>>
>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>
>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>
>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>>>> transformer rather than the embedded Tika Extractor.  I'm still looking
>>>>>>>>>>>>> into why the jar is not being found.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my job
>>>>>>>>>>>>>> got stuck on that specific file.
>>>>>>>>>>>>>> The job status is still Running. You can see it in the
>>>>>>>>>>>>>> attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm actually using the binary version. For security
>>>>>>>>>>>>>>>> reasons, I can't send any files from my computer. I have copied the stack
>>>>>>>>>>>>>>>> trace and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>>>>>>>>>> Meanwhile, I have read the documentation about how to restrict the crawling
>>>>>>>>>>>>>>>> and I don't think the '|' works in the specified. For instance, I would
>>>>>>>>>>>>>>>> like to restrict the crawling to the documents that contain the word
>>>>>>>>>>>>>>>> 'sound'. I proceed as follows: *(SON)* . The document is in capital letters
>>>>>>>>>>>>>>>> and I noticed that it didn't take it into consideration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The way you restrict documents with the windows share
>>>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab in jobs that
>>>>>>>>>>>>>>>>> crawl windows shares.  There is end-user documentation both online and
>>>>>>>>>>>>>>>>> distributed with all binary distributions that describes how to do this.
>>>>>>>>>>>>>>>>> Have you found it?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using zookeeper
>>>>>>>>>>>>>>>>>> and I will let you know if it works. I have another question to ask.
>>>>>>>>>>>>>>>>>> Actually, I need to apply some filters while crawling. I don't want to crawl
>>>>>>>>>>>>>>>>>> some files and some folders. Could you give me an example of how to use the
>>>>>>>>>>>>>>>>>> regex? Does the regex allow the use of /i to ignore case?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>>>>> problems with getting file permissions right, and they do not understand
>>>>>>>>>>>>>>>>>>> how to shut processes down cleanly, and zookeeper is resilient against
>>>>>>>>>>>>>>>>>>> that.  I highly recommend using zookeeper sync.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so
>>>>>>>>>>>>>>>>>>> you do not need huge amounts of memory.  The default values are more than
>>>>>>>>>>>>>>>>>>> enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how
>>>>>>>>>>>>>>>>>>>> zookeeper is different from file-based sync. I also need guidance on how to
>>>>>>>>>>>>>>>>>>>> manage my PC's memory. How many GB should I allocate for the start-agent of
>>>>>>>>>>>>>>>>>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have
>>>>>>>>>>>>>>>>>>>>>> looked into the ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You must cleanup before
>>>>>>>>>>>>>>>>>>>>>> restarting.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 'ES (lowercase) synapses' is the Elasticsearch
>>>>>>>>>>>>>>>>>>>>>> output connection. Moreover, the job uses Tika to extract metadata and a
>>>>>>>>>>>>>>>>>>>>>> file system as a repository connection. During the job, I don't extract the
>>>>>>>>>>>>>>>>>>>>>> content of the documents. I was wondering whether the issue comes from
>>>>>>>>>>>>>>>>>>>>>> Elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that
>>>>>>>>>>>>>>>>>>>>>>> looks like it might go away on retry, but does not.  It can be either on
>>>>>>>>>>>>>>>>>>>>>>> the repository side or on the output side.  If you look at the Simple
>>>>>>>>>>>>>>>>>>>>>>> History in the UI, or at the manifoldcf.log file, you should be able to get
>>>>>>>>>>>>>>>>>>>>>>> a better sense of what went wrong.  Without further information, I can't
>>>>>>>>>>>>>>>>>>>>>>> say any more.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société
>>>>>>>>>>>>>>>>>>>>>>>> Générale in France. I'm currently using the recent version 2.8 of
>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF. I'm working on an internal search engine, and for this I'm using
>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF to index documents on Windows shares. I encountered a
>>>>>>>>>>>>>>>>>>>>>>>> serious problem while crawling 35K documents. Most of the time, when
>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF starts crawling a large document (19 MB, for example), it ends
>>>>>>>>>>>>>>>>>>>>>>>> the job with the following error: repeated service interruptions - failure
>>>>>>>>>>>>>>>>>>>>>>>> processing document : software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
