manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 12:15:05 GMT
Oh, and you also may need to edit your options.env files to include them in
the classpath for startup.

Karl
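
For readers following along, step (2) of the workaround quoted below can be sketched in Python. This is only an illustration, not part of ManifoldCF: the helper name `move_jars` and its parameters are hypothetical, and the directory names and jar names are taken from the quoted message. It assumes all MCF processes are already stopped, and it does not touch options.env.

```python
import shutil
from pathlib import Path

# The two jar names listed in the quoted workaround.
JARS = ["xmlbeans-2.6.0.jar", "poi-ooxml-schemas-3.15.jar"]


def move_jars(mcf_home, jars=JARS, src_dir="connector-common-lib", dst_dir="lib"):
    """Move each named jar from src_dir to dst_dir under mcf_home.

    Hypothetical helper for illustration only. Returns the names of the
    jars that were actually moved; jars missing from src_dir are skipped.
    """
    home = Path(mcf_home)
    src = home / src_dir
    dst = home / dst_dir
    # Refuse to run against a directory that does not look like an install.
    if not src.is_dir() or not dst.is_dir():
        raise FileNotFoundError(f"expected {src} and {dst} to exist")
    moved = []
    for name in jars:
        source = src / name
        if source.is_file():
            shutil.move(str(source), str(dst / name))
            moved.append(name)
    return moved
```

Call it with the ManifoldCF install directory as `mcf_home`. The classpath change in options.env mentioned above still has to be made by hand.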


On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com> wrote:

> If you are amenable, there is another workaround you could try.
> Specifically:
>
> (1) Shut down all MCF processes.
> (2) Move the following two files from connector-common-lib to lib:
>
> xmlbeans-2.6.0.jar
> poi-ooxml-schemas-3.15.jar
>
> (3) Restart everything and see if your crawl resumes.
>
> Please let me know what happens.
>
> Karl
>
>
>
> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I created a ticket for this: CONNECTORS-1450.
>>
>> One simple workaround is to use the external Tika server transformer
>> rather than the embedded Tika Extractor.  I'm still looking into why the
>> jar is not being found.
>>
>> Karl
>>
>>
>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com>
>> wrote:
>>
>>> Yes, I'm actually using the latest binary version, and my job got stuck
>>> on that specific file.
>>> The job status is still Running. You can see it in the attached file.
>>> For your information, the job started yesterday.
>>>
>>> Thanks,
>>>
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> It looks like a dependency of Apache POI is missing.
>>>> I think we will need a ticket to address this, if you are indeed using
>>>> the binary distribution.
>>>>
>>>> Thanks!
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm actually using the binary version. For security reasons, I can't
>>>>> send any files from my computer. I have copied the stack trace and scanned
>>>>> it with my cellphone. I hope it will be helpful. Meanwhile, I have read the
>>>>> documentation about how to restrict the crawling, and I don't think the '|'
>>>>> works as specified. For instance, I would like to restrict the crawling
>>>>> to documents that contain the word 'sound'. I proceed as follows:
>>>>> *(SON)* . The document is in capital letters and I noticed that it didn't
>>>>> take it into consideration.
>>>>>
>>>>> Thanks,
>>>>> Othman
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Othman,
>>>>>>
>>>>>> The way you restrict documents with the windows share connector is by
>>>>>> specifying information on the "Paths" tab in jobs that crawl windows
>>>>>> shares.  There is end-user documentation both online and distributed with
>>>>>> all binary distributions that describes how to do this.  Have you found it?
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Karl,
>>>>>>>
>>>>>>> Thank you for your response. I will start using zookeeper and I will
>>>>>>> let you know if it works. I have another question to ask. Actually, I need
>>>>>>> to apply some filters while crawling. I don't want to crawl some files and
>>>>>>> some folders. Could you give me an example of how to use the regex? Does
>>>>>>> the regex allow using /i to ignore case?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Othman
>>>>>>>
>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Beelz,
>>>>>>>>
>>>>>>>> File-based sync is deprecated because people often have problems
>>>>>>>> with getting file permissions right, and they do not understand how to shut
>>>>>>>> processes down cleanly, and zookeeper is resilient against that.  I highly
>>>>>>>> recommend using zookeeper sync.
>>>>>>>>
>>>>>>>> ManifoldCF is engineered to not put files into memory so you do not
>>>>>>>> need huge amounts of memory.  The default values are more than enough for
>>>>>>>> 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper is
>>>>>>>>> different from file-based sync. I also need guidance on how to manage my
>>>>>>>>> PC's memory. How many GB should I allocate for the start-agent of
>>>>>>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Your disk is not writable for some reason, and that's interfering
>>>>>>>>>> with ManifoldCF 2.8 locking.
>>>>>>>>>>
>>>>>>>>>> I would suggest two things:
>>>>>>>>>>
>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>
>>>>>>>>>>> Thank you Mr Karl for your quick response. I have looked into
>>>>>>>>>>> the ManifoldCF log file and extracted the following warnings :
>>>>>>>>>>>
>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8
>>>>>>>>>>> \multiprocess-file-example\.\.\synch
>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>>>>>>>> process; locks may be left dangling. You must cleanup before restarting.
>>>>>>>>>>>
>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output
>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata and a file
>>>>>>>>>>> system as a repository connection. During the job, I don't extract the
>>>>>>>>>>> content of the documents. I was wondering if the issue comes from
>>>>>>>>>>> elasticsearch?
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>
>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like it
>>>>>>>>>>>> might go away on retry, but does not.  It can be either on the repository
>>>>>>>>>>>> side or on the output side.  If you look at the Simple History in the UI,
>>>>>>>>>>>> or at the manifoldcf.log file, you should be able to get a better sense of
>>>>>>>>>>>> what went wrong.  Without further information, I can't say any more.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale
>>>>>>>>>>>>> in France. I'm actually using your recent version of ManifoldCF, 2.8. I'm
>>>>>>>>>>>>> working on an internal search engine. For this reason, I'm using ManifoldCF
>>>>>>>>>>>>> in order to index documents on windows shares. I encountered a serious
>>>>>>>>>>>>> problem while crawling 35K documents. Most of the time, when ManifoldCF
>>>>>>>>>>>>> starts crawling a large document (19 MB, for example), it ends the job
>>>>>>>>>>>>> with the following error: repeated service interruptions - failure
>>>>>>>>>>>>> processing document : software caused connection abort: socket write error.
>>>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
>
