manifoldcf-user mailing list archives

From Beelz Ryuzaki <i93oth...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 10:57:56 GMT
I'm actually using the binary version. For security reasons, I can't send
any files from my computer, so I have copied the stack trace and scanned it
with my cellphone. I hope it will be helpful. Meanwhile, I have read the
documentation about how to restrict the crawling, and I don't think the '|'
works as specified. For instance, I would like to restrict the crawling
to the documents that contain the word 'sound'. I proceeded as follows:
*(SON)* . The document is in capital letters, and I noticed that the rule
didn't take it into consideration.
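
To make the case question concrete, here is a minimal sketch of case-insensitive matching with plain java.util.regex. It assumes the filter is evaluated as a Java regular expression, which may not be how the Paths tab actually matches names. The class name, method name, and file names are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveFilterSketch {

    // Hypothetical helper: true when fileName contains word, ignoring case.
    // Pattern.quote treats the word as literal text, not as regex syntax.
    public static boolean nameContains(String fileName, String word) {
        Pattern p = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE);
        return p.matcher(fileName).find();
    }

    public static void main(String[] args) {
        // Made-up file names, for illustration only.
        String[] names = {"Rapport_SON.pdf", "rapport_son.pdf", "budget.xls"};
        for (String name : names) {
            System.out.println(name + " -> " + nameContains(name, "son"));
        }
        // Java has no trailing /i modifier; the inline flag (?i) plays that role.
        System.out.println(Pattern.compile("(?i)son").matcher("Rapport_SON.pdf").find());
    }
}
```

In Java, case is ignored either through the Pattern.CASE_INSENSITIVE flag or through the inline (?i) prefix. Without one of them, "son" does not match "SON".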

Thanks,
Othman



On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com> wrote:

> Hi Othman,
>
> The way you restrict documents with the windows share connector is by
> specifying information on the "Paths" tab in jobs that crawl windows
> shares.  There is end-user documentation, both online and distributed with
> all binary distributions, that describes how to do this.  Have you found it?
>
> Karl
>
>
> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <i93othman@gmail.com>
> wrote:
>
>> Hello Karl,
>>
>> Thank you for your response. I will start using zookeeper and I will let
>> you know if it works. I have another question to ask. I need to
>> apply some filters while crawling, because I don't want to crawl some files
>> and some folders. Could you give me an example of how to use the regex? Does
>> the regex allow the use of /i to ignore case?
>>
>> Thanks,
>> Othman
>>
>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Beelz,
>>>
>>> File-based sync is deprecated because people often have problems with
>>> getting file permissions right, and they do not understand how to shut
>>> processes down cleanly, and zookeeper is resilient against that.  I highly
>>> recommend using zookeeper sync.
>>>
>>> ManifoldCF is engineered to not put files into memory so you do not need
>>> huge amounts of memory.  The default values are more than enough for 35,000
>>> files, which is a pretty small job for ManifoldCF.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>> wrote:
>>>
>>>> I'm actually not using zookeeper. I want to know how zookeeper is
>>>> different from file-based sync. I also need guidance on how to manage my
>>>> PC's memory. How many GB should I allocate for the start-agent of
>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>
>>>> Othman.
>>>>
>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Your disk is not writable for some reason, and that's interfering with
>>>>> ManifoldCF 2.8 locking.
>>>>>
>>>>> I would suggest two things:
>>>>>
>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>> (2) Have a look if you still get failures after that.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Mr Karl,
>>>>>>
>>>>>> Thank you Mr Karl for your quick response. I have looked into the
>>>>>> ManifoldCF log file and extracted the following warnings :
>>>>>>
>>>>>> - Attempt to set file lock
>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>
>>>>>>
>>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>>> process; locks may be left dangling. You must cleanup before restarting.
>>>>>>
>>>>>> ES (lowercase) synapses being the elasticsearch output connection.
>>>>>> Moreover, the job uses Tika to extract metadata and a file system as a
>>>>>> repository connection. During the job, I don't extract the content of the
>>>>>> documents. I was wondering if the issue comes from elasticsearch?
>>>>>>
>>>>>> Othman.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Othman,
>>>>>>>
>>>>>>> ManifoldCF aborts a job if there's an error that looks like it might
>>>>>>> go away on retry, but does not.  It can be either on the repository side or
>>>>>>> on the output side.  If you look at the Simple History in the UI, or at the
>>>>>>> manifoldcf.log file, you should be able to get a better sense of what went
>>>>>>> wrong.  Without further information, I can't say any more.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale in
>>>>>>>> France. I'm currently using your recent version of ManifoldCF, 2.8. I'm
>>>>>>>> working on an internal search engine. For this reason, I'm using ManifoldCF
>>>>>>>> in order to index documents on windows shares. I encountered a serious
>>>>>>>> problem while crawling 35K documents. Most of the time, when ManifoldCF
>>>>>>>> starts crawling a large document (19 MB, for example), it ends the job
>>>>>>>> with the following error: repeated service interruptions - failure
>>>>>>>> processing document : software caused connection abort: socket write error.
>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>
>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>> I'm looking forward to your response.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Othman BELHAJ
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>
