manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 13:15:25 GMT
Hi Othman,

The Paths tab uses standard Unix/Windows file name wildcards (e.g. "*" and
"?") for file names, *not* regular expressions.  We could not support
complex regular expressions without adding new kinds of exclude/include
options for paths; otherwise backwards compatibility would be harmed.
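
As a rough illustration of the difference (a Python sketch using the
standard fnmatch module, not ManifoldCF's own matching code; the file names
are made up, and case handling in the connector may differ from what is
shown here):

```python
import fnmatch
import re


def wildcard_match(name: str, pattern: str) -> bool:
    """Case-sensitive shell-style match: '*' is any run of characters, '?' is exactly one."""
    return fnmatch.fnmatchcase(name, pattern)


# Hypothetical file names, for illustration only.
names = ["SON_report.doc", "son_report.doc", "notes(SON).txt"]

# In a wildcard pattern, parentheses are literal characters, not a regex group,
# so "*(SON)*" only matches names that really contain "(SON)".
literal_parens = [n for n in names if wildcard_match(n, "*(SON)*")]

# The plain wildcard for "name contains SON" (case-sensitive in this sketch).
contains_son = [n for n in names if wildcard_match(n, "*SON*")]

# A case-insensitive regular expression, shown only for comparison.
regex_son = [n for n in names if re.search(r"son", n, re.IGNORECASE)]

print(literal_parens)
print(contains_son)
print(regex_son)
```

In this sketch a pattern such as "*(SON)*" behaves very differently as a
wildcard than it would as a regular expression, which may explain why a
regex-style pattern entered in the Paths tab matches nothing.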

In the decade that this connector has existed, nobody so far has needed
regular expressions instead of file name wildcards.

Are you sure you need this ability?  If so, could you create a ticket for
this enhancement?  It will not be done quickly, that is certain.

Karl


On Thu, Aug 31, 2017 at 9:01 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:

> I have tried what you told me to do, and as you expected, the crawling
> resumed. How about the regular expressions? How can I use complex regular
> expressions in the job's Paths tab?
>
> Thank you very much for your help.
>
> Othman.
>
>
> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>
>> Ok, I will try it right away and let you know if it works.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Oh, and you also may need to edit your options.env files to include them
>>> in the classpath for startup.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> If you are amenable, there is another workaround you could try.
>>>> Specifically:
>>>>
>>>> (1) Shut down all MCF processes.
>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>
>>>> xmlbeans-2.6.0.jar
>>>> poi-ooxml-schemas-3.15.jar
>>>>
>>>> (3) Restart everything and see if your crawl resumes.
>>>>
>>>> Please let me know what happens.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>
>>>>> One simple workaround is to use the external Tika server transformer
>>>>> rather than the embedded Tika Extractor.  I'm still looking into why the
>>>>> jar is not being found.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>>> stuck on that specific file.
>>>>>> The job status is still Running. You can see it in the attached file.
>>>>>> For your information, the job started yesterday.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Othman
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>> I think we will need a ticket to address this, if you are indeed
>>>>>>> using the binary distribution.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm actually using the binary version. For security reasons, I
>>>>>>>> can't send any files from my computer. I have copied the stack trace
>>>>>>>> and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>> Meanwhile, I have read the documentation about how to restrict the
>>>>>>>> crawling and I don't think the '|' works in the specified. For
>>>>>>>> instance, I would like to restrict the crawling for the documents
>>>>>>>> that contain the word 'sound'. I proceed as follows: *(SON)* . The
>>>>>>>> document is in capital letters, and I noticed that it was not taken
>>>>>>>> into consideration.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Othman
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Othman,
>>>>>>>>>
>>>>>>>>> The way you restrict documents with the windows share connector is
>>>>>>>>> by specifying information on the "Paths" tab in jobs that crawl
>>>>>>>>> windows shares.  There is end-user documentation both online and
>>>>>>>>> distributed with all binary distributions that describes how to do
>>>>>>>>> this.  Have you found it?
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Karl,
>>>>>>>>>>
>>>>>>>>>> Thank you for your response. I will start using zookeeper and I
>>>>>>>>>> will let you know if it works. I have another question to ask.
>>>>>>>>>> Actually, I need to apply some filters while crawling. I don't
>>>>>>>>>> want to crawl some files and some folders. Could you give me an
>>>>>>>>>> example of how to use the regex? Does the regex allow the use of
>>>>>>>>>> /i to ignore case?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Othman
>>>>>>>>>>
>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>
>>>>>>>>>>> File-based sync is deprecated because people often have problems
>>>>>>>>>>> with getting file permissions right, and they do not understand
>>>>>>>>>>> how to shut processes down cleanly, and zookeeper is resilient
>>>>>>>>>>> against that.  I highly recommend using zookeeper sync.
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so you do
>>>>>>>>>>> not need huge amounts of memory.  The default values are more than
>>>>>>>>>>> enough for 35,000 files, which is a pretty small job for
>>>>>>>>>>> ManifoldCF.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper
>>>>>>>>>>>> is different from file-based sync. I also need guidance on how
>>>>>>>>>>>> to manage my PC's memory. How many GB should I allocate for the
>>>>>>>>>>>> start-agent of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>
>>>>>>>>>>>> Othman.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have looked into
>>>>>>>>>>>>>> the ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.
>>>>>>>>>>>>>> 8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup
>>>>>>>>>>>>>> before restarting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output
>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata and
>>>>>>>>>>>>>> a file system as a repository connection. During the job, I
>>>>>>>>>>>>>> don't extract the content of the documents. I was wondering
>>>>>>>>>>>>>> whether the issue comes from elasticsearch?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>>> it might go away on retry, but does not.  It can be either on
>>>>>>>>>>>>>>> the repository side or on the output side.  If you look at the
>>>>>>>>>>>>>>> Simple History in the UI, or at the manifoldcf.log file, you
>>>>>>>>>>>>>>> should be able to get a better sense of what went wrong.
>>>>>>>>>>>>>>> Without further information, I can't say any more.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale
>>>>>>>>>>>>>>>> in France. I'm currently using your recent version of
>>>>>>>>>>>>>>>> ManifoldCF, 2.8. I'm working on an internal search engine.
>>>>>>>>>>>>>>>> For this reason, I'm using ManifoldCF to index documents on
>>>>>>>>>>>>>>>> windows shares. I encountered a serious problem while
>>>>>>>>>>>>>>>> crawling 35K documents. Most of the time, when ManifoldCF
>>>>>>>>>>>>>>>> starts crawling a large document (19 MB for example), it
>>>>>>>>>>>>>>>> ends the job with the following error: repeated service
>>>>>>>>>>>>>>>> interruptions - failure processing document : software
>>>>>>>>>>>>>>>> caused connection abort: socket write error.
>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem,
>>>>>>>>>>>>>>>> please?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
