manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Question about ManifoldCF 2.8
Date Thu, 31 Aug 2017 13:16:15 GMT
Once again, I need a stack trace to diagnose what the problem is.

Thanks,
Karl
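
For reference, here is a minimal Java sketch of one way to check whether the class named in the NoClassDefFoundError quoted below is visible, and which jar it is loaded from. It is only an illustration: ClassLocator is a made-up name, the result is meaningful only if the program runs with the same classpath as the ManifoldCF process that logged the error, and nothing in this thread confirms which jar is supposed to supply the class.

```java
import java.security.CodeSource;

public class ClassLocator {
    /** Returns the location a class was loaded from, or a message if it cannot be loaded. */
    static String locate(String className) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> c = Class.forName(className, false, ClassLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source
            return src == null ? "bootstrap/JDK class" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError
            return "NOT FOUND: " + e;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.poi.POIXMLTypeLoader";
        System.out.println(name + " -> " + locate(name));
    }
}
```

A "NOT FOUND" result only says the class is not visible to this class loader. It does not by itself distinguish a missing jar from a jar that sits in a directory another class loader cannot see.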


On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <i93othman@gmail.com> wrote:

> Oh, actually it didn't solve the problem. I looked into the log file and
> saw the following error:
>
> Error tossed : org/apache/poi/POIXMLTypeLoader
> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>
> Maybe another jar is missing?
>
> Othman.
>
> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>
>> I have tried what you told me to do and, as you expected, the crawling
>> resumed. What about the regular expressions? How can I write complex
>> regular expressions in the job's Paths tab?
>>
>> Thank you very much for your help.
>>
>> Othman.
>>
>>
>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <i93othman@gmail.com> wrote:
>>
>>> Ok, I will try it right away and let you know if it works.
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Oh, and you also may need to edit your options.env files to include
>>>> them in the classpath for startup.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> If you are amenable, there is another workaround you could try.
>>>>> Specifically:
>>>>>
>>>>> (1) Shut down all MCF processes.
>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>
>>>>> xmlbeans-2.6.0.jar
>>>>> poi-ooxml-schemas-3.15.jar
>>>>>
>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>
>>>>> Please let me know what happens.
>>>>>
>>>>> Karl
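
The file move in step (2) above can be sketched in Java as follows. This is an illustration under assumptions, not part of the original advice: the ManifoldCF install root is passed as a hypothetical command-line argument, connector-common-lib and lib are taken to sit directly under it, all MCF processes are assumed to be stopped already as in step (1), and MoveJars is a made-up name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveJars {
    // The two jar names given in the message above
    static final String[] JARS = {"xmlbeans-2.6.0.jar", "poi-ooxml-schemas-3.15.jar"};

    /** Moves the two jars from root/connector-common-lib to root/lib. */
    static void moveJars(Path root) throws IOException {
        Path from = root.resolve("connector-common-lib");
        Path to = root.resolve("lib");
        for (String jar : JARS) {
            // Fails with NoSuchFileException if the jar is not in connector-common-lib
            Files.move(from.resolve(jar), to.resolve(jar), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: install root (assumed argument)
        moveJars(Paths.get(args[0]));
    }
}
```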
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>
>>>>>> One simple workaround is to use the external Tika server transformer
>>>>>> rather than the embedded Tika Extractor.  I'm still looking into why the
>>>>>> jar is not being found.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <i93othman@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>>>> stuck on that specific file.
>>>>>>> The job status is still Running. You can see it in the attached
>>>>>>> file. For your information, the job started yesterday.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Othman
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>> I think we will need a ticket to address this, if you are indeed
>>>>>>>> using the binary distribution.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <i93othman@gmail.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> I'm actually using the binary version. For security reasons, I can't
>>>>>>>>> send any files from my computer. I have copied the stack trace and
>>>>>>>>> scanned it with my cellphone. I hope it will be helpful. Meanwhile, I
>>>>>>>>> have read the documentation about how to restrict the crawling, and I
>>>>>>>>> don't think the '|' works there. For instance, I would like to
>>>>>>>>> restrict the crawling to the documents that contain the word 'sound'.
>>>>>>>>> I proceed as follows: *(SON)* . The document is in capital letters,
>>>>>>>>> and I noticed that it didn't take it into consideration.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Othman
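
On the '|' and capital-letter question above, here is a minimal sketch of how alternation and case-insensitive matching behave in Java's own java.util.regex. It is an illustration only: this thread does not establish that the Paths tab interprets its input as a Java regular expression (it may use simple wildcards instead), and the class name and file names are hypothetical.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveMatch {
    // (?i) is Java's inline counterpart of a trailing /i; '|' separates alternatives.
    static final Pattern SOUND = Pattern.compile("(?i).*(son|sound).*");

    /** True if the whole name matches, i.e. it contains "son" or "sound" in any letter case. */
    static boolean matches(String fileName) {
        return SOUND.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only
        for (String name : new String[] {"RAPPORT_SON.docx", "sound-check.pdf", "budget.xlsx"}) {
            System.out.println(name + " -> " + matches(name));
        }
    }
}
```

Passing Pattern.CASE_INSENSITIVE as the second argument to Pattern.compile has the same effect as the (?i) prefix, but the inline form is the only option when just a pattern string can be supplied.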
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Othman,
>>>>>>>>>>
>>>>>>>>>> The way you restrict documents with the windows share connector
>>>>>>>>>> is by specifying information on the "Paths" tab in jobs that crawl
>>>>>>>>>> windows shares.  There is end-user documentation both online and
>>>>>>>>>> distributed with all binary distributions that describes how to do
>>>>>>>>>> this.  Have you found it?
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your response. I will start using zookeeper and I
>>>>>>>>>>> will let you know if it works. I have another question to ask.
>>>>>>>>>>> I need to apply some filters while crawling: I don't want to crawl
>>>>>>>>>>> some files and some folders. Could you give me an example of how to
>>>>>>>>>>> use the regex? Does the regex allow /i to ignore case?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Othman
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>
>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>> problems with getting file permissions right, and they do not
>>>>>>>>>>>> understand how to shut processes down cleanly, and zookeeper is
>>>>>>>>>>>> resilient against that.  I highly recommend using zookeeper sync.
>>>>>>>>>>>>
>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so you do
>>>>>>>>>>>> not need huge amounts of memory.  The default values are more than
>>>>>>>>>>>> enough for 35,000 files, which is a pretty small job for
>>>>>>>>>>>> ManifoldCF.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper
>>>>>>>>>>>>> is different from file-based sync. I also need guidance on how
>>>>>>>>>>>>> to manage my PC's memory. How many GB should I allocate for the
>>>>>>>>>>>>> start-agent of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <daddywri@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have looked into the
>>>>>>>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.
>>>>>>>>>>>>>>> 8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup before
>>>>>>>>>>>>>>> restarting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output
>>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata and a
>>>>>>>>>>>>>>> file system as a repository connection. During the job, I don't
>>>>>>>>>>>>>>> extract the content of the documents. I was wondering if the issue
>>>>>>>>>>>>>>> comes from elasticsearch?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <
>>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>>>> it might go away on retry, but does not.  It can be either on the
>>>>>>>>>>>>>>>> repository side or on the output side.  If you look at the Simple
>>>>>>>>>>>>>>>> History in the UI, or at the manifoldcf.log file, you should be
>>>>>>>>>>>>>>>> able to get a better sense of what went wrong.  Without further
>>>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>> i93othman@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société Générale in
>>>>>>>>>>>>>>>>> France. I'm using your recent version of ManifoldCF, 2.8. I'm
>>>>>>>>>>>>>>>>> working on an internal search engine, and for this reason I'm
>>>>>>>>>>>>>>>>> using ManifoldCF to index documents on windows shares. I
>>>>>>>>>>>>>>>>> encountered a serious problem while crawling 35K documents. Most
>>>>>>>>>>>>>>>>> of the time, when ManifoldCF starts crawling a large document
>>>>>>>>>>>>>>>>> (19 MB for example), it ends the job with the following error:
>>>>>>>>>>>>>>>>> repeated service interruptions - failure processing document :
>>>>>>>>>>>>>>>>> software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
