manifoldcf-user mailing list archives

From Priya Arora <pr...@smartshore.nl>
Subject Re: Manifold Crawler Crashes
Date Thu, 20 Jun 2019 12:42:09 GMT
> I would highly recommend moving to Postgresql if you have any really
> sizable crawl.

Yes, we are already using PostgreSQL 9.6.10 for it. Below are the settings
in the postgresql.conf file on our Postgres server:

max_connections = 100
shared_buffers = 128MB
#temp_buffers = 8MB
#max_prepared_transactions = 0
#max_files_per_process = 1000
#autovacuum = on
#deadlock_timeout = 1s
#max_locks_per_transaction = 64
#max_pred_locks_per_transaction = 64

Can you please check whether these parameters are sufficient to handle
multiple jobs ingesting a large volume of data (8 lakh / 800,000 documents
or more) into an index? If not, can you please let me know what values
these parameters should be set to, at maximum, for an optimal run of the
jobs?
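For reference, a possible starting point for postgresql.conf on a 16 GB host driving a crawl of this size is sketched below. All values here are assumptions for illustration, not official ManifoldCF recommendations; they would need to be validated against the actual workload.

```conf
# postgresql.conf — hedged starting point for a ~800k-document crawl
# on a 16 GB host (illustrative values; tune against your workload)
max_connections = 200            # headroom above ManifoldCF's DB handle pool
shared_buffers = 1024MB          # the 128MB default is very small for this load
work_mem = 16MB
maintenance_work_mem = 256MB
autovacuum = on                  # keep enabled; crawls churn rows heavily
checkpoint_completion_target = 0.9
max_locks_per_transaction = 128  # raise only if you see lock-table exhaustion
```

After changing these, Postgres must be restarted (shared_buffers and max_connections are not reloadable at runtime).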

> Alternatively you could just hand the manifoldCF process more memory.
> Your choice.

Can you please help me with how to achieve this?
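If this is the single-process example deployment, the heap is set by the JVM options file that the example start scripts read (options.env.unix on Linux, options.env.win on Windows). A minimal sketch, assuming a 16 GB host where 4 GB can be spared for ManifoldCF; the numbers are illustrative, and in a Docker setup the container's own memory limit must be raised to match:

```conf
# options.env.unix — JVM options read by the ManifoldCF example start
# scripts; raising -Xmx gives the combined agents process more heap.
# 4096m is an assumed value for a 16 GB host, not an official figure.
-Xms1024m
-Xmx4096m
```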

Also, do we have to reduce the maximum number of connections on both the
Repository and Output connections? Could that be the cause of the heavy
memory load (due to multiple jobs running together) that leads to the
heap out-of-memory error?
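On the connection arithmetic: the "Max connections" figure in ManifoldCF is configured per connection definition, not per job, so jobs that share one repository connection also share its single pool. A minimal sketch of the bookkeeping, with assumed values throughout (the db_maxhandles figure is hypothetical, standing in for the database handle pool configured in properties.xml):

```python
# Sketch of the connection bookkeeping (all values assumed for illustration).

postgres_max_connections = 100   # from the postgresql.conf shown above

# ManifoldCF's "Max connections" is per connection *definition*: three
# jobs sharing one repository connection draw from one pool of 48, so
# the totals do not become 48 + 48 + 48.
repository_pool = 48
output_pool = 48

# What actually consumes Postgres sessions is the database handle pool
# (hypothetical value here). It must stay below Postgres's
# max_connections, with headroom for other clients.
db_maxhandles = 50

assert db_maxhandles < postgres_max_connections
print(repository_pool, output_pool, db_maxhandles)
```

The point of the sketch: repository/output pools bound crawler-side activity, while only the database handle pool competes for Postgres sessions.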




On Thu, Jun 20, 2019 at 5:04 PM Karl Wright <daddywri@gmail.com> wrote:

> If you are running single-process on top of HSQLDB, all database tables
> are kept in memory so you need a lot of memory.
>
> I would highly recommend moving to Postgresql if you have any really
> sizable crawl.
>
> Alternatively you could just hand the manifoldCF process more memory.
> Your choice.
>
> However, if you cannot even use bash to get into the instance, something
> far more serious is happening to your docker world.
>
> Karl
>
>
> On Thu, Jun 20, 2019 at 6:27 AM Priya Arora <priya@smartshore.nl> wrote:
>
>> Hi Karl,
>> 1) It's single process deployment process.
>> 2) Not able to access through bash (while the crash is happening)
>> 3) Server Configuration:-
>>  For Crawler server - 16 GB RAM and 8-Core Intel(R) Xeon(R) CPU E5-2660
>> v3 @ 2.60GHz and
>> For Elasticsearch server - 48GB and 1-Core Intel(R) Xeon(R) CPU E5-2660
>> v3 @ 2.60GHz
>> 4) Manifold configuration:-
>> Repository Max connection:-48
>> Output Max connections:-48
>>
>> This crash happens when we are running more than two parallel jobs with
>> almost the same configuration at a time.
>> [image: image.png]
>>
>> Also, we are seeing these warnings in the log file. They seem to be the
>> reason for the crash:
>>
>> agents process ran out of memory - shutting down
>> java.lang.OutOfMemoryError: Java heap space
>>         at java.util.Arrays.copyOf(Arrays.java:3308)
>>         at java.util.BitSet.ensureCapacity(BitSet.java:337)
>>         at java.util.BitSet.expandTo(BitSet.java:352)
>>         at java.util.BitSet.set(BitSet.java:447)
>>         at
>> de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>>         at
>> org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:47)
>>         at
>> org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:83)
>>         at
>> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:141)
>>         at
>> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:288)
>>         at
>> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:284)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:47)
>>         at
>> org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:83)
>>         at
>> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:141)
>>         at
>> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:288)
>>         at
>> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:284)
>>         at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>>         at
>> org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>>
>> On Thu, Jun 20, 2019 at 3:36 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Priya,
>>>
>>> Being unable to reach the web interface sounds like either a network
>>> issue or a problem with the app server.
>>>
>>> Can you describe the configuration you are running in?  Is this a
>>> multiprocess deployment or a single-process deployment?
>>>
>>> When your docker container dies, can you still reach it via the standard
>>> in-container bash tools?  What is happening there?
>>>
>>> Karl
>>>
>>>
>>> On Thu, Jun 20, 2019 at 5:54 AM Priya Arora <priya@smartshore.nl> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> Crash here means that a "the site could not be reached" kind of HTML
>>>> page appears when accessing
>>>> http://localhost:3000/mcf-crawler-ui/index.jsp.
>>>> Explanation: when running a certain job on the ManifoldCF server (2.13),
>>>> after some time in a successful running state, the browser suddenly
>>>> gives me a "the site could not be reached" error, and the page does not
>>>> reload until I restart the container through a docker command.
>>>> Once I restart the container through docker, MCF loads again.
>>>>
>>>> Thanks
>>>> Priya
>>>>
>>>> On Thu, Jun 20, 2019 at 3:08 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Please describe what you mean by "crash".  What actually happens?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Jun 20, 2019, 2:04 AM Priya Arora <priya@smartshore.nl> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am running multiple jobs (2 or 3) simultaneously on the ManifoldCF
>>>>>> server, and the configuration is:
>>>>>>
>>>>>> 1) For Crawler server - 16 GB RAM and 8-Core Intel(R) Xeon(R) CPU
>>>>>> E5-2660 v3 @ 2.60GHz and
>>>>>>
>>>>>> 2) For Elasticsearch server - 48GB and 1-Core Intel(R) Xeon(R) CPU
>>>>>> E5-2660 v3 @ 2.60GHz
>>>>>> The jobs' work is to fetch data from some public and intranet sites
>>>>>> and then ingest it into Elasticsearch.
>>>>>>
>>>>>> Maximum connection on both Repository connections and Output
>>>>>> connection is 48(for all 3 jobs).
>>>>>>
>>>>>> The problem I am facing is that when I am running multiple jobs,
>>>>>> ManifoldCF crashes after some time, and there is nothing inside the
>>>>>> manifold.log files that hints at any error.
>>>>>> Do the maximum connections add up (48+48+48) while running all
>>>>>> three jobs together?
>>>>>> So do I need to divide the max connections (48) among all three jobs?
>>>>>> How many connections, at maximum, can we have to run the jobs
>>>>>> individually and simultaneously?
>>>>>>
>>>>>> What should be the maximum allowed number of max handles in the
>>>>>> properties.xml file and the Postgres config file?
>>>>>>
>>>>>> So the problem is to figure out the reason for the crawler crash.
>>>>>> Can you please help me with that as soon as possible?
>>>>>>
>>>>>> Thanks and regards
>>>>>> Priya
>>>>>> priya@smartshore.nl
>>>>>>
>>>>>>
>>>>>>
