manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nigel Thomas <eml....@gmail.com>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Tue, 29 Jan 2013 22:27:36 GMT
Hello Anthony,

Worth taking a look at the puppet config for the service for launching
the mcf agents at
https://github.com/nigel-v-thomas/puppet-manifoldcf/blob/master/manifests/service.pp

Assuming the log errors are coming from the agent process, you will need
to add the variable "-Djava.security.egd=file:///dev/urandom" into
"multiprocess-example/processes/options.env".
The file options.env will need to be added into the puppet for
management as well.

Best,

Nigel
On 25/01/2013 17:02, Anthony Leonard wrote:
> Hi Nigel,
>
> You could be on to something here. I'll have a chat with the guys here
> as I think we've hit this issue before?
>
> Hope you're well :)  
>
> Not sure how I would configure ManifoldCF to use /dev/urandom ...
>
> Best wishes,
> Anthony.
>
> -- 
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Fri, Jan 25, 2013 at 4:21 PM, Nigel Thomas <eml.nvt@gmail.com
> <mailto:eml.nvt@gmail.com>> wrote:
>
>     Hello Anthony,
>
>     This issue may be related to the entropy exhaustion on Linux using
>     certain versions of JDBC driver, see http://goo.gl/8yA3a
>     have you try passing configuring JDBC to use /dev/urandom instead?
>
>     Cheers,
>
>     Nigel
>
>     On 22/01/2013 17:00, Anthony Leonard wrote:
>>     Dear Karl,
>>
>>     Our DBA noticed that each time our job was run 10 Oracle
>>     connections were created. So, we dropped the "Max connections"
>>     parameter on the repository connection config to 1 and re-ran the
>>     job with the DBA watching. The job worked fine but the DBA
>>     reported that 1 connection was created and then 10 more briefly ...
>>
>>     Out of curiosity we re-ran the job again with no further changes
>>     and this time got the following results:
>>
>>     * the job hung in the "Starting Up" phase again, with the same
>>     logging and symptoms as detailed before on this thread.
>>     * the DBA reported seeing no connections at all this time.
>>     * I have attached a thread dump created by "jstack -F <pid>".
>>     This is reporting all threads as blocked.
>>
>>     Any ideas? Any help with this would certainly be very gratefully
>>     received.
>>
>>     Best wishes,
>>     Anthony.
>>
>>     -- 
>>     Dr Anthony Leonard
>>     System Integrator, Information Directorate
>>     University of York, Heslington, York, UK, YO10 5DD
>>     Tel: +44 (0)1904 434350
>>     <tel:%2B44%20%280%291904%20434350> http://twitter.com/apbleonard
>>     Times Higher Education University of the Year 2010
>>
>>
>>     On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com
>>     <mailto:daddywri@gmail.com>> wrote:
>>
>>         kill -QUIT should not abort the agents process, just cause a
>>         thread
>>         dump.  kill -9 is a different story.
>>
>>         You can also do the same thing by using jstack, in the jvm
>>         bin directory.
>>
>>         Karl
>>
>>
>>         On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>>         <anthony.leonard@york.ac.uk
>>         <mailto:anthony.leonard@york.ac.uk>> wrote:
>>         > Dear Karl,
>>         >
>>         > Many thanks for your insights. I'll do a kill -QUIT next
>>         time we have this
>>         > issue which should hopefully give me the thread dump.
>>         However we've noticed
>>         > that killing processes means we have to run the
>>         "locks-clean" script so it's
>>         > not our favourite way of doing it.
>>         >
>>         > Also I definitely think a timeout for queries would be a
>>         good thing.
>>         >
>>         > I guess we go back to checking that the connection to the
>>         database should
>>         > have been ok last night...
>>         >
>>         > Best wishes,
>>         > Anthony.
>>         >
>>         > --
>>         > Dr Anthony Leonard
>>         > System Integrator, Information Directorate
>>         > University of York, Heslington, York, UK, YO10 5DD
>>         > Tel: +44 (0)1904 434350 <tel:%2B44%20%280%291904%20434350>
>>         http://twitter.com/apbleonard
>>         > Times Higher Education University of the Year 2010
>>         >
>>         >
>>         > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright
>>         <daddywri@gmail.com <mailto:daddywri@gmail.com>> wrote:
>>         >>
>>         >> Hi Anthony,
>>         >>
>>         >> What happens between the framework recognizing that the
>>         job should be
>>         >> started (which it does fine in both cases), and actually
>>         achieving a
>>         >> correct job start, is the "seeding" phase, which is going
>>         to try to
>>         >> execute the seeding query against your Oracle database.
>>          If something
>>         >> happens at that time to hang the JDBC connection's seeding
>>         query, then
>>         >> it precisely explains the behavior you are seeing.
>>         >>
>>         >> It is also the case that the timeout on the queries that
>>         the JDBC
>>         >> connector does is effectively infinite.  This makes me
>>         suspicious that
>>         >> what is happening is an Oracle query is going out but
>>         there is no
>>         >> response ever coming back.
>>         >>
>>         >> The other possibility is that the JDBC connector is in
>>         fact correctly
>>         >> throwing a ServiceInterruption, but that the ManifoldCF
>>         code is either
>>         >> not handling it properly, or the connector is not forming
>>         it properly.
>>         >>  In that case, when you notice a hung job, the startup
>>         thread will be
>>         >> a particular place in the code, and I can diagnose it that
>>         way.
>>         >>
>>         >> The first order of business is therefore to get a thread
>>         dump when the
>>         >> system is "hung".  That will help confirm the picture.
>>          There are a
>>         >> number of additional questions here.
>>         >>
>>         >> (1) Why is this happening?  Is there any possibility that
>>         the Oracle
>>         >> database you are crawling is (very occasionally) not able
>>         to properly
>>         >> respond to a JDBC query?  I can imagine that, under some
>>         network
>>         >> conditions, it might be possible for the Oracle JDBC
>>         driver to wind up
>>         >> waiting indefinitely for a response that never comes.
>>         >>
>>         >> (2) Given that we can't always control the infrastructure
>>         we're trying
>>         >> to crawl through, should we attempt to provide a reasonable
>>         >> workaround?  For example, a timeout on JDBC connector
>>         queries, where
>>         >> we throw a ServiceInterruption if the timeout is exceeded?
>>         >>
>>         >> Karl
>>         >>
>>         >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>>         >> <anthony.leonard@york.ac.uk
>>         <mailto:anthony.leonard@york.ac.uk>> wrote:
>>         >> > Hi there,
>>         >> >
>>         >> > We have recently started running a nightly job 2AM in
>>         ManifoldCF to
>>         >> > extract
>>         >> > data from an Oracle repository and populate a Solr
>>         index. Most nights
>>         >> > this
>>         >> > works fine, but occasionally the job has been hanging at
>>         the "Starting
>>         >> > up"
>>         >> > phase. We have observed this on our test setup also
>>         occasionally. A
>>         >> > restart
>>         >> > of ManifoldCF usually solves this.
>>         >> >
>>         >> > Using the simple history reports today I looked up all
>>         records and
>>         >> > sorted
>>         >> > them by the "Time" column, largest first, and found the
>>         following:
>>         >> >
>>         >> > "Start Time","Activity","Identifier","Result
>>         >> > Code","Bytes","Time","Result
>>         >> > Description"
>>         >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>>         >> > ...","ERROR","0","1926607529","Interrupted: null"
>>         >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>>         >> > ...","ERROR","0","31644956","Interrupted: null"
>>         >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>>         >> > ...","ERROR","0","31637594","Interrupted: null"
>>         >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>>         >> > ...","OK","0","17511",""
>>         >> > ... etc ...
>>         >> >
>>         >> > If the Time column is in millis that means the first
>>         query was hanging
>>         >> > for
>>         >> > 22 days! (This was in the period before we went live
>>         when our live
>>         >> > server
>>         >> > was sitting idle for a while.) The other two occasions
>>         it was hanging
>>         >> > for
>>         >> > about 8 hours until we arrived to restart the job in the
>>         morning. I have
>>         >> > confirmed that the Oracle database we are connecting to
>>         was available
>>         >> > throughout these periods. These times are also too long
>>         for any network
>>         >> > or
>>         >> > database timeouts, which makes me suspect that it's a
>>         problem with the
>>         >> > application.
>>         >> >
>>         >> > We have the following logging config in properties.xml
>>         >> >
>>         >> >   <property name="org.apache.manifoldcf.jobs
>>         <http://org.apache.manifoldcf.jobs>" value="ALL"/>
>>         >> >   <property name="org.apache.manifoldcf.connectors"
>>         value="ALL"/>
>>         >> >   <property name="org.apache.manifoldcf.agents"
>>         value="ALL"/>
>>         >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>>         >> >
>>         >> > The job failed again last night and when I checked at
>>         10:40 AM this
>>         >> > morning
>>         >> > the last few lines of manifoldcf.log were:
>>         >> >
>>         >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -
>>         Checking if job
>>         >> > 1352455005553 needs to be started; it was last checked
>>         at 1358733575454,
>>         >> > and
>>         >> > now it is 1358733585635
>>         >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No
>>         time match found
>>         >> > within interval 1358733575454 to 1358733585635
>>         >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -
>>         Checking if job
>>         >> > 1352455005553 needs to be started; it was last checked
>>         at 1358733585636,
>>         >> > and
>>         >> > now it is 1358733595662
>>         >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No
>>         time match found
>>         >> > within interval 1358733585636 to 1358733595662
>>         >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -
>>         Checking if job
>>         >> > 1352455005553 needs to be started; it was last checked
>>         at 1358733595663,
>>         >> > and
>>         >> > now it is 1358733605813
>>         >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time
>>         match FOUND
>>         >> > within
>>         >> > interval 1358733595663 to 1358733605813
>>         >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>>         '1352455005553'
>>         >> > is
>>         >> > within run window at 1358733605813 ms. (which starts at
>>         1358733600000
>>         >> > ms.)
>>         >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) -
>>         Signalled for job
>>         >> > start
>>         >> > for job 1352455005553
>>         >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>>         >> > 1352455005553
>>         >> > for startup
>>         >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC:
>>         The connect string
>>         >> > is
>>         >> >
>>         'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>>         <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>>         >> >
>>         >> > After that - nothing. Once I restarted manifoldcf this
>>         morning the job
>>         >> > magically restarted itself and the following log
>>         messages were added
>>         >> > where
>>         >> > it had left off:
>>         >> >
>>         >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>>         >> > 1352455005553
>>         >> > back to 'ReadyForStartup' state
>>         >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting
>>         due to restart
>>         >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>>         >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>>         >> > 1352455005553
>>         >> > for startup
>>         >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The
>>         connect string is
>>         >> >
>>         'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>>         <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>>         >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
>>         1352455005553 is
>>         >> > now
>>         >> > started
>>         >> > ... etc ...
>>         >> >
>>         >> > So it appears that the application is running fine with
>>         a Job start
>>         >> > thread
>>         >> > logging nicely every ten seconds until it tries to start
>>         this job and
>>         >> > hangs
>>         >> > entirely until the system is shutdown - even though it
>>         performed that
>>         >> > same
>>         >> > tasks perfectly the night before and later the same day.
>>         >> >
>>         >> > Can anyone advise on what might be happening here? We
>>         are running
>>         >> > ManifoldCF
>>         >> > version 1.0.1 on Ubuntu 10.04.
>>         >> >
>>         >> > Best wishes,
>>         >> > Anthony.
>>         >> >
>>         >> > --
>>         >> > Dr Anthony Leonard
>>         >> > System Integrator, Information Directorate
>>         >> > University of York, Heslington, York, UK, YO10 5DD
>>         >> > Tel: +44 (0)1904 434350
>>         <tel:%2B44%20%280%291904%20434350> http://twitter.com/apbleonard
>>         >> > Times Higher Education University of the Year 2010
>>         >
>>         >
>>
>>
>
>


Mime
View raw message