manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nigel Thomas <eml....@gmail.com>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Fri, 25 Jan 2013 16:21:59 GMT
Hello Anthony,

This issue may be related to the entropy exhaustion on Linux using
certain versions of JDBC driver, see http://goo.gl/8yA3a
have you try passing configuring JDBC to use /dev/urandom instead?

Cheers,

Nigel
On 22/01/2013 17:00, Anthony Leonard wrote:
> Dear Karl,
>
> Our DBA noticed that each time our job was run 10 Oracle connections
> were created. So, we dropped the "Max connections" parameter on the
> repository connection config to 1 and re-ran the job with the DBA
> watching. The job worked fine but the DBA reported that 1 connection
> was created and then 10 more briefly ...
>
> Out of curiosity we re-ran the job again with no further changes and
> this time got the following results:
>
> * the job hung in the "Starting Up" phase again, with the same logging
> and symptoms as detailed before on this thread.
> * the DBA reported seeing no connections at all this time.
> * I have attached a thread dump created by "jstack -F <pid>". This is
> reporting all threads as blocked.
>
> Any ideas? Any help with this would certainly be very gratefully received.
>
> Best wishes,
> Anthony.
>
> -- 
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com
> <mailto:daddywri@gmail.com>> wrote:
>
>     kill -QUIT should not abort the agents process, just cause a thread
>     dump.  kill -9 is a different story.
>
>     You can also do the same thing by using jstack, in the jvm bin
>     directory.
>
>     Karl
>
>
>     On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
>     <anthony.leonard@york.ac.uk <mailto:anthony.leonard@york.ac.uk>>
>     wrote:
>     > Dear Karl,
>     >
>     > Many thanks for your insights. I'll do a kill -QUIT next time we
>     have this
>     > issue which should hopefully give me the thread dump. However
>     we've noticed
>     > that killing processes means we have to run the "locks-clean"
>     script so it's
>     > not our favourite way of doing it.
>     >
>     > Also I definitely think a timeout for queries would be a good thing.
>     >
>     > I guess we go back to checking that the connection to the
>     database should
>     > have been ok last night...
>     >
>     > Best wishes,
>     > Anthony.
>     >
>     > --
>     > Dr Anthony Leonard
>     > System Integrator, Information Directorate
>     > University of York, Heslington, York, UK, YO10 5DD
>     > Tel: +44 (0)1904 434350 <tel:%2B44%20%280%291904%20434350>
>     http://twitter.com/apbleonard
>     > Times Higher Education University of the Year 2010
>     >
>     >
>     > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com
>     <mailto:daddywri@gmail.com>> wrote:
>     >>
>     >> Hi Anthony,
>     >>
>     >> What happens between the framework recognizing that the job
>     should be
>     >> started (which it does fine in both cases), and actually
>     achieving a
>     >> correct job start, is the "seeding" phase, which is going to try to
>     >> execute the seeding query against your Oracle database.  If
>     something
>     >> happens at that time to hang the JDBC connection's seeding
>     query, then
>     >> it precisely explains the behavior you are seeing.
>     >>
>     >> It is also the case that the timeout on the queries that the JDBC
>     >> connector does is effectively infinite.  This makes me
>     suspicious that
>     >> what is happening is an Oracle query is going out but there is no
>     >> response ever coming back.
>     >>
>     >> The other possibility is that the JDBC connector is in fact
>     correctly
>     >> throwing a ServiceInterruption, but that the ManifoldCF code is
>     either
>     >> not handling it properly, or the connector is not forming it
>     properly.
>     >>  In that case, when you notice a hung job, the startup thread
>     will be
>     >> a particular place in the code, and I can diagnose it that way.
>     >>
>     >> The first order of business is therefore to get a thread dump
>     when the
>     >> system is "hung".  That will help confirm the picture.  There are a
>     >> number of additional questions here.
>     >>
>     >> (1) Why is this happening?  Is there any possibility that the
>     Oracle
>     >> database you are crawling is (very occasionally) not able to
>     properly
>     >> respond to a JDBC query?  I can imagine that, under some network
>     >> conditions, it might be possible for the Oracle JDBC driver to
>     wind up
>     >> waiting indefinitely for a response that never comes.
>     >>
>     >> (2) Given that we can't always control the infrastructure we're
>     trying
>     >> to crawl through, should we attempt to provide a reasonable
>     >> workaround?  For example, a timeout on JDBC connector queries,
>     where
>     >> we throw a ServiceInterruption if the timeout is exceeded?
>     >>
>     >> Karl
>     >>
>     >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>     >> <anthony.leonard@york.ac.uk
>     <mailto:anthony.leonard@york.ac.uk>> wrote:
>     >> > Hi there,
>     >> >
>     >> > We have recently started running a nightly job 2AM in
>     ManifoldCF to
>     >> > extract
>     >> > data from an Oracle repository and populate a Solr index.
>     Most nights
>     >> > this
>     >> > works fine, but occasionally the job has been hanging at the
>     "Starting
>     >> > up"
>     >> > phase. We have observed this on our test setup also
>     occasionally. A
>     >> > restart
>     >> > of ManifoldCF usually solves this.
>     >> >
>     >> > Using the simple history reports today I looked up all
>     records and
>     >> > sorted
>     >> > them by the "Time" column, largest first, and found the
>     following:
>     >> >
>     >> > "Start Time","Activity","Identifier","Result
>     >> > Code","Bytes","Time","Result
>     >> > Description"
>     >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>     >> > ...","ERROR","0","1926607529","Interrupted: null"
>     >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>     >> > ...","ERROR","0","31644956","Interrupted: null"
>     >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>     >> > ...","ERROR","0","31637594","Interrupted: null"
>     >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>     >> > ...","OK","0","17511",""
>     >> > ... etc ...
>     >> >
>     >> > If the Time column is in millis that means the first query
>     was hanging
>     >> > for
>     >> > 22 days! (This was in the period before we went live when our
>     live
>     >> > server
>     >> > was sitting idle for a while.) The other two occasions it was
>     hanging
>     >> > for
>     >> > about 8 hours until we arrived to restart the job in the
>     morning. I have
>     >> > confirmed that the Oracle database we are connecting to was
>     available
>     >> > throughout these periods. These times are also too long for
>     any network
>     >> > or
>     >> > database timeouts, which makes me suspect that it's a problem
>     with the
>     >> > application.
>     >> >
>     >> > We have the following logging config in properties.xml
>     >> >
>     >> >   <property name="org.apache.manifoldcf.jobs
>     <http://org.apache.manifoldcf.jobs>" value="ALL"/>
>     >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>     >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>     >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>     >> >
>     >> > The job failed again last night and when I checked at 10:40
>     AM this
>     >> > morning
>     >> > the last few lines of manifoldcf.log were:
>     >> >
>     >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking
>     if job
>     >> > 1352455005553 needs to be started; it was last checked at
>     1358733575454,
>     >> > and
>     >> > now it is 1358733585635
>     >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time
>     match found
>     >> > within interval 1358733575454 to 1358733585635
>     >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking
>     if job
>     >> > 1352455005553 needs to be started; it was last checked at
>     1358733585636,
>     >> > and
>     >> > now it is 1358733595662
>     >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time
>     match found
>     >> > within interval 1358733585636 to 1358733595662
>     >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking
>     if job
>     >> > 1352455005553 needs to be started; it was last checked at
>     1358733595663,
>     >> > and
>     >> > now it is 1358733605813
>     >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time
>     match FOUND
>     >> > within
>     >> > interval 1358733595663 to 1358733605813
>     >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job
>     '1352455005553'
>     >> > is
>     >> > within run window at 1358733605813 ms. (which starts at
>     1358733600000
>     >> > ms.)
>     >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled
>     for job
>     >> > start
>     >> > for job 1352455005553
>     >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>     >> > 1352455005553
>     >> > for startup
>     >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The
>     connect string
>     >> > is
>     >> >
>     'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>     <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>     >> >
>     >> > After that - nothing. Once I restarted manifoldcf this
>     morning the job
>     >> > magically restarted itself and the following log messages
>     were added
>     >> > where
>     >> > it had left off:
>     >> >
>     >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>     >> > 1352455005553
>     >> > back to 'ReadyForStartup' state
>     >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to
>     restart
>     >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>     >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>     >> > 1352455005553
>     >> > for startup
>     >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The
>     connect string is
>     >> >
>     'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york
>     <http://oradwhlive.york.ac.uk:1521/dwhlive.csrv.york>'
>     >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job
>     1352455005553 is
>     >> > now
>     >> > started
>     >> > ... etc ...
>     >> >
>     >> > So it appears that the application is running fine with a Job
>     start
>     >> > thread
>     >> > logging nicely every ten seconds until it tries to start this
>     job and
>     >> > hangs
>     >> > entirely until the system is shutdown - even though it
>     performed that
>     >> > same
>     >> > tasks perfectly the night before and later the same day.
>     >> >
>     >> > Can anyone advise on what might be happening here? We are running
>     >> > ManifoldCF
>     >> > version 1.0.1 on Ubuntu 10.04.
>     >> >
>     >> > Best wishes,
>     >> > Anthony.
>     >> >
>     >> > --
>     >> > Dr Anthony Leonard
>     >> > System Integrator, Information Directorate
>     >> > University of York, Heslington, York, UK, YO10 5DD
>     >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>     >> > Times Higher Education University of the Year 2010
>     >
>     >
>
>


Mime
View raw message