manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anthony Leonard <anthony.leon...@york.ac.uk>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Tue, 22 Jan 2013 17:00:55 GMT
Dear Karl,

Our DBA noticed that each time our job was run 10 Oracle connections were
created. So, we dropped the "Max connections" parameter on the repository
connection config to 1 and re-ran the job with the DBA watching. The job
worked fine but the DBA reported that 1 connection was created and then 10
more briefly ...

Out of curiosity we re-ran the job again with no further changes and this
time got the following results:

* the job hung in the "Starting Up" phase again, with the same logging and
symptoms as detailed before on this thread.
* the DBA reported seeing no connections at all this time.
* I have attached a thread dump created by "jstack -F <pid>". This is
reporting all threads as blocked.

Any ideas? Any help with this would certainly be very gratefully received.

Best wishes,
Anthony.

-- 
Dr Anthony Leonard
System Integrator, Information Directorate
University of York, Heslington, York, UK, YO10 5DD
Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
Times Higher Education University of the Year 2010


On Mon, Jan 21, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:

> kill -QUIT should not abort the agents process, just cause a thread
> dump.  kill -9 is a different story.
>
> You can also do the same thing by using jstack, in the jvm bin directory.
>
> Karl
>
>
> On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
> <anthony.leonard@york.ac.uk> wrote:
> > Dear Karl,
> >
> > Many thanks for your insights. I'll do a kill -QUIT next time we have
> this
> > issue which should hopefully give me the thread dump. However we've
> noticed
> > that killing processes means we have to run the "locks-clean" script so
> it's
> > not our favourite way of doing it.
> >
> > Also I definitely think a timeout for queries would be a good thing.
> >
> > I guess we go back to checking that the connection to the database should
> > have been ok last night...
> >
> > Best wishes,
> > Anthony.
> >
> > --
> > Dr Anthony Leonard
> > System Integrator, Information Directorate
> > University of York, Heslington, York, UK, YO10 5DD
> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> > Times Higher Education University of the Year 2010
> >
> >
> > On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> Hi Anthony,
> >>
> >> What happens between the framework recognizing that the job should be
> >> started (which it does fine in both cases), and actually achieving a
> >> correct job start, is the "seeding" phase, which is going to try to
> >> execute the seeding query against your Oracle database.  If something
> >> happens at that time to hang the JDBC connection's seeding query, then
> >> it precisely explains the behavior you are seeing.
> >>
> >> It is also the case that the timeout on the queries that the JDBC
> >> connector does is effectively infinite.  This makes me suspicious that
> >> what is happening is an Oracle query is going out but there is no
> >> response ever coming back.
> >>
> >> The other possibility is that the JDBC connector is in fact correctly
> >> throwing a ServiceInterruption, but that the ManifoldCF code is either
> >> not handling it properly, or the connector is not forming it properly.
> >>  In that case, when you notice a hung job, the startup thread will be
> >> a particular place in the code, and I can diagnose it that way.
> >>
> >> The first order of business is therefore to get a thread dump when the
> >> system is "hung".  That will help confirm the picture.  There are a
> >> number of additional questions here.
> >>
> >> (1) Why is this happening?  Is there any possibility that the Oracle
> >> database you are crawling is (very occasionally) not able to properly
> >> respond to a JDBC query?  I can imagine that, under some network
> >> conditions, it might be possible for the Oracle JDBC driver to wind up
> >> waiting indefinitely for a response that never comes.
> >>
> >> (2) Given that we can't always control the infrastructure we're trying
> >> to crawl through, should we attempt to provide a reasonable
> >> workaround?  For example, a timeout on JDBC connector queries, where
> >> we throw a ServiceInterruption if the timeout is exceeded?
> >>
> >> Karl
> >>
> >> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
> >> <anthony.leonard@york.ac.uk> wrote:
> >> > Hi there,
> >> >
> >> > We have recently started running a nightly job 2AM in ManifoldCF to
> >> > extract
> >> > data from an Oracle repository and populate a Solr index. Most nights
> >> > this
> >> > works fine, but occasionally the job has been hanging at the "Starting
> >> > up"
> >> > phase. We have observed this on our test setup also occasionally. A
> >> > restart
> >> > of ManifoldCF usually solves this.
> >> >
> >> > Using the simple history reports today I looked up all records and
> >> > sorted
> >> > them by the "Time" column, largest first, and found the following:
> >> >
> >> > "Start Time","Activity","Identifier","Result
> >> > Code","Bytes","Time","Result
> >> > Description"
> >> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
> >> > ...","ERROR","0","1926607529","Interrupted: null"
> >> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
> >> > ...","ERROR","0","31644956","Interrupted: null"
> >> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
> >> > ...","ERROR","0","31637594","Interrupted: null"
> >> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
> >> > ...","OK","0","17511",""
> >> > ... etc ...
> >> >
> >> > If the Time column is in millis that means the first query was hanging
> >> > for
> >> > 22 days! (This was in the period before we went live when our live
> >> > server
> >> > was sitting idle for a while.) The other two occasions it was hanging
> >> > for
> >> > about 8 hours until we arrived to restart the job in the morning. I
> have
> >> > confirmed that the Oracle database we are connecting to was available
> >> > throughout these periods. These times are also too long for any
> network
> >> > or
> >> > database timeouts, which makes me suspect that it's a problem with the
> >> > application.
> >> >
> >> > We have the following logging config in properties.xml
> >> >
> >> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
> >> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
> >> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
> >> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
> >> >
> >> > The job failed again last night and when I checked at 10:40 AM this
> >> > morning
> >> > the last few lines of manifoldcf.log were:
> >> >
> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
> >> > 1352455005553 needs to be started; it was last checked at
> 1358733575454,
> >> > and
> >> > now it is 1358733585635
> >> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match
> found
> >> > within interval 1358733575454 to 1358733585635
> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
> >> > 1352455005553 needs to be started; it was last checked at
> 1358733585636,
> >> > and
> >> > now it is 1358733595662
> >> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match
> found
> >> > within interval 1358733585636 to 1358733595662
> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
> >> > 1352455005553 needs to be started; it was last checked at
> 1358733595663,
> >> > and
> >> > now it is 1358733605813
> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
> >> > within
> >> > interval 1358733595663 to 1358733605813
> >> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553'
> >> > is
> >> > within run window at 1358733605813 ms. (which starts at 1358733600000
> >> > ms.)
> >> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
> >> > start
> >> > for job 1352455005553
> >> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
> >> > 1352455005553
> >> > for startup
> >> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect
> string
> >> > is
> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >> >
> >> > After that - nothing. Once I restarted manifoldcf this morning the job
> >> > magically restarted itself and the following log messages were added
> >> > where
> >> > it had left off:
> >> >
> >> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
> >> > 1352455005553
> >> > back to 'ReadyForStartup' state
> >> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
> >> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
> >> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
> >> > 1352455005553
> >> > for startup
> >> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string
> is
> >> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
> >> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
> >> > now
> >> > started
> >> > ... etc ...
> >> >
> >> > So it appears that the application is running fine with a Job start
> >> > thread
> >> > logging nicely every ten seconds until it tries to start this job and
> >> > hangs
> >> > entirely until the system is shutdown - even though it performed that
> >> > same
> >> > tasks perfectly the night before and later the same day.
> >> >
> >> > Can anyone advise on what might be happening here? We are running
> >> > ManifoldCF
> >> > version 1.0.1 on Ubuntu 10.04.
> >> >
> >> > Best wishes,
> >> > Anthony.
> >> >
> >> > --
> >> > Dr Anthony Leonard
> >> > System Integrator, Information Directorate
> >> > University of York, Heslington, York, UK, YO10 5DD
> >> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> >> > Times Higher Education University of the Year 2010
> >
> >
>

Mime
View raw message