manifoldcf-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@gmail.com>
Subject Re: Job hanging on "Starting up" with never ending external query.
Date Mon, 21 Jan 2013 14:15:40 GMT
kill -QUIT should not abort the agents process, just cause a thread
dump.  kill -9 is a different story.

You can also do the same thing by using jstack, in the jvm bin directory.

Karl


On Mon, Jan 21, 2013 at 9:04 AM, Anthony Leonard
<anthony.leonard@york.ac.uk> wrote:
> Dear Karl,
>
> Many thanks for your insights. I'll do a kill -QUIT next time we have this
> issue which should hopefully give me the thread dump. However we've noticed
> that killing processes means we have to run the "locks-clean" script so it's
> not our favourite way of doing it.
>
> Also I definitely think a timeout for queries would be a good thing.
>
> I guess we go back to checking that the connection to the database should
> have been ok last night...
>
> Best wishes,
> Anthony.
>
> --
> Dr Anthony Leonard
> System Integrator, Information Directorate
> University of York, Heslington, York, UK, YO10 5DD
> Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
> Times Higher Education University of the Year 2010
>
>
> On Mon, Jan 21, 2013 at 1:25 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Hi Anthony,
>>
>> What happens between the framework recognizing that the job should be
>> started (which it does fine in both cases), and actually achieving a
>> correct job start, is the "seeding" phase, which is going to try to
>> execute the seeding query against your Oracle database.  If something
>> happens at that time to hang the JDBC connection's seeding query, then
>> it precisely explains the behavior you are seeing.
>>
>> It is also the case that the timeout on the queries that the JDBC
>> connector does is effectively infinite.  This makes me suspicious that
>> what is happening is an Oracle query is going out but there is no
>> response ever coming back.
>>
>> The other possibility is that the JDBC connector is in fact correctly
>> throwing a ServiceInterruption, but that the ManifoldCF code is either
>> not handling it properly, or the connector is not forming it properly.
>>  In that case, when you notice a hung job, the startup thread will be
>> a particular place in the code, and I can diagnose it that way.
>>
>> The first order of business is therefore to get a thread dump when the
>> system is "hung".  That will help confirm the picture.  There are a
>> number of additional questions here.
>>
>> (1) Why is this happening?  Is there any possibility that the Oracle
>> database you are crawling is (very occasionally) not able to properly
>> respond to a JDBC query?  I can imagine that, under some network
>> conditions, it might be possible for the Oracle JDBC driver to wind up
>> waiting indefinitely for a response that never comes.
>>
>> (2) Given that we can't always control the infrastructure we're trying
>> to crawl through, should we attempt to provide a reasonable
>> workaround?  For example, a timeout on JDBC connector queries, where
>> we throw a ServiceInterruption if the timeout is exceeded?
>>
>> Karl
>>
>> On Mon, Jan 21, 2013 at 7:57 AM, Anthony Leonard
>> <anthony.leonard@york.ac.uk> wrote:
>> > Hi there,
>> >
>> > We have recently started running a nightly job 2AM in ManifoldCF to
>> > extract
>> > data from an Oracle repository and populate a Solr index. Most nights
>> > this
>> > works fine, but occasionally the job has been hanging at the "Starting
>> > up"
>> > phase. We have observed this on our test setup also occasionally. A
>> > restart
>> > of ManifoldCF usually solves this.
>> >
>> > Using the simple history reports today I looked up all records and
>> > sorted
>> > them by the "Time" column, largest first, and found the following:
>> >
>> > "Start Time","Activity","Identifier","Result
>> > Code","Bytes","Time","Result
>> > Description"
>> > "11-12-2012 05:00:05.941","external query" "... SQL QUERY
>> > ...","ERROR","0","1926607529","Interrupted: null"
>> > "01-21-2013 02:00:11.843","external query" "... SQL QUERY
>> > ...","ERROR","0","31644956","Interrupted: null"
>> > "01-17-2013 02:00:03.600","external query" "... SQL QUERY
>> > ...","ERROR","0","31637594","Interrupted: null"
>> > "12-04-2012 12:12:19.860","external query" "... SQL QUERY
>> > ...","OK","0","17511",""
>> > ... etc ...
>> >
>> > If the Time column is in millis that means the first query was hanging
>> > for
>> > 22 days! (This was in the period before we went live when our live
>> > server
>> > was sitting idle for a while.) The other two occasions it was hanging
>> > for
>> > about 8 hours until we arrived to restart the job in the morning. I have
>> > confirmed that the Oracle database we are connecting to was available
>> > throughout these periods. These times are also too long for any network
>> > or
>> > database timeouts, which makes me suspect that it's a problem with the
>> > application.
>> >
>> > We have the following logging config in properties.xml
>> >
>> >   <property name="org.apache.manifoldcf.jobs" value="ALL"/>
>> >   <property name="org.apache.manifoldcf.connectors" value="ALL"/>
>> >   <property name="org.apache.manifoldcf.agents" value="ALL"/>
>> >   <property name="org.apache.manifoldcf.misc" value="ALL"/>
>> >
>> > The job failed again last night and when I checked at 10:40 AM this
>> > morning
>> > the last few lines of manifoldcf.log were:
>> >
>> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) - Checking if job
>> > 1352455005553 needs to be started; it was last checked at 1358733575454,
>> > and
>> > now it is 1358733585635
>> > DEBUG 2013-01-21 01:59:45,654 (Job start thread) -  No time match found
>> > within interval 1358733575454 to 1358733585635
>> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) - Checking if job
>> > 1352455005553 needs to be started; it was last checked at 1358733585636,
>> > and
>> > now it is 1358733595662
>> > DEBUG 2013-01-21 01:59:55,805 (Job start thread) -  No time match found
>> > within interval 1358733585636 to 1358733595662
>> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Checking if job
>> > 1352455005553 needs to be started; it was last checked at 1358733595663,
>> > and
>> > now it is 1358733605813
>> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) -  Time match FOUND
>> > within
>> > interval 1358733595663 to 1358733605813
>> > DEBUG 2013-01-21 02:00:05,821 (Job start thread) - Job '1352455005553'
>> > is
>> > within run window at 1358733605813 ms. (which starts at 1358733600000
>> > ms.)
>> > DEBUG 2013-01-21 02:00:05,830 (Job start thread) - Signalled for job
>> > start
>> > for job 1352455005553
>> > DEBUG 2013-01-21 02:00:11,674 (Startup thread) - Marked job
>> > 1352455005553
>> > for startup
>> > DEBUG 2013-01-21 02:00:11,843 (Thread-951922) - JDBC: The connect string
>> > is
>> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> >
>> > After that - nothing. Once I restarted manifoldcf this morning the job
>> > magically restarted itself and the following log messages were added
>> > where
>> > it had left off:
>> >
>> > DEBUG 2013-01-21 10:47:36,852 (Startup thread) - Setting job
>> > 1352455005553
>> > back to 'ReadyForStartup' state
>> > DEBUG 2013-01-21 10:48:57,538 (Thread-98) - Resetting due to restart
>> > DEBUG 2013-01-21 10:48:57,958 (Thread-98) - Reset complete
>> > DEBUG 2013-01-21 10:48:58,200 (Startup thread) - Marked job
>> > 1352455005553
>> > for startup
>> > DEBUG 2013-01-21 10:48:58,367 (Thread-184) - JDBC: The connect string is
>> > 'jdbc:oracle:thin:@//oradwhlive.york.ac.uk:1521/dwhlive.csrv.york'
>> > DEBUG 2013-01-21 10:49:02,071 (Startup thread) - Job 1352455005553 is
>> > now
>> > started
>> > ... etc ...
>> >
>> > So it appears that the application is running fine with a Job start
>> > thread
>> > logging nicely every ten seconds until it tries to start this job and
>> > hangs
>> > entirely until the system is shutdown - even though it performed that
>> > same
>> > tasks perfectly the night before and later the same day.
>> >
>> > Can anyone advise on what might be happening here? We are running
>> > ManifoldCF
>> > version 1.0.1 on Ubuntu 10.04.
>> >
>> > Best wishes,
>> > Anthony.
>> >
>> > --
>> > Dr Anthony Leonard
>> > System Integrator, Information Directorate
>> > University of York, Heslington, York, UK, YO10 5DD
>> > Tel: +44 (0)1904 434350 http://twitter.com/apbleonard
>> > Times Higher Education University of the Year 2010
>
>

Mime
View raw message