manifoldcf-user mailing list archives

From Ameya Aware <ameya.aw...@gmail.com>
Subject Re: Performance issues
Date Fri, 18 Jul 2014 17:31:19 GMT
No Karl,

I did not do VACUUM here.

Why would the queries stop after running for about 420 seconds? Is it because
of the errors coming in?
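One way to see whether a VACUUM (or any other long statement) is what stalls things would be to watch PostgreSQL's pg_stat_activity view while the stall is happening. This is a generic monitoring query, not anything MCF-specific; the `state` and `query` column names assume PostgreSQL 9.2 or later:

```sql
-- Show statements that have been running for more than a minute,
-- longest-running first (column names require PostgreSQL 9.2+).
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY runtime DESC;
```

If a `VACUUM` (or autovacuum worker) shows up alongside the stuck UPDATEs, that would point at maintenance as the cause of the 420-second stalls.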


On Fri, Jul 18, 2014 at 12:32 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Ameya,
>
> For future reference, when you see stuff like this in the log:
>
> >>>>>>
>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') - Found a long-running
> query (458934 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
> t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
> t1.isnew=?))]
>  WARN 2014-07-18 11:19:36,505 (Worker thread '4') - Found a long-running
> query (420965 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
> t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
> t1.isnew=?))]
>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 0: 'D'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '19') - Found a long-running
> query (421120 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
> t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
> t1.isnew=?))]
>  WARN 2014-07-18 11:19:36,505 (Worker thread '10') - Found a long-running
> query (420985 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
> t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
> t1.isnew=?))]
>  WARN 2014-07-18 11:19:36,505 (Worker thread '11') - Found a long-running
> query (421173 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
> t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
> t1.isnew=?))]
>  WARN 2014-07-18 11:19:36,505 (Worker thread '4') -   Parameter 0: 'D'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '11') -   Parameter 0: 'D'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '10') -   Parameter 0: 'D'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 1: '-1'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '19') -   Parameter 0: 'D'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 2:
> '1405692432586'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '10') -   Parameter 1: '-1'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '22') - Found a long-running
> query (421052 ms): [UPDATE hopcount SET deathmark=?,distance=? WHERE id
> IN(SELECT ownerid FROM hopdeletedeps t0 WHERE t0.jobid=? AND
> t0.childidhash=? AND EXISTS(SELECT 'x' FROM intrinsiclink t1 WHERE
> t1.jobid=t0.jobid AND t1.linktype=t0.linktype AND
> t1.parentidhash=t0.parentidhash AND t1.childidhash=t0.childidhash AND
> t1.isnew=?))]
>  WARN 2014-07-18 11:19:36,505 (Worker thread '11') -   Parameter 1: '-1'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '4') -   Parameter 1: '-1'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '11') -   Parameter 2:
> '1405692432586'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 0: 'D'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '10') -   Parameter 2:
> '1405692432586'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '39') -   Parameter 3:
> '9ABFEB709B646CD0C84B4B7B6300E2C9BD5E3477'
>  WARN 2014-07-18 11:19:36,505 (Worker thread '19') -   Parameter 1: '-1'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '39') -   Parameter 4: 'B'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '10') -   Parameter 3:
> 'A932EC77CEF156EA26A4239F12BAB365E6B4F58D'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 1: '-1'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '11') -   Parameter 3:
> '9DFF75EBE13D0AAE8AFF025E992C68AB203ED1CB'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '4') -   Parameter 2:
> '1405692432586'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '11') -   Parameter 4: 'B'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 2:
> '1405692432586'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 3:
> '023FDBD3638711F4E55A918B862A064161B0892A'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '22') -   Parameter 4: 'B'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '10') -   Parameter 4: 'B'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '19') -   Parameter 2:
> '1405692432586'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '4') -   Parameter 3:
> '0158B8EDFEE3DDB10113B6D6E378D5FBF165E1FD'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '19') -   Parameter 3:
> 'FD9641C67D0C1EC22B5F05671513D4DD71B4582C'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '4') -   Parameter 4: 'B'
>  WARN 2014-07-18 11:19:36,506 (Worker thread '19') -   Parameter 4: 'B'
> <<<<<<
>
> ... it means that MANY queries basically stopped running for about 420
> seconds.  I bet you did a VACUUM then, right?
>
> Karl
>
>
>
> On Fri, Jul 18, 2014 at 12:30 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Ameya,
>>
>> The log file is full of errors of all sorts.  For example:
>>
>> >>>>>
>>  WARN 2014-07-17 17:32:38,709 (Worker thread '41') - IO exception during
>> indexing
>> file:/C:/Program%20Files/eclipse/configuration/org.eclipse.osgi/.manager/.tmp2043698995563843992.instance:
>> The process cannot access the file because another process has locked a
>> portion of the file
>> java.io.IOException: The process cannot access the file because another
>> process has locked a portion of the file
>>     at java.io.FileInputStream.readBytes(Native Method)
>>     at java.io.FileInputStream.read(Unknown Source)
>>     at
>> org.apache.http.entity.mime.content.InputStreamBody.writeTo(InputStreamBody.java:91)
>>     at
>> org.apache.manifoldcf.agents.output.solr.ModifiedHttpMultipart.doWriteTo(ModifiedHttpMultipart.java:211)
>>     at
>> org.apache.manifoldcf.agents.output.solr.ModifiedHttpMultipart.writeTo(ModifiedHttpMultipart.java:229)
>>     at
>> org.apache.manifoldcf.agents.output.solr.ModifiedMultipartEntity.writeTo(ModifiedMultipartEntity.java:187)
>>     at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>     at java.lang.reflect.Method.invoke(Unknown Source)
>>     at
>> org.apache.http.impl.execchain.RequestEntityExecHandler.invoke(RequestEntityExecHandler.java:77)
>>     at com.sun.proxy.$Proxy0.writeTo(Unknown Source)
>>     at
>> org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:155)
>>     at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>     at java.lang.reflect.Method.invoke(Unknown Source)
>>     at org.apache.http.impl.conn.CPoolProxy.invoke(CPoolProxy.java:138)
>>     at com.sun.proxy.$Proxy1.sendRequestEntity(Unknown Source)
>>     at
>> org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
>>     at
>> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
>>     at
>> org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
>>     at
>> org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
>>     at
>> org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
>>     at
>> org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
>>     at
>> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>>     at
>> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>>     at
>> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>>     at
>> org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:292)
>>     at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
>>     at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>     at
>> org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:951)
>> <<<<<
>>
>> This error occurs because you are trying to index a file on Windows that
>> is open by an application.  If you do this kind of thing, ManifoldCF will
>> requeue the document and will try it again later -- say, in 5 minutes, and
>> keep retrying it for many hours before it gives up.
>>
>> I suspect that you are not seeing "hangs", but rather situations where
>> MCF is simply waiting for a problem to resolve.
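The underlying pattern is easy to see in a small sketch. This is not MCF's actual code, just an illustration of treating an OS-level read failure (such as a locked file on Windows) as a retryable condition rather than a fatal error:

```python
import os
import tempfile

def try_read(path):
    """Read a file's bytes, or return None when the OS refuses access
    (e.g. another process holds a lock on part of the file on Windows).
    A crawler would treat None as "requeue and retry later"."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        # Locked, missing, or otherwise unreadable: signal "retry later".
        return None

# Demo: a readable file yields its content; an unreadable path yields None.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    name = tmp.name
print(try_read(name))            # b'hello'
print(try_read(name + ".gone"))  # None
os.unlink(name)
```
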
>>
>> Karl
>>
>>
>>
>> On Fri, Jul 18, 2014 at 11:27 AM, Ameya Aware <ameya.aware@gmail.com>
>> wrote:
>>
>>> Attaching log file
>>>
>>>
>>> On Fri, Jul 18, 2014 at 11:15 AM, Karl Wright <daddywri@gmail.com>
>>> wrote:
>>>
>>>> Also, please send the file logs/manifoldcf.log as well -- as a text
>>>> file.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, Jul 18, 2014 at 11:12 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Could you please get a thread dump and send it to me?  Please send it
>>>>> as a text file, not a screenshot.
>>>>>
>>>>> To get a thread dump, get the process ID of the agents process, and
>>>>> use the JDK's jstack utility to obtain the dump.
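For example (assuming a JDK on the PATH; the pattern used here to find the agents process is a guess and may need adjusting for your installation):

```
# 1. Find the PID of the MCF agents JVM. jps -l lists running JVMs with
#    their main-class names; adjust the grep pattern to match yours.
PID=$(jps -l | grep -i agents | awk '{print $1}')

# 2. Dump all thread stacks to a text file you can attach to an email.
jstack "$PID" > threaddump.txt
```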
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 18, 2014 at 11:08 AM, Ameya Aware <ameya.aware@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Yeah, I thought so; it should not have an effect with 4000 documents.
>>>>>>
>>>>>> I am using the file system connector to crawl all of my C drive, and
>>>>>> the output connection is null.
>>>>>>
>>>>>> There are no error logs in MCF. MCF has been stuck at the same screen
>>>>>> for half an hour.
>>>>>>
>>>>>> Attaching some snapshots for your reference.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ameya
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 18, 2014 at 11:02 AM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ameya,
>>>>>>>
>>>>>>> 4000 documents is nothing at all.  We have load tests which I run on
>>>>>>> every release that include more than 100000 documents on a crawl.
>>>>>>>
>>>>>>> Can you be more specific about the case where you say it "hung up"?
>>>>>>> Specifically:
>>>>>>>
>>>>>>> (1) What kind of crawl is this?  SharePoint?  Web?
>>>>>>> (2) Are there any errors in the manifoldcf log?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 18, 2014 at 10:59 AM, Ameya Aware <ameya.aware@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> I spent some time going through the PostgreSQL 9.3 manual.
>>>>>>>> I configured PostgreSQL for MCF and saw a significant change in
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> I ran it yesterday for some 4000 documents. When I started running
>>>>>>>> again today, the performance was very poor, and after 200 documents
>>>>>>>> it hung up.
>>>>>>>>
>>>>>>>> Is it because of the periodic maintenance it needs?  Also, I would
>>>>>>>> like to know where and how exactly the VACUUM FULL command needs to
>>>>>>>> be used.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ameya
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 17, 2014 at 2:13 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> It is fine; I am running Postgresql 9.3 here.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 17, 2014 at 2:08 PM, Ameya Aware <
>>>>>>>>> ameya.aware@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Is PostgreSQL version 9.3 good? Because I already have it on my
>>>>>>>>>> machine, though the documentation says "ManifoldCF has been tested
>>>>>>>>>> against version 8.3.7, 8.4.5 and 9.1 of PostgreSQL."
>>>>>>>>>>
>>>>>>>>>> Ameya
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 17, 2014 at 1:09 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> If you haven't configured MCF to use PostgreSQL, then you are
>>>>>>>>>>> using Derby, which is not recommended for production use.
>>>>>>>>>>>
>>>>>>>>>>> Instructions on how to set up MCF to use PostgreSQL are
>>>>>>>>>>> available on the MCF site on the how-to-build-and-deploy page.
>>>>>>>>>>> Configuring PostgreSQL for millions or tens of millions of
>>>>>>>>>>> documents will require someone to learn about PostgreSQL and how
>>>>>>>>>>> to administer it.  The how-to-build-and-deploy page provides some
>>>>>>>>>>> (old) guidelines and hints, but if I were you I'd read the
>>>>>>>>>>> PostgreSQL manual for the version you install.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 17, 2014 at 1:04 PM, Ameya Aware <
>>>>>>>>>> ameya.aware@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ooh ok.
>>>>>>>>>>>>
>>>>>>>>>>>> Actually I have never configured PostgreSQL yet. I am simply
>>>>>>>>>>>> using the binary distribution of MCF to configure file system
>>>>>>>>>>>> connectors to connect to Solr.
>>>>>>>>>>>>
>>>>>>>>>>>> Do I need to configure PostgreSQL? How can I proceed from here
>>>>>>>>>>>> to check performance measurements?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ameya
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 17, 2014 at 12:10 PM, Karl Wright <
>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yes.  Also have a look at the how-to-build-and-deploy page
>>>>>>>>>>>>> for hints on how to configure PostgreSQL for maximum
>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ManifoldCF's performance is almost entirely based on the
>>>>>>>>>>>>> database.  If you are using PostgreSQL, which is the fastest
>>>>>>>>>>>>> ManifoldCF choice, you should be able to see in the logs when
>>>>>>>>>>>>> queries take a long time, or when indexes are automatically
>>>>>>>>>>>>> rebuilt.  Could you provide any information as to what your
>>>>>>>>>>>>> overall system setup looks like?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 11:32 AM, Ameya Aware <
>>>>>>>>>>>>> ameya.aware@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This page?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 11:28 AM, Karl Wright <
>>>>>>>>>>>>>> daddywri@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ameya,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Have you read the performance page?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sent from my Windows Phone
>>>>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>>>>> From: Ameya Aware
>>>>>>>>>>>>>>> Sent: 7/17/2014 11:27 AM
>>>>>>>>>>>>>>> To: user@manifoldcf.apache.org
>>>>>>>>>>>>>>> Subject: Performance issues
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have millions of documents to crawl and send to Solr.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But when I run it for thousands of documents, it takes too
>>>>>>>>>>>>>>> much time, or sometimes it even hangs up.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So what could be a way to reduce the crawl time?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, I do not need the content of the documents, I just need
>>>>>>>>>>>>>>> the metadata. So can I skip the content part of reading and
>>>>>>>>>>>>>>> fetching, and will that improve performance?
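A generic sketch of the metadata-only idea (not the file system connector's actual code) would be: collect stat-level attributes without ever opening any file, since reading content is where most of the I/O cost is.

```python
import os
import stat

def file_metadata(path):
    """Collect basic metadata via os.stat without reading file content."""
    st = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "size": st.st_size,            # size in bytes
        "modified": st.st_mtime,       # last-modified time, epoch seconds
        "is_dir": stat.S_ISDIR(st.st_mode),
    }

def crawl_metadata(root):
    """Walk a directory tree yielding metadata records; never calls open()."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield file_metadata(os.path.join(dirpath, name))
```
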
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
