manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Performance issues
Date Fri, 18 Jul 2014 16:30:21 GMT
Hi Ameya,

The log file is full of errors of all sorts.  For example:

>>>>>
 WARN 2014-07-17 17:32:38,709 (Worker thread '41') - IO exception during indexing file:/C:/Program%20Files/eclipse/configuration/org.eclipse.osgi/.manager/.tmp2043698995563843992.instance: The process cannot access the file because another process has locked a portion of the file
java.io.IOException: The process cannot access the file because another process has locked a portion of the file
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(Unknown Source)
    at org.apache.http.entity.mime.content.InputStreamBody.writeTo(InputStreamBody.java:91)
    at org.apache.manifoldcf.agents.output.solr.ModifiedHttpMultipart.doWriteTo(ModifiedHttpMultipart.java:211)
    at org.apache.manifoldcf.agents.output.solr.ModifiedHttpMultipart.writeTo(ModifiedHttpMultipart.java:229)
    at org.apache.manifoldcf.agents.output.solr.ModifiedMultipartEntity.writeTo(ModifiedMultipartEntity.java:187)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.http.impl.execchain.RequestEntityExecHandler.invoke(RequestEntityExecHandler.java:77)
    at com.sun.proxy.$Proxy0.writeTo(Unknown Source)
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:155)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.http.impl.conn.CPoolProxy.invoke(CPoolProxy.java:138)
    at com.sun.proxy.$Proxy1.sendRequestEntity(Unknown Source)
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:292)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:951)
<<<<<

This error occurs because you are trying to index a file on Windows that
another application has open and locked.  When this happens, ManifoldCF
requeues the document and tries it again later -- say, in 5 minutes -- and
keeps retrying it for many hours before it gives up.

I suspect that you are not seeing "hangs", but rather situations where MCF
is simply waiting for a problem to resolve.
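
To make the retry behavior above concrete, here is a minimal sketch in plain
Java.  It is NOT ManifoldCF's actual requeue code; the class name RetrySketch,
the method readWithRetry, and the attempt count and delay values are all
made up for illustration.  It only shows why a locked file looks like a
"hang": the reader sleeps between failed attempts instead of failing fast.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; this is not ManifoldCF's requeue logic.
final class RetrySketch {

    /**
     * Tries to read a file, sleeping between attempts whenever an
     * IOException (for example, a Windows file-lock error) is thrown.
     * Returns null once maxAttempts reads have all failed.
     */
    static byte[] readWithRetry(Path file, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Files.readAllBytes(file);
            } catch (IOException e) {
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    // From the outside, this wait looks like nothing is happening.
                    Thread.sleep(delayMillis);
                }
            }
        }
        return null; // give up after the last attempt
    }

    public static void main(String[] args) throws Exception {
        // Demo with a temporary file that is not locked, so the first read succeeds.
        Path temp = Files.createTempFile("retry-sketch", ".txt");
        Files.write(temp, "hello".getBytes());
        byte[] data = readWithRetry(temp, 3, 100L);
        System.out.println(data == null ? "gave up" : "read " + data.length + " bytes");
        Files.delete(temp);
    }
}
```

With a delay of minutes rather than milliseconds and many attempts, a handful
of locked files is enough to keep a job looking idle for a long time.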

Karl



On Fri, Jul 18, 2014 at 11:27 AM, Ameya Aware <ameya.aware@gmail.com> wrote:

> Attaching log file
>
>
> On Fri, Jul 18, 2014 at 11:15 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Also, please send the file logs/manifoldcf.log as well -- as a text file.
>>
>> Karl
>>
>>
>> On Fri, Jul 18, 2014 at 11:12 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Could you please get a thread dump and send that to me?  Please send as
>>> a text file not a screen shot.
>>>
>>> To get a thread dump, get the process ID of the agents process, and use
>>> the jdk's jstack utility to obtain the dump.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Fri, Jul 18, 2014 at 11:08 AM, Ameya Aware <ameya.aware@gmail.com>
>>> wrote:
>>>
>>>> Yeah, I also thought 4000 documents should not be a problem.
>>>>
>>>> I am using the filesystem connector to crawl all of my C drive, and
>>>> the output connection is null.
>>>>
>>>> There are no errors logged in MCF. MCF has been at a standstill on
>>>> the same screen for half an hour.
>>>>
>>>> Attaching some snapshots for your reference.
>>>>
>>>>
>>>> Thanks,
>>>> Ameya
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jul 18, 2014 at 11:02 AM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ameya,
>>>>>
>>>>> 4000 documents is nothing at all.  We have load tests which I run on
>>>>> every release that include more than 100000 documents on a crawl.
>>>>>
>>>>> Can you be more specific about the case that you say "hung up"?
>>>>> Specifically:
>>>>>
>>>>> (1) What kind of crawl is this?  SharePoint?  Web?
>>>>> (2) Are there any errors in the manifoldcf log?
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 18, 2014 at 10:59 AM, Ameya Aware <ameya.aware@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> I spent some time going through the PostgreSQL 9.3 manual.
>>>>>> I configured PostgreSQL for MCF and saw a significant improvement
>>>>>> in performance.
>>>>>>
>>>>>> I ran it yesterday for some 4000 documents. When I started running
>>>>>> it again today, the performance was very poor, and after 200
>>>>>> documents it hung.
>>>>>>
>>>>>> Is it because of the periodic maintenance it needs?  Also, I would
>>>>>> like to know where and how exactly the VACUUM FULL command needs to
>>>>>> be used.
>>>>>>
>>>>>> Thanks,
>>>>>> Ameya
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 17, 2014 at 2:13 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It is fine; I am running Postgresql 9.3 here.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 17, 2014 at 2:08 PM, Ameya Aware <ameya.aware@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Is PostgreSQL version 9.3 OK? I already have it on my
>>>>>>>> machine, though the documentation says "ManifoldCF has been
>>>>>>>> tested against version 8.3.7, 8.4.5 and 9.1 of PostgreSQL."
>>>>>>>>
>>>>>>>> Ameya
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 17, 2014 at 1:09 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> If you haven't configured MCF to use PostgreSQL, then you are
>>>>>>>>> using Derby, which is not recommended for production use.
>>>>>>>>>
>>>>>>>>> Instructions on how to set up MCF to use PostgreSQL are available
>>>>>>>>> on the MCF site on the how-to-build-and-deploy page.  Configuring
>>>>>>>>> PostgreSQL for millions or tens of millions of documents will
>>>>>>>>> require someone to learn about PostgreSQL and how to administer
>>>>>>>>> it.  The how-to-build-and-deploy page provides some (old)
>>>>>>>>> guidelines and hints, but if I were you I'd read the postgresql
>>>>>>>>> manual for the version you install.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 17, 2014 at 1:04 PM, Ameya Aware <
>>>>>>>>> ameya.aware@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ooh ok.
>>>>>>>>>>
>>>>>>>>>> Actually I have not configured PostgreSQL yet. I am simply
>>>>>>>>>> using the binary distribution of MCF to configure file system
>>>>>>>>>> connectors to connect to Solr.
>>>>>>>>>>
>>>>>>>>>> Do I need to configure PostgreSQL? How can I proceed from here
>>>>>>>>>> to check performance measurements?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ameya
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 17, 2014 at 12:10 PM, Karl Wright
>>>>>>>>>> <daddywri@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes.  Also have a look at the how-to-build-and-deploy page for
>>>>>>>>>>> hints on how to configure PostgreSQL for maximum performance.
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF's performance is almost entirely based on the
>>>>>>>>>>> database.  If you are using PostgreSQL, which is the fastest
>>>>>>>>>>> ManifoldCF choice, you should be able to see in the logs when
>>>>>>>>>>> queries take a long time, or when indexes are automatically
>>>>>>>>>>> rebuilt.  Could you provide any information as to what your
>>>>>>>>>>> overall system setup looks like?
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 17, 2014 at 11:32 AM, Ameya Aware
>>>>>>>>>>> <ameya.aware@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> http://manifoldcf.apache.org/release/trunk/en_US/performance-tuning.html
>>>>>>>>>>>>
>>>>>>>>>>>> This page?
>>>>>>>>>>>>
>>>>>>>>>>>> Ameya
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 17, 2014 at 11:28 AM, Karl Wright
>>>>>>>>>>>> <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ameya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Have you read the performance page?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from my Windows Phone
>>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>>> From: Ameya Aware
>>>>>>>>>>>>> Sent: 7/17/2014 11:27 AM
>>>>>>>>>>>>> To: user@manifoldcf.apache.org
>>>>>>>>>>>>> Subject: Performance issues
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have millions of documents to crawl and send to Solr.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But when I run it for thousands of documents, it takes too
>>>>>>>>>>>>> much time, or sometimes it even hangs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So what could be the way to reduce the crawl time?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, I do not need the content of the documents, I just need
>>>>>>>>>>>>> the metadata, so can I skip reading and fetching the content,
>>>>>>>>>>>>> and will that improve performance?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ameya
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
