manifoldcf-user mailing list archives

From Scott Schneider <>
Subject Re: Slow performance with a basic setup
Date Wed, 28 Mar 2012 18:09:22 GMT
Ah, thanks!  I set up PostgreSQL in my previous installation, but
missed it this time.


On Wed, Mar 28, 2012 at 11:06 AM, Karl Wright <> wrote:
> Now it sounds like you are running into known problems with Apache
> Derby.  That is why we suggest using PostgreSQL rather than Derby for
> any kind of real world crawling.  Derby is super convenient but it has
> problems handling deadlocks properly.
> You can also use HSQLDB if you prefer an integrated solution, but
> PostgreSQL is faster.
> I suggest you look at
> to get an idea what all this is about, and also don't forget to look
> at how-to-build-and-deploy.html for a general idea how to set up both
> single-process and multi-process installations that use PostgreSQL.
> Thanks,
> Karl
> On Wed, Mar 28, 2012 at 1:56 PM, Scott Schneider <> wrote:
>> Thanks for the quick response!  I had been using all the default
>> settings.  Once I deleted the bandwidth throttling, one phase of the
>> job goes much faster.  The # active documents goes from 0 to the total
>> in just a minute or two.  The overall time seems to be shorter, but it
>> still takes about an hour to process ~600 files totaling ~800 KB.  I
>> also increased the max connections to 50 on the web, null, and Solr
>> connections and changed Solr to commit within 30,000 msec rather than
>> at the end of every job.  That does not seem to have made a
>> difference.
>> Actually, I have no idea what state ManifoldCF is in right now.  I hit
>> restart a few hours ago and the status still says "Restarting".  There
>> is nothing in the command windows where I started ManifoldCF or Solr
>> or in the ManifoldCF log file.  The Solr command window does list
>> ManifoldCFSecurityFilter a few times.
>> Scott
>> On Tue, Mar 27, 2012 at 5:37 PM, Karl Wright <> wrote:
>>> Let's start with some basics.
>>> First of all, how many web connections do you have configured?  What
>>> do you have for throttling?  If you have not modified the default
>>> settings for throttling and are pulling a number of documents off of
>>> ONE server, then throttling is probably severely limiting your crawl
>>> speed.
>>> Karl
>>> On Tue, Mar 27, 2012 at 6:24 PM, Scott Schneider <> wrote:
>>>> Hi all,
>>>> I have a pretty simple ManifoldCF setup, but I'm getting very slow
>>>> performance.  Can someone help me understand and/or fix this?
>>>> My input is a web connector that goes to an Apache HTTP server running
>>>> on the local machine, serving static text files.  I have a null
>>>> authority service.  I output to Solr, also running locally.
>>>> The data I'm crawling is ~20 MB total in ~8,500 small files.  I started
>>>> the job one afternoon, and the next morning it was still not finished!  It
>>>> had only processed ~2,500 documents.  Strangely, it listed ~10,000
>>>> total documents (and ~7,500 active).
>>>> My ultimate goal is to figure out how much space the Solr index takes
>>>> as I add more access tokens.  That's why I'm using the web connector
>>>> and null authority, rather than just using a file system connector.
>>>> Thanks,
>>>> Scott
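
For a rough sense of scale, the figures reported in the thread can be turned into a throughput estimate. This is only a back-of-the-envelope sketch: the function names are hypothetical, "about an hour" is taken as exactly 1.0 hour, and applying the rate measured on the ~600-file run to the ~8,500-file job is an extrapolation, not something measured in the thread.

```python
def docs_per_minute(docs: int, hours: float) -> float:
    """Average number of documents processed per minute."""
    return docs / (hours * 60.0)


def hours_to_crawl(total_docs: int, rate_per_minute: float) -> float:
    """Hours needed to process total_docs at a steady per-minute rate."""
    return total_docs / rate_per_minute / 60.0


# Reported in the thread: ~600 files took about an hour (assumed 1.0 h here).
observed_rate = docs_per_minute(600, 1.0)

# Extrapolation (an assumption): the same rate applied to the ~8,500-file job.
estimated_hours = hours_to_crawl(8500, observed_rate)

print(f"observed rate: {observed_rate:.1f} docs/min")
print(f"estimated time for 8,500 docs: {estimated_hours:.1f} hours")
```

At roughly 10 documents per minute, each small static file takes about six seconds. With every service running locally, that points at throttling or database contention rather than raw network or disk speed.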
