manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: Pushing extra items into an index (outside normal crawl job)
Date Fri, 16 Aug 2013 11:34:06 GMT
It sounds like it would be better if you keep the number of directories
small.  I agree.

Karl

Sent from my Windows Phone
------------------------------
From: Adrian Conlon
Sent: 8/16/2013 6:48 AM
To: user@manifoldcf.apache.org
Subject: RE: Pushing extra items into an index (outside normal crawl job)

Hi Karl,



I guessed the discovery of documents was causing the issue.  I note that on
my test system (an Amazon EC2 m1.medium instance), I’m pretty much CPU
bound with the “agents” process taking up typically 60-70% CPU during a
crawl…



We’ve already got a Windows desktop application that files emails onto
Windows shares, so my current thought is that I notify a web service (that
I create) with change requests, bundle these up into job definitions (say,
once every five minutes) and see how that goes.  If I create a new job for
each directory, with a set of file patterns matching the changed files,
then the job should (hopefully) have much less work to do than a full deep
hierarchy scan.
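
A rough sketch of the batching step described above, in Python.  It only groups changed file paths by their parent directory and emits one plain dict per directory.  The function name and the dict keys ("directory", "include_patterns") are hypothetical and are not ManifoldCF API; turning each dict into a real job definition is left out.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def build_job_definitions(changed_paths):
    """Group changed file paths by parent directory.

    Returns one hypothetical job definition (a plain dict) per directory,
    with the changed file names as its include patterns.
    """
    by_directory = defaultdict(set)
    for raw in changed_paths:
        path = PureWindowsPath(raw)
        # A set drops duplicate notifications for the same file.
        by_directory[str(path.parent)].add(path.name)

    return [
        {"directory": directory, "include_patterns": sorted(names)}
        for directory, names in sorted(by_directory.items())
    ]
```

Calling this once per five-minute window with the collected change requests would give one small job per touched directory, rather than one job over the whole hierarchy.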



Adrian





*From:* Karl Wright [mailto:daddywri@gmail.com]
*Sent:* 16 August 2013 11:17
*To:* Adrian Conlon; user@manifoldcf.apache.org
*Subject:* RE: Pushing extra items into an index (outside normal crawl job)



Hi Adrian,

Jcifs is one of the connectors that must find changes mainly by discovery,
so it does not do much better with a minimal crawl.  A higher priority
won't help much either for similar reasons.  How do you propose to quickly
find the changed documents?  That is where the problem lies.

Karl

Sent from my Windows Phone
  ------------------------------

*From: *Adrian Conlon
*Sent: *8/16/2013 5:53 AM
*To: *user@manifoldcf.apache.org
*Subject: *RE: Pushing extra items into an index (outside normal crawl job)

Thanks Karl,



That’s an interesting thought.  So if I’ve understood what you’re saying
correctly, I could create a temporary job, set the priority to one, start
it, and that’s it?  Individual job queues are effectively handled
separately?  There might be a number of temporary jobs on the go at any
time, I guess, since they couldn’t be deleted until the job has finished.
Do you think that would be an issue?  In any event, that’s given me food
for thought, so I’ll take a look on that basis.
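
To illustrate the clean-up concern above: a minimal Python sketch that splits a list of temporary job ids into those that can be deleted and those still running.  The `is_finished` callable is an assumption standing in for whatever status lookup is actually available; nothing here is ManifoldCF API.

```python
def split_finished(temporary_jobs, is_finished):
    """Partition temporary job ids into (deletable, still_running).

    `is_finished` is a hypothetical callable that returns True once the
    job with the given id has completed.
    """
    deletable, still_running = [], []
    for job_id in temporary_jobs:
        if is_finished(job_id):
            deletable.append(job_id)
        else:
            still_running.append(job_id)
    return deletable, still_running
```

Run periodically, this keeps the set of outstanding temporary jobs bounded by however many are genuinely still in progress.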



With regard to your second thought, I had high hopes for minimal job runs,
but they seem to take almost as long as a full job run.  I haven’t really
sat down and worked out timings, but a speed up of about 10% on a
reasonably sized (400,000 documents or so) JCIFS repository was all I saw.
Is that what you’d expect?



Adrian



*From:* Karl Wright [mailto:daddywri@gmail.com]
*Sent:* 15 August 2013 18:39
*To:* Adrian Conlon; user@manifoldcf.apache.org
*Subject:* RE: Pushing extra items into an index (outside normal crawl job)



Hi Adrian,

There is already a concept of job priority.  It is on a scale of one to
ten; the default value is 5.

Your second idea is also somewhat similar to a "minimal" job run.  Might
want to look into that as well.  Depending on your connector these two
constructs together might well work for you.

Karl

Sent from my Windows Phone
  ------------------------------

*From: *Adrian Conlon
*Sent: *8/15/2013 12:18 PM
*To: *user@manifoldcf.apache.org
*Subject: *Pushing extra items into an index (outside normal crawl job)

Hi All,



I’ve been asked to consider adding items to an index outside normal
repository crawl job processing (e.g. to reduce the latency between a
document being added to a repository and becoming available in the index).



My initial thoughts on this are that this doesn’t really fit in with the
current ManifoldCF architecture.



With that in mind, I’ve come up with a couple of ideas (neither tested, nor
thought through!) that I’d like to run past the list to see whether they:



a)      Have the possibility of being reasonable

b)      Might be something that could be passed back into the ManifoldCF
project (perhaps as a contrib)



Idea one (probably the most work, but perhaps architecturally most clean):



1)      Introduce the idea of priority into ManifoldCF queues

2)      Add an extra “mcf” web service that allows queue injection



Idea two (easiest, if it works, but quite “hacky”):



1)      Add a web service that uses some “mcf” code to send documents
directly to the output connector

2)      Obviously, this can’t go through the ManifoldCF queues

3)      Relies upon a normal mcf job to tidy up any anomalies that might
have occurred (deleting and re-ingesting would be fine, I think)



How do these sound?  Are they worth thinking about?  Or indeed (better
yet!), is there a better way I haven’t thought of…?



Thanks,



Adrian

