manifoldcf-user mailing list archives

From jetnet <>
Subject Efficient delta (incremental) indexing
Date Wed, 03 Aug 2016 10:15:53 GMT
Hi All,

I’m trying to find a way to reduce the time spent on incremental runs of
the crawler (HTTP, file system, file share) by creating a list of changed
files (created, modified, and deleted).
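
To illustrate the kind of list meant here, a rough sketch in Python follows. It is not part of ManifoldCF; the function names and the snapshot format (a dict of path to modification time) are assumptions made for this example only. It compares two snapshots of a directory tree and reports what was created, modified, or deleted between them:

```python
import os


def take_snapshot(root):
    """Walk a directory tree and record {path: mtime} for every file.

    Hypothetical helper for this example; a real setup might read the
    change list from a file server's audit log instead.
    """
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                snapshot[path] = os.stat(path).st_mtime
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
    return snapshot


def diff_snapshots(previous, current):
    """Compare two {path: mtime} snapshots and build the modification list."""
    created = sorted(p for p in current if p not in previous)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    deleted = sorted(p for p in previous if p not in current)
    return {"created": created, "modified": modified, "deleted": deleted}
```

The snapshot from the previous run would be stored somewhere and passed in as `previous` on the next run; only the resulting three lists would then need to reach the crawler.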

The challenge is how to supply the crawler with such a list.

There are great interfaces (the JSON API and the scripting language) that
could be used for that, but:

1) no deletion command is sent to the index for not-found entries (deleted
files) on the modification list if the crawler hasn’t indexed those files
before

2a) re-using one “incremental” job: the crawler would delete previously
indexed documents if they no longer appear on the modification list

2b) re-creating the “incremental” job every time: the crawler would delete
ALL previously indexed docs from the index when the job gets deleted

So currently I see no way to do incremental indexing based on a
modification list without extending the functionality of the framework.
Or have I missed something, and there are features I’m not aware of?




