manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: How to index files, not folders
Date Sat, 09 Nov 2013 13:02:21 GMT
Hi Ronny,

You aren't indexing any folders.  But you must process them in order to
find the files and subfolders that are in them.
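To illustrate the distinction, here is a minimal sketch in Python. It is not ManifoldCF's crawler code; the `crawl` function and the in-memory `drive` tree are hypothetical and exist only to show that folders are traversed to discover their contents while only matching files end up in the indexed list.

```python
from fnmatch import fnmatch


def crawl(tree, patterns, path=""):
    """Return (indexed_files, folders_processed) for a nested-dict tree.

    Hypothetical model: a dict value stands for a folder, None for a file.
    """
    indexed, folders = [], 1  # count the folder we are currently in
    for name, child in sorted(tree.items()):
        full = f"{path}/{name}"
        if isinstance(child, dict):
            # Folders are only traversed to discover what they contain.
            sub_indexed, sub_folders = crawl(child, patterns, full)
            indexed += sub_indexed
            folders += sub_folders
        elif any(fnmatch(name.lower(), p) for p in patterns):
            # Only files matching a pattern are handed to the index.
            indexed.append(full)
    return indexed, folders


drive = {
    "reports": {"q1.doc": None, "q1.xls": None, "old": {"q0.docx": None}},
    "readme.txt": None,
}
indexed, folders = crawl(drive, ["*.doc*"])
print(indexed)  # files that would be indexed
print(folders)  # folders that had to be processed to find them
```

In this toy tree, three folders are processed but only the two *.doc* files are collected for indexing.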

Karl

Sent from my Windows Phone
------------------------------
From: Ronny Heylen
Sent: 11/9/2013 7:50 AM
To: user@manifoldcf.apache.org
Subject: How to index files, not folders

Hi,
Indexing all indexable files on our Windows drive fails with various
problems.
Several of these were solved with help from the list (thanks for that), but
we still have at least one: the missing class in commons-compress. Using the
jar from commons-compress 1.6 did not help.
This introduction is just to explain our approach for getting most of the
interesting files indexed and for "easily" identifying where the problems
are: we have one job for all *.doc*, one for all *.xls*, ...
We observe that on the drive we have:
84000 *.doc* files
172000 *.xls* files
161000 folders
If we specify only *.doc* as indexable files, the job finds nothing; we have
to specify indexable files *.doc* and folders *.
Then the job processes 245000 documents (= number of *.doc* files + number
of folders).
The same for *.xls* gives a job of 333000 documents.
If we define a single job with *.doc* + *.xls* + folders, we get a job of
417000 documents.
So we suppose that with two jobs (one for doc and one for xls) the folders
are indexed twice.
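The arithmetic behind these job sizes can be checked with a short Python sketch. The variable names are only illustrative; the counts are the ones listed above.

```python
# Counts observed on the drive.
doc_files = 84_000
xls_files = 172_000
folders = 161_000

# Each job also counts every folder it has to walk through.
doc_job = doc_files + folders
xls_job = xls_files + folders
combined_job = doc_files + xls_files + folders

# Running two separate jobs counts the folders one extra time.
extra = doc_job + xls_job - combined_job

print(doc_job, xls_job, combined_job, extra)
```

The difference between the two separate jobs and the combined job is exactly the folder count, which matches the assumption that each job walks every folder once.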
The question is: how can we avoid indexing folders?
Perhaps there is another way to define the paths in the rule set so that
folders are not indexed? If so, how?
Thanks,
