manifoldcf-user mailing list archives

From Mark Libucha
Subject FileSystem connector path issue
Date Tue, 19 Nov 2013 21:17:48 GMT
Noticed this problem while crawling a web site and saving to the file
system with the FileSystem output connector.

Let's say the website defines a URL like this:


That URI actually gets mapped to a file on the web server, say
http://mysite/news/index.html, but the http://mysite/news URI does exist
and gets sent as the documentURI to addOrReplaceDocument().

MCF's FileSystem connector gets the http://mysite/news URL and saves that
content under a path that looks like this: http/mysite/news, where http and
mysite are directories and news is a file.

But then if the site also defines a URL like this
http://mysite/news/local/today.html, MCF's FileSystem connector fails
trying to create the directory http/mysite/news/local because part of it,
http/mysite/news, already exists as a file.

Of course, if the URIs are crawled in the reverse order, the file can't be
created because a directory already exists with that name.
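
The collision can be reproduced outside MCF with plain java.nio. This is only a minimal sketch, not the connector's actual code: the class name, toLocalPath(), trySave() and runBothOrders() are hypothetical, and the URI-to-path mapping is assumed from the http/mysite/news layout described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class PathCollisionDemo {
    // Assumed mapping: http://mysite/news -> <root>/http/mysite/news
    static Path toLocalPath(Path root, String documentURI) {
        URI uri = URI.create(documentURI);
        Path p = root.resolve(uri.getScheme()).resolve(uri.getHost());
        for (String segment : uri.getPath().split("/")) {
            if (!segment.isEmpty()) {
                p = p.resolve(segment);
            }
        }
        return p;
    }

    // Returns true if the document was written, false on any I/O failure.
    static boolean trySave(Path root, String documentURI) {
        Path target = toLocalPath(root, documentURI);
        try {
            // Fails if some ancestor of the target already exists as a file.
            Files.createDirectories(target.getParent());
            // Fails if the target name already exists as a directory.
            Files.write(target, new byte[0]);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Saves the two URIs in both crawl orders, each under a fresh temp root.
    static boolean[] runBothOrders() {
        String shortUri = "http://mysite/news";
        String longUri = "http://mysite/news/local/today.html";
        try {
            Path a = Files.createTempDirectory("fs-demo");
            boolean a1 = trySave(a, shortUri);
            boolean a2 = trySave(a, longUri);
            Path b = Files.createTempDirectory("fs-demo");
            boolean b1 = trySave(b, longUri);
            boolean b2 = trySave(b, shortUri);
            return new boolean[] {a1, a2, b1, b2};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints [true, false, true, false]
        System.out.println(Arrays.toString(runBothOrders()));
    }
}
```

In either order the first URI saves and the second one fails. The sketch reports the failure as a boolean purely for illustration.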

Make sense?

The real killer is that when this happens, it's fatal to the job. That is,
the connector doesn't just fail to save that one URL; it returns a fatal
error and the crawl is stopped.

