Hi Mark,

Yes, but I'm afraid we *can't* emulate the redirect behavior, because that's an upstream connector choice. wget can operate in a mode where it uses the pre-redirect URL and resolves conflicts nonetheless. How does it do that?


On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <mlibucha@gmail.com> wrote:
wget -x uses the redirect URL as the basis for the path it creates.

So, if http://mysite/news returns a 302 redirecting to http://mysite/news/index.html, wget saves as:

mysite/news/index.html

MCF, on the other hand, saves as:

http/mysite/news
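To make the difference concrete, here's a rough sketch of the two path-mapping strategies. This is not wget's or MCF's actual code; url_to_path is a made-up helper, and MCF's real paths also prefix the scheme (http/), which is omitted here:

```python
from urllib.parse import urlparse

def url_to_path(url):
    # Build a save path from a URL the way "wget -x" does:
    # hostname directory plus the URL's path components.
    p = urlparse(url)
    return p.netloc + p.path

original_url = "http://mysite/news"             # pre-redirect URL
redirect_url = "http://mysite/news/index.html"  # 302 target

# wget -x bases the path on the redirect URL, so "news" is a directory ...
print(url_to_path(redirect_url))  # mysite/news/index.html

# ... while MCF bases it on the original URL, so "news" becomes a file.
print(url_to_path(original_url))  # mysite/news
```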

On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Mark,

The filesystem connector is supposed to emulate WGET behavior.  What does WGET do in this case?


On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <mlibucha@gmail.com> wrote:
Noticed this problem while crawling a web site and saving to the file system with the FileSystem output connector.

Let's say the website defines a URL like this:

http://mysite/news

That URI actually gets mapped to a file on the web server, say http://mysite/news/index.html, but the http://mysite/news URI does exist and gets sent as the documentURI to addOrReplaceDocument().

MCF's FileSystem connector gets the http://mysite/news URL and creates a path for saving that content that looks like http/mysite/news, where news is a file.

But then if the site also defines a URL like this http://mysite/news/local/today.html, MCF's FileSystem connector fails trying to create the directory http/mysite/news/local because part of it, http/mysite/news, already exists as a file.

Of course, if the URIs are crawled in the reverse order, the file can't be created because a directory already exists with that name.
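The collision is easy to reproduce outside MCF. Here's a minimal sketch (using the hypothetical paths from the example above, not MCF code) of the first ordering: the parent URL is saved as a file, and creating the child URL's directory then fails:

```python
import os
import tempfile

root = tempfile.mkdtemp()

# First URL crawled: http://mysite/news is stored as a *file*
# at http/mysite/news.
news_file = os.path.join(root, "http", "mysite", "news")
os.makedirs(os.path.dirname(news_file))
with open(news_file, "w") as f:
    f.write("body of http://mysite/news")

# Second URL: http://mysite/news/local/today.html needs
# http/mysite/news to be a *directory*, so makedirs raises OSError.
collided = False
try:
    os.makedirs(os.path.join(root, "http", "mysite", "news", "local"))
except OSError:
    collided = True

print("collision:", collided)  # collision: True
```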

Make sense?

The real killer is that when this happens, it's fatal to the job. That is, it doesn't just fail to get that one URL; the connector returns a fatal error and the whole crawl stops.