wget -x uses the redirect URL as the basis for the path it creates.

So, if http://mysite/news returns a 302 redirecting to http://mysite/news/index.html, wget saves as:

mysite/news/index.html

MCF, on the other hand, saves as:

Mark
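To make the difference concrete, here is a minimal sketch of the two path mappings. It is not the connector's code or wget's: the function names and the `redirects` table (standing in for the 302 response) are hypothetical, and the scheme/host/path layout is taken from the example in the original report quoted below.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the server's 302 response in the example.
redirects = {"http://mysite/news": "http://mysite/news/index.html"}


def redirect_based_path(uri: str) -> str:
    """Build the local path from the post-redirect URL: host + URL path."""
    final_url = redirects.get(uri, uri)
    parts = urlsplit(final_url)
    path = parts.path or "/"
    # A URL ending in "/" names a directory, so a default file name is
    # needed; index.html is assumed here.
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


def uri_as_given_path(uri: str) -> str:
    """Build the local path from the URI exactly as it was handed over:
    scheme/host/path segments, with the last segment treated as a file."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.scheme, parts.netloc] + segments)
```

With these assumptions, `redirect_based_path("http://mysite/news")` gives `mysite/news/index.html`, so `news` ends up as a directory, while `uri_as_given_path("http://mysite/news")` gives `http/mysite/news`, where `news` is a file.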
On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <firstname.lastname@example.org> wrote:
Hi Mark,

The filesystem connector is supposed to emulate WGET behavior. What does WGET do in this case?

Karl

On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <email@example.com> wrote:
Noticed this problem while crawling a web site and saving to the file system with the FileSystem output connector.

Let's say the website defines a URL like this:

http://mysite/news

That URI actually gets mapped to a file on the web server, say http://mysite/news/index.html, but the http://mysite/news URI does exist and gets sent as the documentURI to addOrReplaceDocument().

MCF's FileSystem connector gets the http://mysite/news URL and creates a directory for saving that content that looks like this: http/mysite/news, where news is a file.

But then if the site also defines a URL like this: http://mysite/news/local/today.html, MCF's FileSystem connector fails trying to create the directory http/mysite/news/local because part of it, http/mysite/news, already exists as a file.
Of course, if the URIs are crawled in the reverse order, the file can't be created because a directory already exists with that name.

Make sense?

The real killer is that when this happens, it's fatal to the job. That is, it doesn't just fail to get that one URL; the connector returns a fatal error and the crawl is stopped.

Mark
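The collision can be reproduced outside of MCF with plain file system calls. The sketch below is only an illustration under assumptions: the helper names are made up, the paths are the ones from the example above, and each error is caught per document purely so that both crawl orders can be shown side by side (the connector, as described, stops the job instead).

```python
import os
import tempfile


def save_as_file(root: str, rel_path: str) -> None:
    """Create the parent directories for rel_path, then write a file there."""
    full = os.path.join(root, *rel_path.split("/"))
    os.makedirs(os.path.dirname(full), exist_ok=True)
    with open(full, "w") as fh:
        fh.write("content")


def try_save(root: str, rel_path: str) -> bool:
    """Return True if the document was saved, False on a file system error."""
    try:
        save_as_file(root, rel_path)
        return True
    except OSError:
        # Raised when a path component is a file where a directory is
        # needed, or a directory sits where the file should be written.
        return False


def crawl(rel_paths: list[str]) -> list[bool]:
    """Save each path in order under a fresh temporary root directory."""
    with tempfile.TemporaryDirectory() as root:
        return [try_save(root, p) for p in rel_paths]
```

Calling `crawl(["http/mysite/news", "http/mysite/news/local/today.html"])` returns `[True, False]`: the second save fails because `news` is already a file. Reversing the list also returns `[True, False]`, this time because `news` already exists as a directory when the file is written.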