manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: webcrawler connector and dns lookups behind corporate http proxy
Date Tue, 11 Oct 2016 22:37:22 GMT
Hi Markus,

Many crawlers do the DNS lookup once to save time and network bandwidth.
MCF's web crawler is no different.
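
As a rough illustration of the "look it up once" idea (this is only a sketch,
not MCF's actual code; the class name DnsCacheSketch and its Resolver
interface are made up for the example), a crawler can memoize the result of
the first lookup per hostname:

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of ManifoldCF: resolves each hostname once
// and serves every later request for the same name from memory.
final class DnsCacheSketch {
    // Assumed abstraction so the lookup strategy can be swapped out.
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final Resolver resolver;
    private final Map<String, InetAddress> cache = new HashMap<>();
    private int lookups = 0;

    DnsCacheSketch(Resolver resolver) {
        this.resolver = resolver;
    }

    // Returns the cached address, or asks the resolver on the first request.
    synchronized InetAddress lookup(String host) {
        InetAddress cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        try {
            lookups++;
            InetAddress addr = resolver.resolve(host);
            cache.put(host, addr);
            return addr;
        } catch (UnknownHostException e) {
            // Failed lookups are not cached; the caller sees the failure.
            throw new UncheckedIOException(e);
        }
    }

    // Number of times the underlying resolver was actually invoked.
    synchronized int lookupCount() {
        return lookups;
    }

    public static void main(String[] args) {
        // Stub resolver with a fixed address so the demo needs no network.
        DnsCacheSketch dns = new DnsCacheSketch(
            host -> InetAddress.getByAddress(host, new byte[] {10, 0, 0, 1}));
        dns.lookup("example.invalid");
        dns.lookup("example.invalid");
        System.out.println("resolver calls: " + dns.lookupCount());
    }
}
```

With a real resolver you would pass InetAddress::getByName instead of the
stub, which is exactly the call that fails when the machine has no direct DNS
access.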

Thanks,
Karl


On Tue, Oct 11, 2016 at 6:11 PM, Markus Schuch <markus_schuch@web.de> wrote:

> Hi Karl,
>
> thanks for the suggestion. I tried it, but the crawled website sends 301
> redirects to the canonical hostname when pages are requested directly via
> IP address, which again leads to the IP lookup.
> I guess I'm stuck with the /etc/hosts solution then. This will get messy if
> the IP changes often.
>
> I'm interested in understanding the mechanics of the crawler better: what is
> the reason for resolving the IP addresses instead of using the hostnames?
>
> Thanks
> Markus
>
>
> Sent: Monday, 10 October 2016 at 22:00
> From: "Karl Wright" <daddywri@gmail.com>
> To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
> Subject: Re: webcrawler connector and dns lookups behind corporate http
> proxy
>
> If the proxy is not authenticated, I think you can just put the IP address
> in as the machine name and it should work.  But that's all I can think of.
>
> Karl
>
>
> On Mon, Oct 10, 2016 at 3:44 PM, Markus Schuch <markus_schuch@web.de> wrote:
>
> Hi @ the lovely mcf community out there,
>
> in our setup we run ManifoldCF (2.3) behind a corporate HTTP proxy server
> and we try to crawl specific web pages on the internet.
>
> We run into java.net.UnknownHostException because the
> connector tries to resolve the IP of the hostname. This fails because our
> network setup does not allow direct DNS lookups for internet pages, and
> the JDK's InetAddress.getByName() call relies on the system's DNS lookup
> mechanisms. All internet traffic goes through the corporate HTTP proxy
> server, which does all necessary DNS resolution on its side.
>
> Can you think of any other (more elegant) solution besides adding the
> records to /etc/hosts on the crawler's machine?
>
> Many thanks in advance,
> Markus
>
>
>
