lucene-solr-user mailing list archives

From Mike Drob <md...@apache.org>
Subject Re: Solr Web Crawler - Robots.txt
Date Thu, 01 Jun 2017 23:58:17 GMT
Isn't this exactly what Apache Nutch was built for?

On Thu, Jun 1, 2017 at 6:56 PM, David Choi <choi.david.e@gmail.com> wrote:

> In any case after digging further I have found where it checks for
> robots.txt. Thanks!
>
> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood <wunder@wunderwood.org>
> wrote:
>
> > Which was exactly what I suggested.
> >
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Jun 1, 2017, at 3:31 PM, David Choi <choi.david.e@gmail.com> wrote:
> > >
> > > In the mean time I have found a better solution at the moment is to test
> > > on a site that allows users to crawl their site.
> > >
> > > On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.david.e@gmail.com> wrote:
> > >
> > >> I think you misunderstand; the argument was about stealing content. Sorry,
> > >> but I think you need to read what people write before making bold
> > >> statements.
> > >>
> > >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wunder@wunderwood.org>
> > >> wrote:
> > >>
> > >>> Let’s not get snarky right away, especially when you are wrong.
> > >>>
> > >>> Corporations do not generally ignore robots.txt. I worked on a commercial
> > >>> web spider for ten years. Occasionally, our customers did need to bypass
> > >>> portions of robots.txt. That was usually because of a poorly-maintained
> > >>> web server, or because our spider could safely crawl some content that
> > >>> would cause problems for other crawlers.
> > >>>
> > >>> If you want to learn crawling, don’t start by breaking the conventions of
> > >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> > >>> preferred portions of a site.
> > >>>
> > >>> https://www.sitemaps.org/index.html
> > >>>
> > >>> If the site blocks you, find a different site to learn on.
> > >>>
> > >>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> > >>> anything big, but I’d start with that for learning.
> > >>>
> > >>> https://scrapy.org/
> > >>>
> > >>> If you want to learn on a site with a lot of content, try ours, chegg.com.
> > >>> But if your crawler gets out of hand, crawling too fast, we’ll block it.
> > >>> Any other site will do the same.
> > >>>
> > >>> I would not base the crawler directly on Solr. A crawler needs a
> > >>> dedicated database to record the URLs visited, errors, duplicates, etc.
> > >>> The output of the crawl goes to Solr. That is how we did it with Ultraseek
> > >>> (before Solr existed).
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wunder@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> > >>>
> > >>>> On Jun 1, 2017, at 3:01 PM, David Choi <choi.david.e@gmail.com> wrote:
> > >>>>
> > >>>> Oh well, I guess it's ok if a corporation does it but not someone wanting
> > >>>> to learn more about the field. I actually have written a crawler before,
> > >>>> as well as the, you know, inverted index of how Solr works, but I just
> > >>>> thought its architecture was better suited for scaling.
> > >>>>
> > >>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recursive@gmail.com> wrote:
> > >>>>
> > >>>>> And I mean that in the context of stealing content from sites that
> > >>>>> explicitly declare they don't want to be crawled. Robots.txt is to be
> > >>>>> followed.
> > >>>>>
> > >>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <choi.david.e@gmail.com> wrote:
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> I was wondering if anyone could guide me on how to crawl the web and
> > >>>>>> ignore the robots.txt since I can not index some big sites. Or if
> > >>>>>> someone could point how to get around it. I read somewhere about a
> > >>>>>> protocol.plugin.check.robots
> > >>>>>> but that was for nutch.
> > >>>>>>
> > >>>>>> The way I index is
> > >>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
> > >>>>>>
> > >>>>>> but I can't index the site I'm guessing because of the robots.txt.
> > >>>>>> I can index with
> > >>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
> > >>>>>>
> > >>>>>> which I am guessing allows it. I was also wondering how to find the name
> > >>>>>> of the crawler bin/post uses.
> > >>>>>
> > >>>
> > >>>
> >
> >
>
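
For anyone reading this thread who wants to follow the advice to respect robots.txt while learning, here is a minimal sketch of checking a URL before fetching it. It uses Python's standard `urllib.robotparser`; it is not how bin/post or Nutch performs the check. The sample robots.txt text, the `learning-crawler` user-agent name, the example.org URLs, and the helper names `build_parser` and `is_allowed` are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "learning-crawler" is an assumed user-agent name, not a real crawler.
allowed_public = is_allowed(parser, "learning-crawler", "https://example.org/public/page")
allowed_private = is_allowed(parser, "learning-crawler", "https://example.org/private/page")

# Delay in seconds requested by the site, or None if no Crawl-delay is set.
delay = parser.crawl_delay("learning-crawler")
```

Against a live site, one would normally call `set_url()` with the site's robots.txt address and then `read()` instead of parsing a string, and sleep for the requested delay between requests so the crawler does not get blocked for crawling too fast.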

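The suggestion above that a crawler needs its own database for URLs visited, errors, and duplicates, with only the crawl output going to Solr, could look roughly like the sketch below. This is an assumption-laden illustration using Python's built-in `sqlite3`, not a description of how Ultraseek or any other product stored its crawl state. The table layout and the function names (`open_crawl_db`, `enqueue`, `record_fetch`, `record_error`) are hypothetical.

```python
import hashlib
import sqlite3


def open_crawl_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the crawl-state database. Schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url          TEXT PRIMARY KEY,
               status       TEXT NOT NULL DEFAULT 'pending',
               error        TEXT,
               content_hash TEXT
           )"""
    )
    return conn


def enqueue(conn: sqlite3.Connection, url: str) -> bool:
    """Add a URL to the frontier. Returns False if it was already known."""
    cur = conn.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


def record_fetch(conn: sqlite3.Connection, url: str, body: bytes) -> str:
    """Store the fetch result; mark as 'duplicate' if another URL had the same body."""
    digest = hashlib.sha256(body).hexdigest()
    seen = conn.execute(
        "SELECT 1 FROM pages WHERE content_hash = ? AND url != ?", (digest, url)
    ).fetchone()
    status = "duplicate" if seen else "fetched"
    conn.execute(
        "UPDATE pages SET status = ?, content_hash = ?, error = NULL WHERE url = ?",
        (status, digest, url),
    )
    conn.commit()
    return status


def record_error(conn: sqlite3.Connection, url: str, message: str) -> None:
    """Remember that fetching this URL failed, and why."""
    conn.execute(
        "UPDATE pages SET status = 'error', error = ? WHERE url = ?", (message, url)
    )
    conn.commit()
```

Only rows that end up with status `fetched` would then be sent on to Solr for indexing; the crawl bookkeeping itself stays out of the search index.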