lucene-solr-user mailing list archives

From Charlie Hull <char...@flax.co.uk>
Subject Re: Solr Web Crawler - Robots.txt
Date Fri, 02 Jun 2017 09:26:38 GMT
On 02/06/2017 00:56, Doug Turnbull wrote:
> Scrapy is fantastic and I use it to scrape search results pages for clients
> to take quality snapshots for relevance work.

+1 for Scrapy; it was built by a team at Mydeco.com while we were 
building their search backend and has gone from strength to strength since.
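
A minimal spider sketch, in case it helps anyone getting started (the spider
name, start URL and CSS selectors are illustrative and would need adjusting to
the page being snapshotted; it assumes Scrapy's standard scrapy.Spider API):

    import scrapy

    class SearchResultsSpider(scrapy.Spider):
        # Illustrative name; point start_urls at the results page you want to snapshot.
        name = "search_results"
        start_urls = ["https://www.example.com/search?q=solr"]

        def parse(self, response):
            # Selectors are hypothetical; adjust them to the page's actual markup.
            for result in response.css("div.result"):
                yield {
                    "title": result.css("h3::text").extract_first(),
                    "url": result.css("a::attr(href)").extract_first(),
                }

A standalone spider like that can be run with
"scrapy runspider search_spider.py -o snapshot.json" to dump the scraped
items as JSON.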

Cheers

Charlie
>
> Ignoring robots.txt sometimes legitimately comes up because a staging site
> might be telling Google not to crawl it, while nobody cares about a developer
> crawling it for internal purposes.
>
> Doug
> On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood <wunder@wunderwood.org>
> wrote:
>
>> Which was exactly what I suggested.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Jun 1, 2017, at 3:31 PM, David Choi <choi.david.e@gmail.com> wrote:
>>>
>>> In the meantime, I have found that a better solution at the moment is to
>>> test on a site that allows users to crawl it.
>>>
>>> On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.david.e@gmail.com>
>>> wrote:
>>>
>>>> I think you misunderstand; the argument was about stealing content.
>>>> Sorry, but I think you need to read what people write before making
>>>> bold statements.
>>>>
>>>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wunder@wunderwood.org>
>>>> wrote:
>>>>
>>>>> Let’s not get snarky right away, especially when you are wrong.
>>>>>
>>>>> Corporations do not generally ignore robots.txt. I worked on a
>>>>> commercial web spider for ten years. Occasionally, our customers did
>>>>> need to bypass portions of robots.txt. That was usually because of a
>>>>> poorly-maintained web server, or because our spider could safely crawl
>>>>> some content that would cause problems for other crawlers.
>>>>>
>>>>> If you want to learn crawling, don’t start by breaking the conventions
>>>>> of good web citizenship. Instead, start with sitemap.xml and crawl the
>>>>> preferred portions of a site.
>>>>>
>>>>> https://www.sitemaps.org/index.html
>>>>>
>>>>> If the site blocks you, find a different site to learn on.
>>>>>
>>>>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>>>>> anything big, but I’d start with that for learning.
>>>>>
>>>>> https://scrapy.org/
>>>>>
>>>>> If you want to learn on a site with a lot of content, try ours,
>>>>> chegg.com. But if your crawler gets out of hand, crawling too fast,
>>>>> we’ll block it. Any other site will do the same.
>>>>>
>>>>> I would not base the crawler directly on Solr. A crawler needs a
>>>>> dedicated database to record the URLs visited, errors, duplicates, etc.
>>>>> The output of the crawl goes to Solr. That is how we did it with
>>>>> Ultraseek (before Solr existed).
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> wunder@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>
>>>>>
>>>>>> On Jun 1, 2017, at 3:01 PM, David Choi <choi.david.e@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Oh well, I guess it's OK if a corporation does it, but not someone
>>>>>> wanting to learn more about the field. I have actually written a
>>>>>> crawler before, as well as the kind of inverted index Solr is built
>>>>>> on, but I just thought Solr's architecture was better suited for
>>>>>> scaling.
>>>>>>
>>>>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recursive@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> And I mean that in the context of stealing content from sites that
>>>>>>> explicitly declare they don't want to be crawled. Robots.txt is to
>>>>>>> be followed.
>>>>>>>
>>>>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <choi.david.e@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I was wondering if anyone could guide me on how to crawl the web
>>>>>>>> and ignore the robots.txt, since I cannot index some big sites. Or
>>>>>>>> if someone could point out how to get around it. I read somewhere
>>>>>>>> about a protocol.plugin.check.robots setting, but that was for
>>>>>>>> Nutch.
>>>>>>>>
>>>>>>>> The way I index is
>>>>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>>>>>>>
>>>>>>>> but I can't index that site, I'm guessing because of its robots.txt.
>>>>>>>> I can index with
>>>>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>>>>>>>
>>>>>>>> which I am guessing allows it. I was also wondering how to find the
>>>>>>>> name of the crawler bin/post uses.
>>>>>>>
>>>>>
>>>>>
>>
>>
>
>
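
For anyone following Walter's sitemap-first advice above, the polite starting
point is only a few lines of standard-library Python: ask robots.txt before
fetching anything, and take the URL list from sitemap.xml rather than
discovering links yourself. A rough sketch (the site URL is illustrative; the
printed URLs would then be fetched and indexed into Solr, e.g. via bin/post):

    import urllib.request
    import urllib.robotparser
    import xml.etree.ElementTree as ET

    SITE = "https://www.example.com"   # illustrative; use a site that permits crawling
    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    # Honour robots.txt: check before fetching anything.
    robots = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
    robots.read()

    # Take the crawl frontier from sitemap.xml instead of discovering links.
    with urllib.request.urlopen(SITE + "/sitemap.xml") as resp:
        tree = ET.parse(resp)

    for loc in tree.iter(SITEMAP_NS + "loc"):
        url = loc.text.strip()
        if robots.can_fetch("*", url):
            print(url)   # fetch these, then post the pages to Solr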


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
