lucene-solr-user mailing list archives

From Dominique Bejean <dominique.bej...@eolya.fr>
Subject Re: [ANNOUNCE] Web Crawler
Date Wed, 02 Mar 2011 14:46:25 GMT
Hi,

No, it doesn't. It looks like an Apache HttpClient 3.x limitation:
https://issues.apache.org/jira/browse/HTTPCLIENT-579
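For reference, NTLM credentials in HttpClient 3.x are configured through NTCredentials, but the library only speaks the original NTLM dialect; when the server insists on NTLMv2, authentication fails as described in the issue above. A minimal sketch (host name and credentials are placeholders):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.NTCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.GetMethod;

class NtlmFetch {
    public static void main(String[] args) throws Exception {
        // NTLM setup with HttpClient 3.x (placeholder host and credentials).
        // This works for NTLM(v1) only; NTLMv2 is not supported -- see HTTPCLIENT-579.
        HttpClient client = new HttpClient();
        client.getState().setCredentials(
                new AuthScope("intranet.example.com", 80, AuthScope.ANY_REALM),
                new NTCredentials("user", "secret", "workstation", "DOMAIN"));
        GetMethod get = new GetMethod("http://intranet.example.com/protected/");
        int status = client.executeMethod(get);
        System.out.println("HTTP status: " + status);
    }
}
```

The usual workaround is moving to HttpClient 4.x, which allows plugging in an NTLMv2-capable engine (for example via JCIFS).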

Dominique

On 02/03/11 15:04, Thumuluri, Sai wrote:
> Dominique, does your crawler support NTLM2 authentication? We have content under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.
>
> -----Original Message-----
> From: Dominique Bejean [mailto:dominique.bejean@eolya.fr]
> Sent: Wednesday, March 02, 2011 6:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: [ANNOUNCE] Web Crawler
>
> Aditya,
>
> The crawler is not open source and won't be in the near future. Anyway,
> I have to change the license text, because the crawler can be used for any
> personal or commercial project.
>
> Sincerely,
>
> Dominique
>
> On 02/03/11 10:02, findbestopensource wrote:
>> Hello Dominique Bejean,
>>
>> Good job.
>>
>> We have identified some eight open source web crawlers at
>> http://www.findbestopensource.com/tagged/webcrawler   I don't know how
>> far yours differs from the rest.
>>
>> Your license states that it is not open source, but that it is free for
>> personal use.
>>
>> Regards
>> Aditya
>> www.findbestopensource.com
>>
>>
>> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
>> <dominique.bejean@eolya.fr> wrote:
>>
>>      Hi,
>>
>>      I would like to announce Crawl Anywhere. Crawl Anywhere is a Java
>>      web crawler. It includes:
>>
>>        * a crawler
>>        * a document processing pipeline
>>        * a solr indexer
>>
>>      The crawler has a web administration interface for managing the web
>>      sites to be crawled. Each web site crawl is configured with many
>>      possible parameters (not all mandatory):
>>
>>        * number of simultaneous items crawled by site
>>        * recrawl period rules based on item type (html, PDF, ...)
>>        * item type inclusion / exclusion rules
>>        * item path inclusion / exclusion / strategy rules
>>        * max depth
>>        * web site authentication
>>        * language
>>        * country
>>        * tags
>>        * collections
>>        * ...
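Inclusion and exclusion rules of the kind listed above are commonly implemented as ordered regular-expression filters over item URLs or paths. The sketch below is purely illustrative (the UrlFilter class and its method names are hypothetical, not Crawl Anywhere's actual API), assuming exclusion rules are checked before inclusion rules:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical URL filter: exclusion rules win over inclusion rules.
// Not Crawl Anywhere's real API -- just a common way such rule sets work.
class UrlFilter {
    private final List<Pattern> include;
    private final List<Pattern> exclude;

    UrlFilter(List<String> includeRules, List<String> excludeRules) {
        this.include = includeRules.stream().map(Pattern::compile).toList();
        this.exclude = excludeRules.stream().map(Pattern::compile).toList();
    }

    boolean accepts(String url) {
        for (Pattern p : exclude) {
            if (p.matcher(url).find()) return false;  // explicitly excluded
        }
        if (include.isEmpty()) return true;           // no include rules: allow all
        for (Pattern p : include) {
            if (p.matcher(url).find()) return true;   // explicitly included
        }
        return false;                                  // matched by no include rule
    }
}
```

For example, a filter built with include rule "\\.html$" and exclude rule "/private/" would accept http://site/a.html but reject http://site/private/a.html and http://site/a.pdf.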
>>
>>      The pipeline includes various ready-to-use stages (text
>>      extraction, language detection, a writer producing Solr-ready XML, ...).
>>
>>      Everything is highly configurable and extensible, either by
>>      scripting or by Java coding.
>>
>>      With scripting, you can help the crawler handle JavaScript links,
>>      or help the pipeline extract a relevant title and clean up the
>>      HTML pages (remove menus, headers, footers, ...).
>>
>>      With Java coding, you can develop your own pipeline stages.
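A custom pipeline stage typically boils down to a small class implementing a processing contract. The interface and class names below are hypothetical (Crawl Anywhere's real extension API may differ); they only illustrate the usual shape of such plug-in points, with a document modeled as a simple field map:

```java
import java.util.Map;

// Hypothetical pipeline-stage contract -- not Crawl Anywhere's actual
// interface, just the common shape of document-pipeline plug-ins.
interface Stage {
    void process(Map<String, String> document);
}

// Example custom stage: normalizes the "title" field to lower case
// before the document reaches the Solr indexer.
class TitleNormalizer implements Stage {
    public void process(Map<String, String> document) {
        String title = document.get("title");
        if (title != null) {
            document.put("title", title.toLowerCase());
        }
    }
}
```

Stages written this way can be chained in order, each one reading and rewriting the shared document fields.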
>>
>>      The Crawl Anywhere web site provides good explanations and
>>      screenshots. Everything is documented in a wiki.
>>
>>      The current version is 1.1.4. You can download it and try it out
>>      from here: www.crawl-anywhere.com
>>
>>
>>      Regards
>>
>>      Dominique
>>
>>
