lucene-solr-user mailing list archives

From Lukáš Vlček <lukas.vl...@gmail.com>
Subject Re: [ANNOUNCE] Web Crawler
Date Wed, 02 Mar 2011 09:01:22 GMT
Hi,

is there any plan to open source it?

Regards,
Lukas

[OT] I tried HuriSearch and entered "Java" into the search field; it returned a lot
of references to ColdFusion error pages. Maybe a recrawl would help?

On Wed, Mar 2, 2011 at 1:25 AM, Dominique Bejean
<dominique.bejean@eolya.fr> wrote:

> Hi,
>
> I would like to announce Crawl Anywhere. Crawl Anywhere is a Java web
> crawler. It includes:
>
>   * a crawler
>   * a document processing pipeline
>   * a solr indexer
>
> The crawler has a web administration interface for managing the web sites to be
> crawled. Each web site crawl can be configured with many parameters
> (not all mandatory):
>
>   * number of simultaneous items crawled by site
>   * recrawl period rules based on item type (html, PDF, …)
>   * item type inclusion / exclusion rules
>   * item path inclusion / exclusion / strategy rules
>   * max depth
>   * web site authentication
>   * language
>   * country
>   * tags
>   * collections
>   * ...
>
> The pipeline includes various ready-to-use stages (text extraction,
> language detection, a Solr ready-to-index XML writer, ...).
>
> Everything is highly configurable and extensible, either by scripting or Java coding.
>
> With scripting, you can help the crawler handle JavaScript
> links or help the pipeline extract the relevant title and clean up the HTML
> pages (remove menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stages.
>
> The Crawl Anywhere web site provides good explanations and screenshots.
> Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download and try it out from here:
> www.crawl-anywhere.com
>
>
> Regards
>
> Dominique
>
>
