lucene-solr-user mailing list archives

From "Rosa (Anuncios)" <rosaemailanunc...@gmail.com>
Subject Re: [ANNOUNCE] Web Crawler
Date Wed, 02 Mar 2011 08:36:01 GMT
Nice job!

It would be good to be able to extract specific data from a given page 
via XPath, though.
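
To make the suggestion concrete, here is a minimal sketch of XPath-based extraction using only the JDK's javax.xml.xpath API. It is not part of Crawl Anywhere: the class name, the extract helper, and the sample page are hypothetical, and it assumes the page is well-formed XHTML (real-world HTML would first need a tolerant parser or a cleanup step).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtractSketch {

    // Hypothetical helper: returns the string value of the first node
    // matching the XPath expression, or "" when nothing matches.
    // Assumes the input is well-formed XML/XHTML.
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page, for illustration only.
        String page = "<html><head><title>Product page</title></head>"
                + "<body><div id=\"menu\">Home</div>"
                + "<span class=\"price\">19.90</span></body></html>";

        System.out.println(extract(page, "//span[@class='price']"));
        System.out.println(extract(page, "/html/head/title"));
    }
}
```

Something along these lines could presumably live in a custom pipeline stage, with the XPath expressions configured per web site.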

Regards,


On 02/03/2011 01:25, Dominique Bejean wrote:
> Hi,
>
> I would like to announce Crawl Anywhere. Crawl Anywhere is a Java web 
> crawler. It includes:
>
>    * a crawler
>    * a document processing pipeline
>    * a solr indexer
>
> The crawler has a web administration interface for managing the web 
> sites to be crawled. Each web site crawl can be configured with many 
> parameters (not all mandatory):
>
>    * number of items crawled simultaneously per site
>    * recrawl period rules based on item type (HTML, PDF, …)
>    * item type inclusion / exclusion rules
>    * item path inclusion / exclusion / strategy rules
>    * max depth
>    * web site authentication
>    * language
>    * country
>    * tags
>    * collections
>    * ...
>
> The pipeline includes various ready-to-use stages (text extraction, 
> language detection, a writer that produces Solr-ready XML, ...).
>
> Everything is highly configurable and extensible, either by scripting 
> or by Java coding.
>
> With scripting, you can help the crawler handle JavaScript links, or 
> help the pipeline extract the relevant title and clean up the HTML 
> pages (remove menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stages.
>
> The Crawl Anywhere web site provides good explanations and 
> screenshots. Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download it and try it out 
> from www.crawl-anywhere.com
>
>
> Regards
>
> Dominique
>
>

