lucene-solr-user mailing list archives

From Dominique Bejean <dominique.bej...@eolya.fr>
Subject Re: [ANNOUNCE] Web Crawler
Date Wed, 02 Mar 2011 11:28:14 GMT
Hi,

The crawler comes with an extensible document processing pipeline. If
you know Java libraries or web services for 'wrapper induction'
processing, it is possible to implement a dedicated stage in the
pipeline.
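
As a rough sketch, a dedicated stage could look like the class below
(the process() contract shown here is illustrative only, not the
actual Crawl Anywhere stage API):

import java.util.Map;

// Illustrative stage skeleton; the real stage interface may differ.
public class WrapperInductionStage {

    // Receives the fields produced by earlier stages and adds the
    // fields extracted by the induced wrapper.
    public void process(Map<String, String> fields) {
        String html = fields.get("content");
        if (html == null) {
            return;
        }
        // Call a wrapper-induction library or web service here and
        // store the structured result as new fields.
        fields.put("extracted_title", applyInducedWrapper(html, "title"));
    }

    private String applyInducedWrapper(String html, String field) {
        return ""; // placeholder for the actual extraction logic
    }
}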

Dominique

On 02/03/11 12:20, Geert-Jan Brits wrote:
> Hi Dominique,
>
> This looks nice.
> In the past, I've been interested in (semi-)automatically inducing a
> scheme/wrapper from a set of example webpages (often called 'wrapper
> induction' in the scientific field).
> This would allow for fast scheme creation, which could be used as a
> basis for extraction.
>
> Lately I've been looking for crawlers that incorporate this
> technology, but without success.
> Any plans on incorporating this?
>
> Cheers,
> Geert-Jan
>
> 2011/3/2 Dominique Bejean <dominique.bejean@eolya.fr>
>
>     Rosa,
>
>     In the pipeline, there is a stage that extracts the text from the
>     original document (PDF, HTML, ...).
>     It is possible to plug scripts (Java 6 compliant) in order to keep
>     only relevant parts of the document.
>     See
>     http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
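>
>     For example, a minimal cleanup script might look like this (plain
>     Java 6; the "content" div id is made up for the example):
>
>     import java.util.regex.Matcher;
>     import java.util.regex.Pattern;
>
>     // Keep only the main content block of the page before the text
>     // extraction stage runs.
>     public class KeepMainContent {
>         public static String clean(String html) {
>             Pattern p = Pattern.compile(
>                 "<div id=\"content\">(.*?)</div>", Pattern.DOTALL);
>             Matcher m = p.matcher(html);
>             return m.find() ? m.group(1) : html;
>         }
>     }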
>
>     Dominique
>
>     On 02/03/11 09:36, Rosa (Anuncios) wrote:
>
>         Nice job!
>
>         It would be good to be able to extract specific data from a
>         given page via XPath though.
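>
>         For instance, something along the lines of the standard JDK
>         XPath API (sketch below; it assumes the page has first been
>         tidied into a well-formed DOM):
>
>         import java.io.File;
>         import javax.xml.parsers.DocumentBuilderFactory;
>         import javax.xml.xpath.XPath;
>         import javax.xml.xpath.XPathFactory;
>         import org.w3c.dom.Document;
>
>         public class XPathExtract {
>             public static void main(String[] args) throws Exception {
>                 // Parse the page into a DOM.
>                 Document doc = DocumentBuilderFactory.newInstance()
>                         .newDocumentBuilder().parse(new File("page.xhtml"));
>                 // Pull one specific field out of the page.
>                 XPath xpath = XPathFactory.newInstance().newXPath();
>                 String price = xpath.evaluate(
>                         "//span[@class='price']/text()", doc);
>                 System.out.println(price);
>             }
>         }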
>
>         Regards,
>
>
>         On 02/03/2011 01:25, Dominique Bejean wrote:
>
>             Hi,
>
>             I would like to announce Crawl Anywhere, a Java web
>             crawler. It includes:
>
>               * a crawler
>               * a document processing pipeline
>               * a solr indexer
>
>             The crawler has a web administration interface to manage
>             the web sites to be crawled. Each web site crawl is
>             configured with many possible parameters (not all
>             mandatory):
>
>               * number of simultaneous items crawled by site
>               * recrawl period rules based on item type (HTML, PDF, …)
>               * item type inclusion / exclusion rules
>               * item path inclusion / exclusion / strategy rules
>               * max depth
>               * web site authentication
>               * language
>               * country
>               * tags
>               * collections
>               * ...
>
>             The pipeline includes various ready-to-use stages (text
>             extraction, language detection, a Solr ready-to-index XML
>             writer, ...).
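>
>             For instance, the XML writer produces documents in Solr's
>             update format, roughly like this (the field names other
>             than "id" are examples and depend on your Solr schema):
>
>             <add>
>               <doc>
>                 <field name="id">http://example.com/page.html</field>
>                 <field name="title">Example page title</field>
>                 <field name="text">Extracted page text ...</field>
>               </doc>
>             </add>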
>
>             Everything is very configurable and extensible, either by
>             scripting or by Java coding.
>
>             With scripting, you can help the crawler handle JavaScript
>             links, or help the pipeline extract the relevant title and
>             clean up the HTML pages (remove menus, headers, footers,
>             ...).
>
>             With Java coding, you can develop your own pipeline
>             stages.
>
>             The Crawl Anywhere web site provides good explanations
>             and screenshots. Everything is documented in a wiki.
>
>             The current version is 1.1.4. You can download and try it
>             out from here: http://www.crawl-anywhere.com
>
>
>             Regards
>
>             Dominique
