lucene-solr-user mailing list archives

From Geert-Jan Brits <gbr...@gmail.com>
Subject Re: [ANNOUNCE] Web Crawler
Date Wed, 02 Mar 2011 11:20:32 GMT
Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a
scheme/wrapper from a set of example web pages (the research field is
usually called 'wrapper induction').
This would allow fast scheme creation, which could then be used as a basis
for extraction.

Lately I've been looking for crawlers that incorporate this technology, but
without success.
Are there any plans to incorporate this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean <dominique.bejean@eolya.fr>

> Rosa,
>
> In the pipeline, there is a stage that extracts the text from the original
> document (PDF, HTML, ...).
> It is possible to plug in scripts (Java 6 compliant) in order to keep only
> the relevant parts of the document.
> See
> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>
> Dominique
>
> On 02/03/11 09:36, Rosa (Anuncios) wrote:
>
>> Nice job!
>>
>> It would be good to be able to extract specific data from a given page via
>> XPath, though.
>>
>> Regards,
>>
>>
>> On 02/03/2011 01:25, Dominique Bejean wrote:
>>
>>> Hi,
>>>
>>> I would like to announce Crawl Anywhere. Crawl Anywhere is a Java web
>>> crawler. It includes:
>>>
>>>   * a crawler
>>>   * a document processing pipeline
>>>   * a Solr indexer
>>>
>>> The crawler has a web administration interface for managing the web sites
>>> to be crawled. Each web site crawl can be configured with many parameters
>>> (not all mandatory):
>>>
>>>   * number of simultaneous items crawled by site
>>>   * recrawl period rules based on item type (HTML, PDF, ...)
>>>   * item type inclusion / exclusion rules
>>>   * item path inclusion / exclusion / strategy rules
>>>   * max depth
>>>   * web site authentication
>>>   * language
>>>   * country
>>>   * tags
>>>   * collections
>>>   * ...
>>>
>>> The pipeline includes various ready-to-use stages (text extraction,
>>> language detection, a writer producing XML ready to be indexed by Solr, ...).
>>>
>>> Everything is highly configurable and extensible, either by scripting or
>>> by Java coding.
>>>
>>> With scripting, you can help the crawler handle JavaScript links, or help
>>> the pipeline extract the relevant title and clean up the HTML pages
>>> (remove menus, headers, footers, ...).
>>>
>>> With Java coding, you can develop your own pipeline stages.
>>>
>>> The Crawl Anywhere web site provides good explanations and screenshots.
>>> Everything is documented in a wiki.
>>>
>>> The current version is 1.1.4. You can download it and try it out here:
>>> www.crawl-anywhere.com
>>>
>>>
>>> Regards
>>>
>>> Dominique
>>>
>>>
>>>
>>
>>
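
The XPath-based extraction that Rosa asks for in this thread can be illustrated independently of Crawl Anywhere with the JDK's own javax.xml.xpath API. The following is only a minimal sketch under stated assumptions: it is not Crawl Anywhere's scripting interface (which this thread does not describe), the class name, method name and sample page are hypothetical, and the input must be well-formed XHTML, since the JDK parser rejects typical real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Crawl Anywhere.
public class XPathExtractSketch {

    // Parses well-formed XHTML and returns the string value of the first
    // node matched by the XPath expression (empty string if nothing matches).
    static String extract(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page: a menu to ignore and one field to keep.
        String page = "<html><body><div id=\"menu\">Home</div>"
                + "<div id=\"price\">42 EUR</div></body></html>";
        System.out.println(extract(page, "//div[@id='price']/text()"));
    }
}
```

For pages that are not well-formed, an HTML-tolerant parser would have to produce the DOM first; the XPath evaluation step itself would stay the same.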
