incubator-droids-dev mailing list archives

From "Tony Dietrich" <t...@dietrich.org.uk>
Subject RE: Question re: Ajax processing
Date Mon, 11 Apr 2011 20:56:13 GMT
Err, sorry Fuad, but you are wrong. Even Google disagrees with you /grin! -
they happily crawl my company's Ajax-based website and return fully
fleshed-out pages (from their cache!). However, they don't seem to be
willing to share their secrets /sigh.

I've previously implemented a service using HtmlUnit that does just this.
Although the package isn't intended for the purpose, it worked well. It
needed some manipulation to work in a multi-threaded environment, since
HtmlUnit isn't thread-safe, but it still worked. However, that service
wasn't a web crawler, just a web-based
download-on-request, parse-and-copy-content service.

I also have a test-case, based on webSphinx, that makes use of HtmlUnit as
the downloader. However, webSphinx hasn't been maintained for many years and
bringing the code-base up to scratch would take as long as writing my own.

Basically, HtmlUnit is a headless browser that includes both a CSS and a
JavaScript processor, and it can be queried for the downloaded (and
finalised) page. It also has the capability to perform asynchronous Ajax
transactions.
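For what it's worth, the core of that first service boiled down to something like the sketch below. This is against the HtmlUnit API as I remember it; the class name, URL, and wait-time are mine, and newer HtmlUnit releases rename the cleanup method and move the classes to the org.htmlunit package, so treat it as a sketch rather than drop-in code.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Sketch only: fetch a page, let its onload/Ajax JavaScript run, then read
// back the finalised DOM. The URL is illustrative, and the wait time is an
// arbitrary upper bound that would need tuning per site.
public class AjaxSnapshot {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient(); // JavaScript is enabled by default
        try {
            HtmlPage page = client.getPage("http://www.example.com/");
            // Block until background JavaScript (e.g. Ajax fired from the
            // onload handlers) goes quiet, waiting at most 10 seconds.
            client.waitForBackgroundJavaScript(10000);
            System.out.println(page.asXml()); // the fully populated document
        } finally {
            client.closeAllWindows(); // client.close() in newer versions
        }
    }
}
```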

I'm not particularly worried about the events that might be triggered by
user-interaction, more the onload() events that cause the page to be fully
initialised with all content, as first seen by a browser user. As used in my
first-mentioned service, it worked a treat.

Implemented as a queue-based, 'multi-windowed' headless browser, the service
overcame the cost of initialising the browser by creating a singleton
instance shared between requests. Each requested page was opened and
processed in a new 'window', the 'window' was then closed to clean up the
memory, and the result was returned to the query thread. The individual
'windows' provide a listener which is triggered on various events, such as
the completion of the page load, and I used this feature to trigger the
return of the result to the query thread, which sat waiting on a monitor
for the event.
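The queue-plus-monitor arrangement above can be sketched in plain Java. All names here are hypothetical: a single worker thread stands in for the shared browser instance, the "processing" step stands in for opening a window and loading the page, and a CompletableFuture stands in for the monitor the query thread waits on.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the pattern described above: request threads queue
// work for one shared worker (the singleton 'browser'), then block until the
// worker signals completion -- the equivalent of waiting on a monitor for
// the page-load event.
public class SharedBrowserQueue {
    private static final class Request {
        final String url;
        final CompletableFuture<String> result = new CompletableFuture<>();
        Request(String url) { this.url = url; }
    }

    private final BlockingQueue<Request> queue = new ArrayBlockingQueue<>(64);

    public SharedBrowserQueue() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Request r = queue.take();
                    // In the real service this would open a new 'window',
                    // load r.url, wait for the load-complete listener to
                    // fire, then close the window to reclaim memory.
                    r.result.complete("<html>content of " + r.url + "</html>");
                }
            } catch (InterruptedException ignored) {
                // shut down quietly
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called from any request thread; blocks until the worker is done.
    public String fetch(String url) throws Exception {
        Request r = new Request(url);
        queue.put(r);
        return r.result.get();
    }

    public static void main(String[] args) throws Exception {
        SharedBrowserQueue browser = new SharedBrowserQueue();
        System.out.println(browser.fetch("http://example.com/a"));
    }
}
```

The point of the single worker is that the expensive browser start-up happens once, while each request still gets its own isolated 'window'-equivalent.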

I appreciate this is a heavy-weight component, since creating each 'window'
takes quite a lot of time/cycles, and don't expect to be crawling huge
numbers of pages.

If Droids doesn't currently have this capability, is there anyone who can
talk me through (off-list) the process for creating it? I'd be happy to add
it back to the code-base if requested, or to make the code available on
request to anyone who needs it.


Tony

-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: 11 April 2011 21:21
To: droids-dev@incubator.apache.org
Subject: RE: Question re: Ajax processing

>>>If I'm wrong, can someone point me in the right way to ensure that a
remote crawl of a website will indeed return a fully populated document
whether or not the site uses Ajax/JavaScript to populate elements within the
page after load?

 This is ABSOLUTELY impossible, no one can do it. This is even THEORETICALLY
impossible, because DOM manipulations are event-driven and unpredictable.
AJAX-based websites can be "crawled" only if these websites generate
search-engine friendly HTML (for instance, if the website is fully
functional even for users with JavaScript disabled).


