incubator-droids-dev mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: Question re: Ajax processing
Date Mon, 11 Apr 2011 22:24:51 GMT

Tony, an extremely simple use case: dynamically assigning a CSS class to HTML
elements... Search engines operate on plain text retrieved from
HTML/PDF/(even the meta-tags of video files)/...; but what you describe here is
"regenerating the user's screen", "emulating a web browser"; that is a
different use case (some websites, such as www.alexa.com, www.quantcast.com,
and even www.google.com, do that for home pages).

Also, a "w3c document" is not the same as "SGML"... and "AJAX" is sometimes a
workaround for buggy browsers & dynamic CSS; DOJO developers are mostly
worried about IE8 vs. Mozilla... and it's not just "dynamic HTML"; it could
be a popup window (which can't be defined as a single HTML document).


This is my additional (mis-)understanding of the discussion:
- we already have a computer with a browser which can do this
- we already have a digital camera which can take a snapshot of a screen
- is it related to PLAIN TEXT retrieval for indexing?
- what is the UNIQUE "resource identifier" for this plain text, and do we have
it for generic AJAX use cases? (we do have it for the very naïve "onload()" AJAX)

Just as a sample of the technology... web portals can use AJAX to load
portlets, and each portlet has a unique URL for its "MAXIMIZED" state, and
each such "portal" page lists those URLs for robots - "search engine
friendliness"; but that is very basic AJAX. To convert a sophisticated
OOP-style JavaScript object into HTML you need a kind of "transformation
rules", which are themselves JavaScript, and you will have to worry about
memory leaks in (still mostly buggy) popular libraries (which are sometimes
huge)... and what about viruses and threats inside the JavaScript? I can't
imagine a "polite" robot trying to run AJAX.
Sorry if I misunderstood...


Of course... It would be nice to have an Apache library which can generate an
image (JPEG) of a homepage!




-----Original Message-----
From: Tony Dietrich [mailto:tony@dietrich.org.uk] 
Sent: April-11-11 4:56 PM
To: droids-dev@incubator.apache.org
Subject: RE: Question re: Ajax processing

Err, sorry Fuad, but you are wrong. Even Google disagrees with you /grin! -
they happily crawl my company's Ajax-based website and return fully fleshed-out
pages (from their cache!). However, they don't seem to be willing to share
their secrets /sigh.

I've previously implemented a service using HtmlUnit that does just this.
Although the package isn't intended for the purpose, it worked well. It needed
some manipulation to work well in a multi-threaded environment, since HtmlUnit
isn't thread-safe, but it still worked. However, that service wasn't a
web-crawler, just a web-based download-on-request, parse-and-copy-content
service.

I also have a test-case, based on webSphinx, that makes use of HtmlUnit as
the downloader. However, webSphinx hasn't been maintained for many years and
bringing the code-base up to scratch would take as long as writing my own.

Basically HtmlUnit is a headless browser which includes both a CSS and a
JavaScript processor and which can be queried to return the downloaded (and
finalised) page. It includes the capability to perform asynchronous Ajax
transactions.
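To make that concrete, here is a minimal sketch of fetching a JavaScript-populated page with HtmlUnit. This is my own illustration, not code from the service Tony describes; it assumes an HtmlUnit 2.x dependency on the classpath, and the URL is a placeholder:

```java
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AjaxFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            // Re-synchronise asynchronous XHR calls so onload()-style Ajax
            // content is present before the page is read back.
            client.setAjaxController(new NicelyResynchronizingAjaxController());
            HtmlPage page = client.getPage("http://example.com/");
            // Give any remaining background JavaScript up to 10s to finish.
            client.waitForBackgroundJavaScript(10_000);
            System.out.println(page.asXml()); // the "finalised" DOM as markup
        }
    }
}
```

The `NicelyResynchronizingAjaxController` is what turns asynchronous Ajax transactions into synchronous ones for crawling purposes.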

I'm not particularly worried about the events that might be triggered by
user-interaction, more the onload() events that cause the page to be fully
initialised with all content, as first seen by a browser user. As used in my
first-mentioned service, it worked a treat.

Implemented as a queue-based, 'multi-windowed' headless browser, the service
overcame the browser-initialisation overhead by creating a singleton instance
shared between requests: it opened and processed each requested page in a new
'window', closed the 'window' to clean up the memory, and then returned the
result to the query thread. The individual 'windows' provide a listener which
is triggered on various events, such as the completion of the page load, and I
used this feature to signal the query thread, which sat waiting on a monitor
for the event.
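The hand-off described above can be sketched in plain Java, without HtmlUnit: a single shared worker (standing in for the singleton browser) drains a request queue, and each querying thread blocks on a future until its page is done. The `renderer` function here is a hypothetical stand-in for the real open-window/process/close-window step:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Sketch of a queue-based singleton service: one worker thread processes
// requests serially (one "window" per request), and completing the future
// plays the role of the page-load listener firing.
class HeadlessBrowserService {
    private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();

    private static final class Request {
        final String url;
        final CompletableFuture<String> result = new CompletableFuture<>();
        Request(String url) { this.url = url; }
    }

    HeadlessBrowserService(Function<String, String> renderer) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Request r = queue.take();               // next queued page
                    try {
                        r.result.complete(renderer.apply(r.url));
                    } catch (RuntimeException e) {
                        r.result.completeExceptionally(e);  // propagate failure
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();         // shutdown signal
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // The calling thread waits here, like the monitor wait in the original.
    String fetch(String url) throws Exception {
        Request r = new Request(url);
        queue.put(r);
        return r.result.get(30, TimeUnit.SECONDS);
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        HeadlessBrowserService svc = new HeadlessBrowserService(
                url -> "<html><!-- rendered " + url + " --></html>");
        System.out.println(svc.fetch("http://example.com/"));
    }
}
```

Serialising requests through one worker also sidesteps the thread-safety problem mentioned earlier, at the cost of throughput.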

I appreciate this is a heavy-weight component, since creating each 'window'
takes quite a lot of time/cycles, and don't expect to be crawling huge
numbers of pages.

If Droids doesn't currently have this capability, is there anyone who can talk
me through (off-list) the process for creating it? I'd be happy to add it back
to the code-base if requested, or make the code available to anyone who needs
it.


Tony

-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca]
Sent: 11 April 2011 21:21
To: droids-dev@incubator.apache.org
Subject: RE: Question re: Ajax processing

>>>If I'm wrong, can someone point me in the right way to ensure that a
remote crawl of a website will indeed return a fully populated document
whether or not the site uses Ajax/JavaScript to populate elements within the
page after load?

 This is ABSOLUTELY impossible; no one can do it. It is even THEORETICALLY
impossible, because DOM manipulations are event-driven and unpredictable.
AJAX-based websites can be "crawled" only if they generate search-engine
friendly HTML (for instance, if the website is fully functional even for users
with JavaScript disabled).


