nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1323) AjaxNormalizer
Date Tue, 11 Mar 2014 10:02:42 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1323:
---------------------------------

    Attachment: NUTCH-1323-1.8.patch

Updated patch for trunk.

The normalizer now relies on SCOPE_INDEXER; otherwise the other rules are tried. This solves
the problem of cumbersome usage. This new patch does not solve the problem of relative URLs.
As far as I know, relative URLs never make it to normalizers anyway. To confirm, I did a test
crawl of the http://si.draagle.com/ homepage (with the crazy cookie thing, really, check it out!).
Here's the output of readdb:

{code}
Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval seconds;Retry interval days;Score;Signature;Metadata
"http://si.draagle.com/";6;"db_notmodified";Tue Apr 22 11:57:29 CEST 2014;Tue Mar 11 10:55:11 CET 2014;0;3628800.0;42.0;0.0;"c44af84abaf0042685a03bf2ecfd2927";"Content-Type:text/html|||_pst_:success(1), lastModified=0|||_rs_:25|||"
"http://si.draagle.com/?_escaped_fragment_=/basket/show/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/?_escaped_fragment_=/browse/group/root/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/?_escaped_fragment_=/login/";1;"db_unfetched";Tue Mar 11 10:57:32 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/draagle_pogoji_uporabe.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/profiles.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://si.draagle.com/tvspot.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.apta-medica.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.draagle.si/bolezni/index.html";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.medicina-danes.si/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.novartisoncology.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.orlkotnik.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
"http://www.zobozdravstvolavtar.com/";1;"db_unfetched";Tue Mar 11 10:55:14 CET 2014;Thu Jan 01 01:00:00 CET 1970;0;2592000.0;30.0;0.0;"null";""
{code}

I think this patch is nearly ready. Any other things to worry about?

> AjaxNormalizer
> --------------
>
>                 Key: NUTCH-1323
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1323
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch
>
>
> A two-way normalizer for Nutch able to deal with AJAX URLs, converting them to _escaped_fragment_ URLs and back to AJAX URLs.
> https://developers.google.com/webmasters/ajax-crawling/
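
For readers unfamiliar with the scheme linked above, here is a rough sketch of the two-way
conversion the issue describes. It is not the code from the attached patch: the class name
AjaxUrlSketch and its methods are made up for illustration, and the escaping rule (percent-encode
bytes 0x00-0x20, '#', '%', '&', '+' and 0x7F-0xFF) is my reading of the AJAX crawling
specification. The "#!" form of the si.draagle.com URL in main() is likewise an assumption
derived from the readdb output, not something taken from the crawl itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; NOT the normalizer from NUTCH-1323-1.8.patch.
final class AjaxUrlSketch {
    static final String HASH_BANG = "#!";
    static final String ESCAPED = "_escaped_fragment_=";

    // "#!" URL -> "_escaped_fragment_" URL; URLs without "#!" are returned unchanged.
    static String toEscapedFragment(String url) {
        int i = url.indexOf(HASH_BANG);
        if (i < 0) return url;
        String base = url.substring(0, i);
        String fragment = url.substring(i + HASH_BANG.length());
        // Append to an existing query string with '&', otherwise start one with '?'.
        char sep = base.indexOf('?') >= 0 ? '&' : '?';
        return base + sep + ESCAPED + escape(fragment);
    }

    // "_escaped_fragment_" URL -> "#!" URL; other URLs are returned unchanged.
    static String toAjaxUrl(String url) {
        int i = url.indexOf(ESCAPED);
        if (i < 1) return url;
        char sep = url.charAt(i - 1);
        if (sep != '?' && sep != '&') return url;
        String base = url.substring(0, i - 1);
        return base + HASH_BANG + unescape(url.substring(i + ESCAPED.length()));
    }

    // Percent-encode the bytes assumed to be special in the fragment value.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c <= 0x20 || c >= 0x7F || c == '#' || c == '%' || c == '&' || c == '+') {
                sb.append('%').append(String.format("%02X", c));
            } else {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Decode %XX sequences; malformed sequences are copied through as-is.
    static String unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                int cp = s.codePointAt(i);
                byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i += Character.charCount(cp) - 1;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ajax = "http://si.draagle.com/#!/basket/show/"; // assumed original form
        String escaped = toEscapedFragment(ajax);
        System.out.println(escaped);
        System.out.println(toAjaxUrl(escaped));
    }
}
```

Note that the sketch leaves '/' unescaped, which is consistent with the _escaped_fragment_
entries in the readdb dump. It does not deal with relative URLs, scopes, or an
_escaped_fragment_ parameter that is followed by further query parameters.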



--
This message was sent by Atlassian JIRA
(v6.2#6252)
