nutch-dev mailing list archives

From Ken Krugler
Subject Charset detection algorithm
Date Sat, 06 Nov 2010 19:03:22 GMT
Hi all,

See for a Tika issue  
I'm currently working on, which has to do with the charset detection  

There's the HTML5 proposal, where the priority is

- charset from Content-Type response header
- charset from HTML <meta http-equiv content-type> element
- charset detected from page contents
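
As a rough illustration only, the priority order above could be sketched like this in Python. This is the plain HTML5-style order as listed, not Reinhard's variation or my modification. The function names, the simplified regexes, and the `detect` callback (a stand-in for whatever content-based detector is in use) are all assumptions made for the example, not Tika or Nutch APIs.

```python
import re
from typing import Callable, Optional

# Finds "charset=..." in a Content-Type value, with optional quotes.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:\-]+)", re.IGNORECASE)

# Simplified match for <meta http-equiv="content-type" ...>; a real HTML
# parser would be more robust than a regex.
_META_RE = re.compile(
    rb"<meta[^>]*http-equiv\s*=\s*[\"']?content-type[\"']?[^>]*>", re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset named in a Content-Type value, or None."""
    if not content_type:
        return None
    match = _CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes) -> Optional[str]:
    """Return the charset from a <meta http-equiv> tag near the start of the page."""
    match = _META_RE.search(body[:8192])  # only scan the first 8 KB (arbitrary limit)
    if not match:
        return None
    return charset_from_header(match.group(0).decode("ascii", errors="replace"))


def choose_charset(
    content_type: Optional[str],
    body: bytes,
    detect: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Apply the three sources in priority order: header, meta, detection."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(body)
        or detect(body)
    )
```

Note that with this ordering a wrong charset in the response header always wins, because the later sources are only consulted when the earlier ones are missing.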

Reinhard Schwab proposed a variation on the HTML5 approach, which  
makes sense to me; in my web crawling experience, too many servers lie  
to just blindly trust the response header contents.

I've got a slight modification to Reinhard's approach, as described in  
a comment on the above issue:


I'm interested in comments.


-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
