nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: LanguageIdentifier refactoring
Date Tue, 05 Jul 2005 13:02:40 GMT
Jerome,

I have an issue with the language detection plugin that I'm not sure
how to address. The plugin first tries to extract the language
identifier from meta tags. However, the meta tag values people put there
are often completely wrong, or follow obscure pseudo-standards.

Example: there is a bunch of pages, generated by FrontPage, where the
author apparently forgot to change the default settings. So the meta
tags say "en-us", while the real content of the page is in Spanish. The
identify() method shows this clearly.

The final value put in X-meta-lang is "en-us". Now, the question is - 
should the plugin override that value with the one from the 
auto-detection? This means that it should always run the detection 
step... Can we have more confidence in our detection mechanism than in 
the author's knowledge? Well, perhaps, if for content longer than xxx 
bytes the detection is nearly unambiguous.

Another example: for a bunch of pages in Swedish, I collected the 
following values of X-meta-lang:

(SCHEME=ISO.639-1) sv
(SCHEME=ISO639-1) sv
(SCHEME=RFC1766) sv-FI
(SCHEME=Z39.53) SWE
EN_US, SV, EN, EN_UK
English Swedish
English, swedish
English,Swedish
Other (Svenska)
SE
SV
SV charset=iso-8859-1
SV-FI
SV; charset=iso-8859-1
SVE
SW
SWE
SWEDISH
Sv
Sve
Svenska
Swedish
Swedish, svenska
en, sv
se
se, en
se,en,de
se-sv
sv
sv, be, dk, de, fr, no, pt, ch, fi, en
sv, dk, fi, gl, is, fo
sv, dk, no
sv, en
sv, eng
sv, eng, de
sv, fr, eng
sv, nl
sv, no, de
sv, no, en, de, dk, fi
sv,en
sv,en,de,fr
sv,eng
sv,eng,de,fr
sv,no,fi
sv-FI
sv-SE
sv-en
sv-fi
sv-se
sv; Content-Language: sv
sv_SE
sve
svenska
svenska, swedish, engelska, english, norsk, norwegian, polska, polish
sw
swe
swe.SPR.
sweden
swedish
swedish,
text/html; charset=sv-SE
text/html; sv
torp, stuga, uthyres, bed & breakfast


In all cases the value from the detection routine was unambiguous: Swedish.

In this light, I propose the following changes:

* modify the identify() method to return a pair of lang code + relative 
score (normalized to 0..1)

* in HTMLLanguageParser we should always run 
LanguageIdentifier.identify(parse.getText())

* if the meta tag is null, we take the value from identify()

* if the value from identify() is null, we take the meta tag value.

* if the meta tag is not null and the value from identify() is not null:

	* if the content is shorter than "lang.analyze.max.length",
	  we take the meta tag value

	* else, if the meta tag and identify values are different:

		* if the score from identify() is above "certainty"
		  threshold (0.8?), we take the value from identify().

		* else, we take the meta tag value.
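
To make the rules above concrete, here is a rough sketch in Java. It is
not the plugin's actual code: the class LanguageChoiceSketch, the
Detection pair and the choose() signature are all hypothetical names
made up for illustration. The length limit and the certainty threshold
are passed in as plain parameters (in the plugin they would presumably
come from the configuration, e.g. "lang.analyze.max.length"), and the
meta tag is compared to the detected code with a simple case-insensitive
string match, which is also just an assumption.

```java
// Hypothetical sketch of the proposed selection rules; not Nutch code.
final class LanguageChoiceSketch {

    // Assumed shape of the pair that identify() would return:
    // a language code plus a relative score normalized to 0..1.
    static final class Detection {
        final String lang;
        final double score;

        Detection(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    // metaLang:        value taken from the meta tags, or null
    // detected:        result of the detection step, or null
    // contentLength:   length of the parsed text
    // lengthThreshold: below this, the meta tag value is kept
    // certainty:       score above which detection overrides the meta tag
    static String choose(String metaLang, Detection detected,
                         int contentLength, int lengthThreshold,
                         double certainty) {
        if (metaLang == null) {
            // No meta tag: take whatever detection produced (may be null).
            return detected == null ? null : detected.lang;
        }
        if (detected == null) {
            // Detection gave nothing: fall back to the meta tag.
            return metaLang;
        }
        if (contentLength < lengthThreshold) {
            // Short content: keep the meta tag value.
            return metaLang;
        }
        boolean differ = !metaLang.equalsIgnoreCase(detected.lang);
        if (differ && detected.score > certainty) {
            // Values disagree and detection is confident: detection wins.
            return detected.lang;
        }
        return metaLang;
    }

    public static void main(String[] args) {
        // The FrontPage case: meta tag says "en-us", text is Spanish.
        Detection es = new Detection("es", 0.95);
        System.out.println(choose("en-us", es, 5000, 1000, 0.8)); // prints es
    }
}
```

The numbers in main() (5000, 1000, 0.95) are made up; only the 0.8
threshold comes from the proposal, and even that is a guess.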

Similar changes would be needed in LanguageIndexingFilter.filter(), to 
handle text coming from other content types.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

