commons-dev mailing list archives

From Mark Thomas <ma...@apache.org>
Subject Re: [CODEC] Beider Morse Phonetic Matching Bug and questions
Date Wed, 11 Jun 2014 12:29:30 GMT
On 11/06/2014 12:56, Thomas Neidhart wrote:
> Hi,
> 
> as already commented on https://issues.apache.org/jira/browse/CODEC-187 the
> problem is related to some wrongly ported rule files from the original
> source.
> 
> This, otoh, creates a serious problem for us, as it looks like the
> Beider-Morse phonetic matching encoder in commons-codec is derived work
> from a php codebase released under the GPLv3 licence.
> The original codebase is available at http://stevemorse.org/phoneticinfo.htm.
> While investigating the bug and comparing our rule files with the ones from
> the original codebase, it is quite clear that at least these are identical.
> 
> The author of the patch (see https://issues.apache.org/jira/browse/CODEC-125)
> ported the code and applied the Apache license, but the license of the
> original codebase was never considered or discussed.
> 
> This is quite serious I guess, as we have already released the code. We can
> ask the original authors to re-license their code to the Apache Software
> Foundation under a compatible license, but I wonder if they are willing to
> do so.
> This encoder is also used a lot in lucene/solr so it might have even larger
> implications.
> 
> Any ideas how to proceed or if a re-licensing would be sufficient in this
> case?

Re-licensing or permission from the original authors would be
sufficient. If that is not forthcoming then there is no option but to
delete the code.

Replacing any removed code with a 'clean-room' implementation would be
acceptable but in that case the removal of the current code must not
wait for any replacement.

Mark

> 
> Thomas
> 
> 
> On Wed, Jun 11, 2014 at 9:08 AM, Michael Tobias <michael@tobias.org.uk>
> wrote:
> 
>> Does anybody have a working knowledge of the coding of the Beider Morse
>> Phonetic Matching in the Apache Commons Codec?
>>
>>
>>
>> My recent tests using Solr suggest there is a discrepancy between Steve
>> Morse and Alexander Beider's algorithm and the algorithm currently live in
>> Solr (and hence the Commons Codec).
>>
>>
>>
>> I know that the source code for BMPM issued by Steve has changed several
>> times over the years, and I thought at first it might be that the version
>> used in the Commons Codec is an old version that has subsequently been
>> overtaken.  Should the version of the BMPM algorithm not be listed in the
>> Commons Codec documentation? How should version changes to the algorithm be
>> implemented? The algorithm is quite static now so this is probably not so
>> important now but surely it should be DOCUMENTED???
>>
>>
>>
>> My tests now indicate that the discrepancies are NOT a version problem as
>> testing against a very old version 2.00 of the BMPM source code issued on
>> 18 June 2009 still exhibits the same problem.
>>
>>
>>
>> Using just a single test term the results are not good. The only saving
>> grace is that the most widely used version is
>>
>>
>>
>> nameType="GENERIC" ruleType="APPROX"
>>
>>
>>
>> and that is a close (but not perfect) match at least for this ONE test
>> word.
>>
>>
>>
>> For the name Abram, all with languageSet="auto"
>>
>>
>>
>> GENERIC APPROX - fails - misses a few tokens
>>
>> Should create tokens: abram abrom avram avrom obram obrom ovram ovrom abran
>> abron obran obron Ybram Ybrom
>>
>> Solr creates: abram abrom avram avrom obram obrom ovram ovrom abran abron
>> obran obron
>>
>>
>>
>> GENERIC EXACT - good!
>>
>> Should create tokens: abram abran
>>
>> Solr creates: abram abran
>>
>>
>>
>> ASHKENAZI APPROX: - fails dreadfully!
>>
>> Should create tokens: abram abrom avram avrom obram obrom ovram ovrom Ybram
>> Ybrom ombram ombrom imbram imbrom
>>
>> Solr creates: abrAm AvrAm BbrAm
>>
>>
>>
>> ASHKENAZI EXACT: - good!
>>
>> Should create tokens: abram
>>
>> Solr creates: abram
>>
>>
>>
>> SEPHARDIC APPROX: - good!
>>
>> Should create tokens: abram bram abran bran avram vram
>>
>> Solr creates: abram bram abran bran avram vram
>>
>>
>>
>> SEPHARDIC EXACT: - good!
>>
>> Should create tokens: abram abran avram
>>
>> Solr creates: abram abran avram
>>
>>
>>
>> I would appreciate it if somebody with knowledge of the programming of this
>> functionality could investigate.
>>
>>
>>
>> For the worst case I attach here a debug trace of the calculation of the
>> Ashkenazi Approx tokens straight from Steve Morse's implementation. It looks
>> like some of the final rules are not being implemented properly, or at all.
>> The language codes in parentheses vary from BMPM version to version but the
>> resulting tokens have not changed from version 2.00 up to the current 3.02.
>>
>>
>>
>> Thanks
>>
>>
>>
>> Michael
>>
>>
>>
>>
>>
>>
>>
>> applying language rules from (rulesany) to abram using languages 2012
>>
>> char codes = [#61]a [#62]b [#72]r [#61]a [#6d]m
>>
>> applying rule #225
>>    pattern=a
>>    lcontext=
>>    rcontext=[bcdgkpstwzż]
>>    subst=(A|B[128])
>>    result=(A[2012]|B[128])
>>
>> applying rule #229
>>    pattern=b
>>    lcontext=
>>    rcontext=
>>    subst=b
>>    result=(Ab[2012]|Bb[128])
>>
>> applying rule #245
>>    pattern=r
>>    lcontext=
>>    rcontext=
>>    subst=r
>>    result=(Abr[2012]|Bbr[128])
>>
>> applying rule #228
>>    pattern=a
>>    lcontext=
>>    rcontext=
>>    subst=A
>>    result=(AbrA[2012]|BbrA[128])
>>
>> applying rule #240
>>    pattern=m
>>    lcontext=
>>    rcontext=
>>    subst=m
>>    result=(AbrAm[2012]|BbrAm[128])
>>
>> after language rules: (AbrAm[2012]|BbrAm[128])
>>
>>
>> applying final rules from (exactapproxcommon plus approxcommon) to
>> AbrAm[2012]
>> no rules match for phonetic item 0 at position 0: A
>> no rules match for phonetic item 0 at position 1: Ab
>> no rules match for phonetic item 0 at position 2: Abr
>> no rules match for phonetic item 0 at position 3: AbrA
>> no rules match for phonetic item 0 at position 4: AbrAm
>>
>> applying final rules from (exactapproxcommon plus approxcommon) to
>> BbrAm[128]
>> no rules match for phonetic item 1 at position 0: B
>> no rules match for phonetic item 1 at position 1: Bb
>> no rules match for phonetic item 1 at position 2: Bbr
>> no rules match for phonetic item 1 at position 3: BbrA
>> no rules match for phonetic item 1 at position 4: BbrAm
>>
>> applying final rules from (approxany) to AbrAm[2012]
>> after applying final rule #97 to phonetic item #0 at position 0:
>> (a[2012]|o[2012]|Y[16]) pattern=A lcontext= rcontext= subst=(a|o|Y[16])
>> after applying final rule #0 to phonetic item #0 at position 1:
>> (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16]) pattern=b lcontext= rcontext=
>> subst=(b|v[1024])
>> no rules match for phonetic item 0 at position 2:
>> (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])r
>> after applying final rule #93 to phonetic item #0 at position 3:
>>
>> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])
>> pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
>> no rules match for phonetic item 0 at position 4:
>>
>> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])m
>>
>> applying final rules from (approxany) to BbrAm[128]
>> after applying final rule #22 to phonetic item #1 at position 0:
>> (o[2012]|om[128]|im[128]) pattern=B lcontext= rcontext=[bp]
>> subst=(o|om[128]|im[128])
>> after applying final rule #0 to phonetic item #1 at position 1:
>> (ob[2012]|ov[1024]|omb[128]|imb[128]) pattern=b lcontext= rcontext=
>> subst=(b|v[1024])
>> no rules match for phonetic item 1 at position 2:
>> (ob[2012]|ov[1024]|omb[128]|imb[128])r
>> after applying final rule #93 to phonetic item #1 at position 3:
>>
>> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])
>> pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
>> no rules match for phonetic item 1 at position 4:
>>
>> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])m
>>
>>
>>
>>
>>
>>
>>
>> resulting tokens:
>>
>>
>>
>> (abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|ombram|ombrom|imbram|imbrom)
>>
>>
> 
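
The token lists quoted above for the name Abram can be compared mechanically
rather than by eye. The following is a minimal sketch in plain Java that
diffs an expected token list against the one Solr produced. The class and
method names (TokenDiff, tokens, missing) are hypothetical helpers made up
for this illustration and are not part of commons-codec or Solr; the token
strings are copied from the report above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of commons-codec.
final class TokenDiff {

    /** Splits a whitespace-separated token list into an insertion-ordered set. */
    static Set<String> tokens(String list) {
        return new LinkedHashSet<>(Arrays.asList(list.trim().split("\\s+")));
    }

    /** Returns the tokens of {@code expected} that do not occur in {@code actual}. */
    static Set<String> missing(Set<String> expected, Set<String> actual) {
        Set<String> diff = new LinkedHashSet<>(expected);
        diff.removeAll(actual);
        return diff;
    }

    public static void main(String[] args) {
        // GENERIC APPROX lists as reported in the quoted message.
        Set<String> expected = tokens(
            "abram abrom avram avrom obram obrom ovram ovrom abran abron "
            + "obran obron Ybram Ybrom");
        Set<String> solr = tokens(
            "abram abrom avram avrom obram obrom ovram ovrom abran abron "
            + "obran obron");

        System.out.println("missing from Solr:  " + missing(expected, solr));
        // prints: missing from Solr:  [Ybram, Ybrom]
        System.out.println("unexpected in Solr: " + missing(solr, expected));
        // prints: unexpected in Solr: []
    }
}
```

The comparison is case-sensitive on purpose: in the ASHKENAZI APPROX case the
Solr output (abrAm AvrAm BbrAm) differs from the expected tokens in case as
well as in content, so none of the three would count as a match.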

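The language-rule part of the quoted debug trace (rules #225 to #245 applied
to "abram") can be replayed with a few lines of plain Java. This is only a
sketch under stated assumptions, not the code from commons-codec or from the
original PHP source: it assumes that the first matching rule in list order
wins, that rcontext is a regular expression anchored at the start of the text
to the right of the match (and lcontext at the end of the text to the left),
and that a bracketed number such as [128] is a language bit mask that is
ANDed with the current mask. The last assumption agrees with the numbers in
the trace, since 2012 & 128 = 128. All class and method names are
hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch only; names and semantics are assumptions (see lead-in).
final class LanguageRuleSketch {

    /** One rule as printed in the trace: pattern, contexts, substitution. */
    static final class Rule {
        final String pattern, lcontext, rcontext, subst;

        Rule(String pattern, String lcontext, String rcontext, String subst) {
            this.pattern = pattern;
            this.lcontext = lcontext;
            this.rcontext = rcontext;
            this.subst = subst;
        }

        /** True if this rule applies to {@code input} at index {@code i}. */
        boolean matches(String input, int i) {
            if (!input.startsWith(pattern, i)) return false;
            String left = input.substring(0, i);
            String right = input.substring(i + pattern.length());
            boolean leftOk = lcontext.isEmpty()
                || Pattern.compile("(?:" + lcontext + ")$").matcher(left).find();
            boolean rightOk = rcontext.isEmpty()
                || Pattern.compile("^(?:" + rcontext + ")").matcher(right).find();
            return leftOk && rightOk;
        }
    }

    /** Appends each alternative of {@code subst} to each partial result. */
    static Map<String, Integer> extend(Map<String, Integer> partial, String subst) {
        String body = subst.startsWith("(")
            ? subst.substring(1, subst.length() - 1) : subst;
        Map<String, Integer> next = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : partial.entrySet()) {
            for (String alt : body.split("\\|")) {
                int bracket = alt.indexOf('[');
                String text = bracket < 0 ? alt : alt.substring(0, bracket);
                int mask = bracket < 0 ? p.getValue()
                    : p.getValue() & Integer.parseInt(
                        alt.substring(bracket + 1, alt.length() - 1));
                // Drop alternatives whose language mask becomes empty.
                if (mask != 0) next.merge(p.getKey() + text, mask, (a, b) -> a | b);
            }
        }
        return next;
    }

    /** Applies the first matching rule at each position, left to right. */
    static Map<String, Integer> apply(List<Rule> rules, String input, int languages) {
        Map<String, Integer> result = new LinkedHashMap<>();
        result.put("", languages);
        int i = 0;
        while (i < input.length()) {
            Rule hit = null;
            for (Rule r : rules) {
                if (r.matches(input, i)) { hit = r; break; }
            }
            if (hit == null) { i++; continue; } // no rule: skip this character
            result = extend(result, hit.subst);
            i += hit.pattern.length();
        }
        return result;
    }

    /** The five rules that fire in the quoted trace, in rule-number order. */
    static List<Rule> traceRules() {
        return List.of(
            new Rule("a", "", "[bcdgkpstwz\u017c]", "(A|B[128])"), // #225
            new Rule("a", "", "", "A"),                            // #228
            new Rule("b", "", "", "b"),                            // #229
            new Rule("m", "", "", "m"),                            // #240
            new Rule("r", "", "", "r"));                           // #245
    }

    public static void main(String[] args) {
        System.out.println(apply(traceRules(), "abram", 2012));
        // prints: {AbrAm=2012, BbrAm=128}
    }
}
```

The printed map corresponds to the trace line "after language rules:
(AbrAm[2012]|BbrAm[128])". The final rules (approxany) use the same
pattern/lcontext/rcontext/subst shape, so the same sketch could in principle
be pointed at them, subject to the same assumptions.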

