commons-dev mailing list archives

From "Michael Tobias" <mich...@tobias.org.uk>
Subject [CODEC] Beider Morse Phonetic Matching (BMPM) and Daitch-Mokotoff Soundex (DM)
Date Fri, 13 Jun 2014 16:05:35 GMT
I recently joined this list as I have started to examine Apache Solr and am extremely interested
in using soundex and phonetic tokens.

I have already pointed out some bugs in the current implementation of BMPM in Commons Codec,
and one of them has already been fixed.

Having checked the archived messages about the introduction of BMPM, I see that implementing
Daitch-Mokotoff Soundex alongside it was also discussed. It looks like this was never taken up,
but I am really interested in having this functionality.

Daitch-Mokotoff is a much simpler algorithm than BMPM (though it can 'branch' and produce
multiple tokens for the same word). It uses a rules table along with a few additional
instructions. The algorithm is in the public domain and there are various implementations
available (including a few apparently written in Java, but I am not convinced they are correct).
If it is felt necessary, I can get written permission from Gary Mokotoff and Randy Daitch to
allow the algorithm to be used.
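
As a rough illustration of the "rules table plus branching" idea, here is a minimal Java sketch.
Everything in it is an assumption made for illustration: the class name BranchingSoundexSketch,
the encode method, and above all the rules table, which is a toy table and NOT the published
Daitch-Mokotoff table. It also leaves out the additional instructions of the real algorithm
(position-dependent codes, collapsing of repeated codes, padding to a fixed length). It only
shows how one ambiguous rule turns a single word into several tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class BranchingSoundexSketch {

    // Toy rules table: letter pattern -> alternative codes.
    // Illustrative placeholders only, not the published Daitch-Mokotoff table.
    private static final Map<String, String[]> RULES = new LinkedHashMap<>();
    static {
        RULES.put("CH", new String[] {"5", "4"}); // two alternatives, so the encoding branches
        RULES.put("S", new String[] {"4"});
        RULES.put("T", new String[] {"3"});
        RULES.put("N", new String[] {"6"});
        RULES.put("R", new String[] {"9"});
    }

    // Returns every token the toy table can produce for the given word.
    static Set<String> encode(String word) {
        String input = word.toUpperCase(Locale.ROOT);
        List<String> branches = new ArrayList<>();
        branches.add("");
        int i = 0;
        while (i < input.length()) {
            String pattern = longestMatch(input, i);
            if (pattern == null) {
                i++; // letters without a rule are skipped in this sketch
                continue;
            }
            // Extend every existing branch with every alternative code.
            List<String> next = new ArrayList<>();
            for (String branch : branches) {
                for (String code : RULES.get(pattern)) {
                    next.add(branch + code);
                }
            }
            branches = next;
            i += pattern.length();
        }
        return new LinkedHashSet<>(branches);
    }

    // Picks the longest pattern in the table that matches at the given position.
    private static String longestMatch(String input, int pos) {
        String best = null;
        for (String pattern : RULES.keySet()) {
            if (input.startsWith(pattern, pos)
                    && (best == null || pattern.length() > best.length())) {
                best = pattern;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(encode("Chant"));
        System.out.println(encode("Stern"));
    }
}
```

With this toy table, "Chant" yields two tokens (563 and 463) because "CH" has two alternative
codes, while "Stern" yields the single token 4396. A real port would replace the table with the
agreed rules and add the remaining instructions.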

I am currently discussing some changes to the algorithm with Gary Mokotoff and hope to have
them agreed shortly.

At that point I will probably have a simple PHP implementation (not my code, but permission
to adapt it will be granted), which I would be interested in having ported to Java for inclusion
in Commons Codec.

Is anybody on this list interested in assisting with this and porting an agreed PHP implementation
to Java?  I will be happy to test all output until we are satisfied it is fully functional.

Thanks

Michael





