# mahout-user mailing list archives

From David Noel <david.i.n...@gmail.com>
Subject Re: Cleaning dataset prior to clustering
Date Sun, 08 Jun 2014 17:40:03 GMT
```
On 6/4/14, David Noel <david.i.noel@gmail.com> wrote:
> On 6/3/14, David Noel <david.i.noel@gmail.com> wrote:
>> Does anyone know of any libraries to clean common strings
>> from a set of strings (Java, preferably)?
>
> I wound up rolling one of my own. It wasn't too bad.

Can't seem to settle on an ideal algorithm. Essentially the problem is
that I have frequently repeated, somewhat similar substrings appended
and prepended to each news article. Because I know the strings appear
at either the beginning or the end of the article, I'm able to reduce
the search space somewhat. I've come up with two algorithms. The
question I have is how to select the optimal substring for each
document based on its Levenshtein distance. A rough overview of each
algorithm follows:

Pull a list of all the domains I've crawled from the database.
For each domain in the result set, pull all of the corresponding
articles for that domain.
.For each article in our result set
..For each subsequent article in our result set
...For every substring starting at the beginning of each document
(tokenized by word, not character, so each substring consists of one
more word of the document than the last)
....Generate and store the Levenshtein distance for each substring
...End for
...For every generated Levenshtein distance
....Find the furthest position of the most often repeated Levenshtein distance
...End for
..End for
.End for
End for
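
// The innermost loop above doesn't need a separate Levenshtein call
// per substring: the diagonal of a single standard dynamic-programming
// table already holds the distance between the length-k prefixes of
// the two documents for every k. A minimal Java sketch (class and
// method names are my own, not an existing library):

class PrefixDistance {
    // diag[k] = Levenshtein distance between the first k+1 tokens of x
    // and the first k+1 tokens of y.
    static int[] diagonal(String[] x, String[] y) {
        int n = Math.min(x.length, y.length);
        int[][] d = new int[n + 1][n + 1];
        for (int i = 0; i <= n; i++) { d[i][0] = i; d[0][i] = i; }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                int sub = x[i - 1].equals(y[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                            d[i][j - 1] + 1),   // insertion
                                   d[i - 1][j - 1] + sub);      // substitution
            }
        }
        int[] diag = new int[n];
        for (int k = 0; k < n; k++) diag[k] = d[k + 1][k + 1];
        return diag;
    }
}

// On the example below (one character standing in for one word), this
// produces exactly the distance sequence 000001233333456666789.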

But there are problems with this algorithm.

Say for example I am comparing two strings (where each character in
the string represents some word):

string x: aaaaabcdeeeefghiiipqr
string y: aaaaajkleeeemnoiiituv

The Levenshtein distances are:

000001233333456666789

The frequency of each Levenshtein distance is:

dist, freq, furthest position (positions 0-indexed)
0,5,4
1,1,5
2,1,6
3,5,11
4,1,12
5,1,13
6,4,17
7,1,18
8,1,19
9,1,20

This algorithm would see that the highest frequency was 5, that the
corresponding distance was 3 (distance 0 also occurs five times, but
its furthest position, 4, is earlier), and that the furthest position
was 11. So it would return the substring x[0,11], or "aaaaabcdeeee".
But really we would like the substring "aaaaabcdeeeefghiii".
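
// A Java sketch of this first selection rule (hypothetical names; ties
// on frequency are broken toward the furthest position, which is how
// distance 3 beats distance 0 here even though both occur five times):

class MostFrequentCut {
    // dists[k] = Levenshtein distance of the length-(k+1) prefixes.
    // Returns the prefix of s ending at the furthest position of the
    // most frequently occurring distance value.
    static String cut(String s, int[] dists) {
        java.util.Map<Integer, Integer> freq = new java.util.HashMap<>();
        java.util.Map<Integer, Integer> last = new java.util.HashMap<>();
        for (int k = 0; k < dists.length; k++) {
            freq.merge(dists[k], 1, Integer::sum);
            last.put(dists[k], k);
        }
        int best = dists[0];
        for (int v : freq.keySet()) {
            if (freq.get(v) > freq.get(best)
                    || (freq.get(v).equals(freq.get(best))
                        && last.get(v) > last.get(best))) {
                best = v;
            }
        }
        return s.substring(0, last.get(best) + 1);
    }
}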

So what's a different approach? Maybe we could find the furthest
position where the Levenshtein distance is repeated more than once.

Pull a list of all the domains I've crawled from the database.
For each domain in the result set, pull all of the corresponding
articles for that domain.
.For each article in our result set
..For each subsequent article in our result set
...For every substring starting at the beginning of each document
....Generate and store the Levenshtein distance for each substring
...End for
...For every generated Levenshtein distance
....Find the furthest position where the Levenshtein distance is
repeated more than once
...End for
..End for
.End for
End for

Again, our strings are:

string x: aaaaabcdeeeefghiiipqr
string y: aaaaajkleeeemnoiiituv

The Levenshtein distances are:

000001233333456666789

The frequency of each Levenshtein distance is:

dist, freq, furthest position
0,5,4
1,1,5
2,1,6
3,5,11
4,1,12
5,1,13
6,4,17
7,1,18
8,1,19
9,1,20

...and so our algorithm would see that the distance measure of 6, with
a frequency of 4 and last position 17, sat at the furthest position
with a frequency greater than 1, so we'd return "aaaaabcdeeeefghiii".

But there are problems with this algorithm too.

Say our strings are:

string x: aaaaabcdeeeefghiiipqrstuvwwdefghi
string y: aaaaajkleeeemnoiiituvwxyzwwabcjkl

The Levenshtein distances are:

0 0 0 0 0 1 2 3 3 3 3 3 4 5 6 6 6 6 7 8 9 10 11 12 13 13 13 14 15 16 17 18 19

This algorithm would see that the prefix ending in "ww" (that is,
"aaaaabcdeeeefghiiipqrstuvww") was the furthest position in the string
whose distance measure has a frequency greater than 1, and would
return it as the result. But as you can see there is a whole lot of
noise in that string. What we really want is the string
"aaaaabcdeeeefghiii". It's the longest similar string with the least
amount of noise.

So how do we write an algorithm for that?

My first thought is that maybe we could use some relative measure of
the Levenshtein distance. In other words, we want the longest string
where the Levenshtein distance does not vary by more than a certain
amount from the last most frequent distance measure, where the amount
it may vary is determined by some ratio of the distance measure to its
position in the string.
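
// One way to make that concrete (my own sketch, not an established
// algorithm): keep the "distance must repeat" condition from the
// second rule, but also require the distance to be small relative to
// the prefix length, i.e. dists[k] / (k + 1) <= maxNoise, and take the
// furthest position satisfying both. On the second example, distance 6
// at position 17 has a noise ratio of 6/18, about 0.33, while the
// repeated distance at the "ww" positions sits at roughly 0.5, so a
// cutoff of 0.4 selects "aaaaabcdeeeefghiii". The 0.4 is a made-up
// tuning knob, not a principled constant.

class LowNoiseCut {
    static String cut(String s, int[] dists, double maxNoise) {
        java.util.Map<Integer, Integer> freq = new java.util.HashMap<>();
        for (int d : dists) freq.merge(d, 1, Integer::sum);
        for (int k = dists.length - 1; k >= 0; k--) {
            boolean repeated = freq.get(dists[k]) > 1;
            boolean lowNoise = dists[k] / (double) (k + 1) <= maxNoise;
            if (repeated && lowNoise) return s.substring(0, k + 1);
        }
        return "";
    }
}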

This is the point where I don't know how to proceed. Does any of this
make sense? I can't seem to find any established algorithms for fuzzy
or approximate string matching using Levenshtein distances, but I'm
sure someone somewhere has done it before and come up with an optimal
solution for this. Does anyone have any familiarity with this problem?
What is the preferred algorithm for this?

Any thoughts would be appreciated.

This was long. Hope it makes sense.

-David

```