Loose String Matching (Levenshtein for Redshift?)
Perhaps a bit tangential to text mining, but has anyone found a way to efficiently do string analysis in Redshift? Ideally something like levenshtein() or levenshtein_less_distance() in postgres to return string similarity.
There must be a better way than
where left(frequently_misspelled_text_field, [number]) || '%' ilike left(good_text_field,[number]) || '%'
which is inefficient and not robust.
This came about previously when I was trending patient data on a state and city basis and had to deal with human-entered messes such as 'calafornia' , 'Yexas' (which wouldn't be caught unless I did the above using right() as well), 'NewYork', 'Massachusets', etc.
Hi Alok Subbarao -- I bet you could do this with Python UDFs on the cluster.
Here are the docs: http://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html
And a related blog post: https://www.periscopedata.com/blog/redshift-user-defined-functions-python.htmlReply
Here's what I did today. Found this thread while researching this exact topic for fuzzy string matching.
create function public.levenshtein(s character varying, t character varying ) returns integer stable language plpythonu as $$ """ levenshtein(s, t) -> ldist ldist is the Levenshtein distance between the strings s and t. For all i and j, dist[i,j] will contain the Levenshtein distance between the first i characters of s and the first j characters of t """ if s is None and t is None: return 0 if s is None: return len(t) if t is None: return len(s) rows = len(s)+1 cols = len(t)+1 dist = [[0 for x in range(cols)] for x in range(rows)] # source prefixes can be transformed into empty strings # by deletions: for i in range(1, rows): dist[i] = i # target prefixes can be created from an empty source string # by inserting the characters for i in range(1, cols): dist[i] = i for col in range(1, cols): for row in range(1, rows): if s[row-1] == t[col-1]: cost = 0 else: cost = 1 dist[row][col] = min(dist[row-1][col] + 1, # deletion dist[row][col-1] + 1, # insertion dist[row-1][col-1] + cost) # substitution return dist[rows-1][cols-1] $$ ;
To test: select public.levenshtein('walk', 'cake');Reply