Loose String Matching (Levenshtein for Redshift?)

Perhaps a bit tangential to text mining, but has anyone found a way to efficiently do string analysis in Redshift?  Ideally something like levenshtein() or levenshtein_less_distance() in postgres to return string similarity.

There must be a better way than 

where left(frequently_misspelled_text_field, [number]) || '%' ilike left(good_text_field,[number]) || '%'

which is inefficient and not robust. 

This came about previously when I was trending patient data on a state and city basis and had to deal with human-entered messes such as 'calafornia' , 'Yexas' (which wouldn't be caught unless I did the above using right() as well), 'NewYork', 'Massachusets', etc. 

3replies Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • Hi Alok Subbarao  -- I bet you could do this with Python UDFs on the cluster.

    Here are the docs: http://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html

     

    And a related blog post: https://www.periscopedata.com/blog/redshift-user-defined-functions-python.html

    Reply Like 1
    • Tom O'Neill Thanks! 

      Reply Like
  • Here's what I did today. Found this thread while researching this exact topic for fuzzy string matching.

    create function public.levenshtein(s character varying, t character varying ) returns integer
    stable
    language plpythonu
    as $$
        """
        levenshtein(s, t) -> ldist
        ldist is the Levenshtein distance between the strings
        s and t.
        For all i and j, dist[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
        """
        if s is None and t is None:
            return 0
        if s is None:
            return len(t)
        if t is None:
            return len(s)
    
        rows = len(s)+1
        cols = len(t)+1
        dist = [[0 for x in range(cols)] for x in range(rows)]
        # source prefixes can be transformed into empty strings
        # by deletions:
        for i in range(1, rows):
            dist[i][0] = i
        # target prefixes can be created from an empty source string
        # by inserting the characters
        for i in range(1, cols):
            dist[0][i] = i
    
        for col in range(1, cols):
            for row in range(1, rows):
                if s[row-1] == t[col-1]:
                    cost = 0
                else:
                    cost = 1
                dist[row][col] = min(dist[row-1][col] + 1,      # deletion
                                     dist[row][col-1] + 1,      # insertion
                                     dist[row-1][col-1] + cost) # substitution
    
        return dist[rows-1][cols-1]
    $$
    ;
    

    To test: select public.levenshtein('walk', 'cake');

    Reply Like 1
Like1 Follow
  • 1 Likes
  • 1 yr agoLast active
  • 3Replies
  • 2146Views
  • 4 Following