Text Mining with SQL

One of the recent questions I had to answer focused on analyzing text data. How have you solved this problem in the past? I was not sure how to create an optimized solution for both efficiency and completeness. My solution I came up with used a join condition that had the following query structure:

select
  text_data.id
  , text_data.text
  , keywords.keyword
from
  text_data
  join keywords on
    text_data.text like concat('%', keywords.keyword, '%')

This seemed like a brute force method, but it worked 😜

6replies Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • Redshift natively supports a number of regex functions, although I found regex difficult to pick up. That said, regex can be very useful, and would probably make a great blog post in the future. 

    http://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html

    Reply Like
  • Yeah, regex is definitely one of those things that feels like you are learning it for the first time every time you use it :P

    Britton likes his solution because you can search for an arbitrary number of keywords, which is awesome for some use cases. But with Regex, you can do much more complicated things!

     

    Reply Like
  • Agreed with Taylor's suggestion of using Regex functions.  I've also found a ton of value in tsvectors.  You can remove stop words to do simple analysis like what are the most commonly used words and how many posts do they appear in/how many total time does the word appear. 

    https://www.postgresql.org/docs/8.3/static/textsearch-features.html

    Once you get the Syntax and use cases of tsvectors down you can also use them much in the way you would ilike & like, but they are a more clean, uniform way to do it in my experience.  More on text search here:

    https://www.postgresql.org/docs/9.5/static/textsearch-controls.html

    Reply Like
    • Zach Cooper 

       

      I agree with RedShift Regex solution, but Redshift is more limited than the native sql database . I like first solution because the search is dynamic and can be parametrized the search to do.

      Reply Like
  • I'm not sure how closely this addresses your need because I'm not sure if you want your search to be a fuzzy match or not, but for exact match on string anywhere in string I tend to join on POSITION. I'm often searching for keywords in multiple fields, so it would look something like this: 

    -- keep the reference fields if you want to look at a sample result set to see if you're getting false positives

    SELECT keywords, string_field1, string_field2
    FROM table_of_keywords kw
    LEFT JOIN  table_of_string_fields sf
    ON (POSITION(LOWER(kw.keywords) IN LOWER(sf.string_field1))
      OR  (POSITION(LOWER(kw.keywords) IN LOWER(sf.string_field2))
    
    
    Reply Like
  • To parrot everyone here regular expressions are a great solution. The syntax for them is very specific and a bit daunting. The best sandbox tool I have found for RegEx is RegEx101. You can enter large amounts of text and see if your patterns are right. Make sure to use the PCRE/PHP syntax. Also, check the anchors which can help you find patterns.

     

    If you have a more contextualized use case happy to jam through it.

    Reply Like
Like2 Follow
  • 2 Likes
  • 1 yr agoLast active
  • 6Replies
  • 1428Views
  • 9 Following