Text Mining with SQL
One of the recent questions I had to answer focused on analyzing text data. How have you solved this problem in the past? I was not sure how to create an optimized solution for both efficiency and completeness. My solution I came up with used a join condition that had the following query structure:
select text_data.id , text_data.text , keywords.keyword from text_data join keywords on text_data.text like concat('%', keywords.keyword, '%')
This seemed like a brute force method, but it worked 😜
Redshift natively supports a number of regex functions, although I found regex difficult to pick up. That said, regex can be very useful, and would probably make a great blog post in the future.Reply
Yeah, regex is definitely one of those things that feels like you are learning it for the first time every time you use it :P
Britton likes his solution because you can search for an arbitrary number of keywords, which is awesome for some use cases. But with Regex, you can do much more complicated things!
Agreed with Taylor's suggestion of using Regex functions. I've also found a ton of value in tsvectors. You can remove stop words to do simple analysis like what are the most commonly used words and how many posts do they appear in/how many total time does the word appear.
Once you get the Syntax and use cases of tsvectors down you can also use them much in the way you would ilike & like, but they are a more clean, uniform way to do it in my experience. More on text search here:Reply
I'm not sure how closely this addresses your need because I'm not sure if you want your search to be a fuzzy match or not, but for exact match on string anywhere in string I tend to join on POSITION. I'm often searching for keywords in multiple fields, so it would look something like this:
-- keep the reference fields if you want to look at a sample result set to see if you're getting false positives
SELECT keywords, string_field1, string_field2 FROM table_of_keywords kw LEFT JOIN table_of_string_fields sf ON (POSITION(LOWER(kw.keywords) IN LOWER(sf.string_field1)) OR (POSITION(LOWER(kw.keywords) IN LOWER(sf.string_field2))Reply
To parrot everyone here regular expressions are a great solution. The syntax for them is very specific and a bit daunting. The best sandbox tool I have found for RegEx is RegEx101. You can enter large amounts of text and see if your patterns are right. Make sure to use the PCRE/PHP syntax. Also, check the anchors which can help you find patterns.
If you have a more contextualized use case happy to jam through it.Reply