Difference between row_number(), rank(), and dense_rank()
Found a great article that reviews the differences between these 3 useful window functions. I've adapted the theoretical data to provide a more practical (albeit simplified) business context.
All coding credit goes to the authors of the article. Highly recommend checking it out for a more comprehensive explanation of each function!
Suppose you have a list of customers and their respective monthly purchases, like so:
customer_id mon_purch 1 200 2 150 3 200 4 300 5 100 6 400 7 200 8 300
You've been tasked with ranking these customers so that Marketing can prioritize their outreach effectively (e.g. varying offers for future purchases based on last month's purchases).
You could simply write a query to order by mon_purch desc, customer_id and send it to Marketing:
customer_id mon_purch 6 400 4 300 8 300 1 200 3 200 7 200 2 150 5 100
But that solution is lacking for a few reasons.
Firstly, Marketing doesn't have the ability to slice the output based on condition (e.g. 'Let's send an offer to our 20th-50th best customers').
Secondly, as seen above, some customers have made identical purchases last month. You might need a way to handle ties.
Enter row_number(), rank(), and dense_rank()!
Here's how you could rank your customers.
select customer_id , mon_purch , row_number() over (mon_purch desc, customer_id) , rank() over (mon_purch desc, customer_id) , dense_rank() over (mon_purch desc, customer_id) from customer_purchases order by 2 desc, 3
customer_id mon_purch row_number rank dense_rank 6 400 1 1 1 4 300 2 2 2 8 300 3 2 2 1 200 4 4 3 3 200 5 4 3 7 200 6 4 3 2 150 7 7 4 5 100 8 8 5
- Marketing does not consider ties. In fact, assuming your customer IDs are assigned in ascending chronological order, this suggests that Marketing actually favours older customers.
- Tip: if you wanted to favour younger customers, add , 1 desc to your order by clause
- Marketing now treats ties as equal but ranks will be skipped with each change in mon_purch (e.g. there is no rank 3 in the table above). The final rank will be equal to the number of customers in the table.
- Caution: the distribution of data can affect this method. For example, a set where 1 customer spent $400, 13 customers spent $375, and 1 spent $350 would have ranks 1,2, and 15, respectively. You'd have to consider if the customer ranked 15 is actually significantly different from customers ranked 2 when the difference between 2 and 15 ($25) is the same as the difference between 1 and 2 ($25).
- Marketing treats ties as equal but ranks are not skipped with each change in mon_purch. The final rank will depend on the size of the data set and the number of tie breaks.
- This method generally works better with tighter distributions. Always a good idea to explore the data visually first before choosing a ranking method!
So there you have it! Of course, this is just a simplified example. There are many other factors and methods to consider when ranking customers.
Other factors might include customer age, location, and order history.
Other methods for prioritization might include ntile() for binning or lag() to calculate percentage change with a case statement to assign custom rankings based on the outcome. Perhaps in a future post 🙂 .