Pivoting large data sets with dynamic column names
I frequently get requests that require me to reshape summary data so that months are presented as columns. In most cases, I also I need the dates/months included to be dynamic, such as a rolling twelve month window.
Before the R/Python functionality was added, I'd typically use case statements to do this and I'd have to manually update my case statements each month (or when someone let me know that it was a few months out of date). Once R/Python was added I started using R to pivot the sql results similar to what's shown in this post, which works great in most cases: https://community.periscopedata.com/t/q5gk76/pivoting-and-melting-dataframes
Recently I ran into a problem where the initial SQL output that I needed to pivot was too large for Periscope to display, so I couldn't apply the typical solution. I wanted to share the workaround I ended up using to get past this issue.
I've generalized the query and R script to use the gameplay and user tables. In this example, the goal output is a single row per user with some identifying information about the user as well as a monthly count of game plays for a three month period. In my real life case, this three month period would be rolling based on the current date and would need to be dynamic.
with t1 as ( -- This CTE would typically be the entire query I would be using -- And then would pivot this output with R select user_id , users.first_name , users.last_name , users.email , users.platform , users.source , [gameplays.created_at:month] as month , count(gameplays.*) as gameplays from gameplays left join users on gameplays.user_id = users.id where [gameplays.created_at:month] > [dateadd(month,-3,'2016-12-01'):month] -- if rolling based on current date: -- [gameplays.created_at:month] > [dateadd(month, -3, now()):month] and user_id < 580000 group by 1,2,3,4,5,6,7) , months as ( -- Assign a row number to each of the months select row_number() over (order by month) as row_num , month from ( select distinct month from t1)) select t1.user_id , t1.first_name , t1.last_name , t1.email , t1.platform , t1.source , max(case when row_num=1 then gameplays end) as "1" , max(case when row_num=2 then gameplays end) as "2" , max(case when row_num=2 then gameplays end) as "3" -- these two rows are added to be used in R script will be removed from output , max(case when row_num=1 then months.month end) as starting_date , max(row_num) as num_rows from t1 left join months on t1.month = months.month group by 1,2,3,4,5,6
In the SQL example above, the query in the t1 CTE is similar to what I would normally have as my entire SQL query. However, if you were to run just that part of the query, you'll get a "Result set too large" error message. So I'm going to go back to writing case statements to pivot the months to columns and reduce the number of rows. But I still want it to be dynamic, so that I don't have to manually update the case statements each month.
To do this, I added an additional CTE (months) that assigns a row number to each of the months. Then I join this with the first CTE t1. Now when I write my case statements, I can use the generic row_num as opposed to the month. The sql output at this point looks something like this.
Now I can use R to rename those columns with the correct date/name. In the SQL query, I included the fields starting_date and num_rows to be used by the R script to make sure that we're creating the correct names. It uses those output fields to create a vector with the correctly formatted months as text to be used for column names, and then renames the columns from "1", "2", "3" to "2016-10-01", "2016-11-01", "2016-12-01" respectively. Once I drop the starting_date and num_rows columns, I have the desired output!
library(lubridate) # define the number of column months that need to be renamed # uses the num_rows field from sql output num_months <- max(df$num_rows, na.rm=TRUE) # start date is defined based on the starting_date field from sql output # end date is calculated based on start_date and num_months start_date <- as.Date(min(df$starting_date, na.rm=TRUE)) end_date <- start_date + months(num_months-1) # create a vector with the months to use as names months <- as.character(ymd(seq(start_date, end_date, "months"))) # remove the num_rows and starting_date fields from output df$num_rows <- NULL df$starting_date <- NULL # rename columns of original dataframe with the dates formed above colnames(df)[colnames(df) %in% seq(1:num_months)] <- months[1:num_months] # return dataframe periscope.table(df)
Hope this helps if you find yourself running into the same issue.