Percentile Curves - Another way to show distributions

Note: This solution requires Periscope's Python/R integration

Let's say you have data set that is heavily skewed, like this hypothetical data set here plotting the amount users paid on a fictional gaming app.

 As we can see here, a bar chart isn't a great way to visualize the data. Most users fall into the 99 cent bucket, so it is hard to see how many users fall into the other smaller buckets skewed to the right of the chart.

There's a couple ways we can re-display the data.

  1. Via a histogram that buckets together certain ranges. But let's say we don't want to lose any granularity. Hop on down to option (2)
  2. Plotting the percentile on the x axis, and the value that corresponds to the percentile on the y axis

Option (2) would look something like this:

Now, we can easily see that well over half of users have paid 0.99 cents, and the higher paying customers make the top 20% of the data.

Here's the Python 3.6 code on how to get that information. Assume the information with the user ID and the amount they paid is stored in a dataframe called df.

# SQL output is imported as a pandas dataframe variable called "df"
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import scoreatpercentile

a=list(range(1,101))

b = [scoreatpercentile(df["amt_paid"],i) for i in a]

df2 = pd.DataFrame({'percentile': a, 'value': b}, columns=['percentile', 'value'])

# Use Periscope to visualize a dataframe, text, or an image by passing data to periscope.table(), periscope.text(), or periscope.image() respectively.
periscope.table(df2)
3replies Oldest first
  • Oldest first
  • Newest first
  • Active threads
  • Popular
  • Hi Neha Kumar ,

    thanks for the very useful tutorial!

    Could you please elaborate on:

    Assume the information with the user ID and the amount they paid is stored in a dataframe called df.

    What does the SQL Output that constitutes the df look like, in terms of structure?

    I'm getting results that don't quite look right (values are too high and linear) with my own data.

    Reply Like
      • Neha Kumar
      • Solutions Engineer
      • Neha_Kumar
      • 3 mths ago
      • 1
      • Reported - view

      Tom Nemec Of course, here's the SQL output to create df in the script above. The base table was a record of all transactions made by each user_id. To create this SQL output, I queried 

      select user_id, sum(price) as amt_paid from purchases group by 1

       

      Hope this helps!

      Reply Like 1
      • Tom Nemec
      • Tom_Nemec
      • 3 mths ago
      • 1
      • Reported - view

      Perfect, thanks. I was under the mistaken impression that I would need to pre-aggregate the base table. Now everything is working as expected.

      Reply Like 1
Like2 Follow
  • 2 Likes
  • 2 mths agoLast active
  • 3Replies
  • 1056Views
  • 2 Following