Adjusting ratings based on raters' tendencies

In our system, we have students and teachers. Some students retain well and are satisfied with our service but perpetually give their lessons low ratings (1 or 2 on a 5-point scale).

We'd like to reward teachers who receive high ratings, but we also want to control for cases where a student consistently rates their lessons 1 or 2 out of 5, dragging down the teacher's average (and median) lesson rating.

Does anyone know how this is usually handled?

8 replies
  • Max Fridman Have you thought about rewarding for "points above median rating" instead of absolute points? Over time you can get better at predicting what a student should rate a class and reward "points above expected rating".

    These approaches are nice in the abstract, but you'll have the challenge of explaining them to teachers. If that's a big challenge for you, you can do something more naive but easier to explain. For example, take a global average of class ratings and reward teachers whose ratings beat the average (rough sketch below). Or make a leaderboard and reward the top teacher each month.

    Finally, all of these approaches incentivize gaming student ratings, but I'm guessing you know more about that than I do. 😉
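
    For the naive "beat the global average" version, a minimal sketch would look something like this, assuming a table of individual lesson ratings with teacher and score columns (all the names here are placeholders):

    select
      teacher
      , avg(score) as avg_rating
      -- the scalar subquery computes the single global average once
      , avg(score) - (select avg(score) from [table_of_scores]) as points_above_global_avg
    from [table_of_scores]
    group by teacher
    order by points_above_global_avg desc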

    • Harry Glaser Thanks for the answer!

      I've thought about doing something to that effect but have run into some issues with the execution. All, or almost all, of our teachers have average/median ratings between 4 and 5 each month. As such, using something like what you suggested creates a binary system for most ratings, where a rating of 5 is good since it is above the average/median and a 4 is bad since it is below. That might not be a bad thing, but it felt wrong when I reviewed the results. I'll try to revisit this approach and see if I can get better results.

      Generally speaking, I'm more concerned with rewarding teachers who actually deserve it than with having an easy-to-explain system. And, even if we put aside the incentive discussion, I'd like to know operationally which teachers are our best performers and aren't just getting lucky with which students book them.

    • Max Fridman "5 is good and 4 is bad" basically describes Uber's and Lyft's ratings systems, and they seem to work pretty well. I'd go with that or a simple stack rank if you're looking to motivate teachers. 

      If you're looking for your own operational metric, I might go with "basis points above predicted rating", where predicted rating is the rating you'd expect from that student mix in a baseline class (rough sketch below). It's a bit inspired by the "wins above replacement player" stat that's used by pro basketball teams.

      Let us know what you go with! Very curious to hear how this works out.
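
      The sketch I have in mind treats each student's own historical average as the "predicted" rating. The table and column names are placeholders, and you'd probably want to exclude the teacher being scored from each student's baseline, which I've skipped for brevity:

      -- each student's average across all their lessons is the "expected" rating
      with student_baseline as (
        select student, avg(score) as expected_score
        from [table_of_scores]
        group by student
      )
      select
        t.teacher
        , avg(t.score - b.expected_score) as avg_points_above_expected
      from [table_of_scores] t
      join student_baseline b on b.student = t.student
      group by t.teacher
      order by avg_points_above_expected desc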

  • One thing you could try is to create a ratio score similar to net promoter score: treat a 5 as a positive, a 4 as a neutral, and a 1-3 as a negative. Ratios tend to smooth out idiosyncratic results a bit, but they are also hard to explain.

    -- +1 for a 5, 0 for a 4, -1 for a 1-3; the 1.0 avoids integer division
    select
      teacher
      , sum(case when score = 5 then 1
                 when score = 4 then 0
                 else -1 end) * 1.0 / count(*) as teacher_promoter_ratio
    from [table_of_scores]
    group by teacher

    Another method would be to filter out repeat low scorers' ratings, perhaps with something like the following query, which filters out any student with more than 3 reviews and an average review score of 2 or less. This feels more arbitrary, but it is reasonable and very easy to explain: "we filter out serial low scorers."

    select
      teacher
      , avg(score) as avg_score
    from [table_of_scores]
    where student in (
      -- keep only students who are not serial low scorers
      select student
      from [table_of_scores]
      group by student
      having count(*) <= 3 or avg(score) > 2
    )
    group by teacher
  • First instinct is to normalize each student against themselves, and then calculate ratings for teachers after that (rough sketch at the end of this reply)? I'm imagining a student who only rates 1s and 2s, with their "2s" being effectively equivalent to another student's 5s. That doesn't seem exactly right, though.

    Check out this blog post. I'm not sure it fits your needs either, but I could imagine a solid implementation of this getting you closer to the truth of how good teachers _actually_ are.

    https://www.evanmiller.org/how-not-to-sort-by-average-rating.html

    To the point about "dragging down" ratings, it sort of depends on who the rating is for. If it's for prospective students, and that population of prospective students contains an average share of "1s and 2s" raters, then a simple average + histogram will give them a sense of their true propensity to also rate it a 1 or a 2. If you're looking for a fair way to rank / compensate teachers, then it seems like, across a high enough sample size, the 1s-and-2s raters would balance out. If the problem is low sample size and confidence in early ratings, then I think the blog post above might work for you.
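
    Here's the rough sketch of the normalization idea, reusing the [table_of_scores] placeholder from the queries above. One wrinkle: a student who gives every lesson the same score has zero variance, so their ratings drop out of the normalized average entirely (the nullif guards that division):

    with student_stats as (
      select
        student
        , avg(score) as mean_score
        , stddev(score) as sd_score
      from [table_of_scores]
      group by student
    )
    select
      t.teacher
      -- each rating is expressed in standard deviations above that student's own mean
      , avg((t.score - s.mean_score) / nullif(s.sd_score, 0)) as avg_normalized_score
    from [table_of_scores] t
    join student_stats s on s.student = t.student
    group by t.teacher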

  • It seems to me that a rating given by a student does not reflect their likelihood to retain or be satisfied with the service.

    To answer your question though, if we assume the student rating distribution to be roughly normal, you can evaluate the average rating per student and look at how many standard deviations it sits from the overall mean to spot the habitual low (or high) raters.
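
    A minimal sketch of that, using the same [table_of_scores] placeholder as the queries above and taking the population of per-student averages as the reference distribution:

    with student_avgs as (
      select student, avg(score) as avg_score
      from [table_of_scores]
      group by student
    )
    select
      student
      , avg_score
      -- distance from the average student, in standard deviations (window over all students)
      , (avg_score - avg(avg_score) over ()) / stddev(avg_score) over () as z_score
    from student_avgs
    order by z_score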

  • For dealing with students who habitually rate low, I wonder if you could create rater scores and down-weight those with consistently low scores, or low average scores with low variability in scores. If your intent is to make scores count more for those who appear to rate honestly and thoughtfully, then weighting on something like variability is probably a good idea and will catch both the students who always rate low and those who always rate high.
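
    A very rough sketch of the variability weighting, with the same placeholder table and column names as above (the +0.1 floor is arbitrary; it just keeps zero-variance raters from being dropped entirely):

    with rater_weights as (
      select
        student
        -- students whose ratings barely vary get less weight
        , coalesce(stddev(score), 0) + 0.1 as weight
      from [table_of_scores]
      group by student
    )
    select
      t.teacher
      , sum(t.score * w.weight) / sum(w.weight) as weighted_avg_score
    from [table_of_scores] t
    join rater_weights w on w.student = t.student
    group by t.teacher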

    • Datagub this was my thought as well. In addition to a teacher's (raw) mean or median rating, a secondary adjusted ranking that keeps track of each student's ratings would be useful. If, for example, a student's mean rating is below 2, you can underweight their impact on the teacher score.

      A student with a mode rating of 1 (and with nothing entered into the 'free text' field these surveys generally contain, since we don't want to screen out students who give 1s but have legitimate grievances) should almost certainly have their ratings underweighted.
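
      For the mean-below-2 part of that rule, the weight table from the sketch above could just be swapped for a simple threshold version (the 0.25 weight and the threshold are arbitrary placeholders to tune; the mode-of-1 and free-text checks would need extra columns):

      select
        student
        -- serial low raters count for a quarter of a normal rating
        , case when avg(score) < 2 then 0.25 else 1.0 end as weight
      from [table_of_scores]
      group by student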
