Monday, November 25, 2019

What’s Wrong with the AIS Awards System


by Tom Waters

One of the most important functions of the American Iris Society (AIS) is to carefully evaluate new irises as they grow in gardens and decide which are worthy of commendation and can be recommended to the gardening public. This is done through a system of trained garden judges working in all geographical regions, who evaluate the irises and vote them awards.

I’ve been growing irises on and off since the 1970s, and served as a judge for many years. There have always been grumblings about the award system, from simple shaking of the head (“What were the judges thinking?”) to tales of secret regional cabals working to subvert the process. I’ve not taken much heed of such complaints, attributing them to a combination of sour grapes and the ubiquitous human inclination to complain and gossip. Although there are exceptions, I’m sure, judges I have known personally have all been honest, conscientious, and reasonably skilled and knowledgeable. They do their very best to vote for irises they deem truly worthy of recognition.

Nevertheless, I think there is a fundamental structural problem with the process of voting for AIS awards that keeps some good irises from being recognized and elevates some mediocre ones to unearned fame.

The awards system asks judges to vote following the model of a political election: an assortment of eligible candidates is placed on the ballot, and the judges are to vote for the one(s) they deem best. For this system to identify the best irises, judges need to be familiar with all or most of the candidates on the ballot. The rules state that you should not vote for an iris unless you have seen it growing in a garden (or gardens) over more than one year. Ideally, the judges should grow the irises themselves. In practice, the ideal of judges intimately familiar with all the candidates is rarely met. Often, judges have seen only a smattering of the eligible irises (particularly for early awards, such as honorable mention). They may select the best of those they are familiar with, but if they are only familiar with 10%, what of the other 90%?

When there are many names on the ballot, but only a few are actually seen and evaluated by the judges, the system is very vulnerable to a particular sort of bias. Not an intentional bias on the part of judges, but a systemic bias built into the process: the more widely grown an iris is, the more likely it is to win awards.

Consider this hypothetical. Assume there are about 400 judges voting. Iris A is bred by a famous hybridizer that many iris growers order from. It is thus widely distributed and widely grown. 350 of those judges have seen it growing in a garden. It is a nice iris, but only 10% of the judges who have seen it think it should win the award. 10% is still 35 judges! Now consider iris B, introduced through a smaller iris garden that sells only a few irises each year. Maybe only 20 judges grow iris B. But iris B is extraordinary! It is so good in every way that 90% of the judges who grow it think it should win the award! But 90% of 20 judges is just 18, so iris B gets only about half the votes of iris A, although it is clearly a much better iris.
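For readers who like to check the arithmetic, here is a minimal Python sketch of the hypothetical above. The function name `expected_votes` is made up for illustration, and the numbers are the invented ones from the example, not real ballot data.

```python
def expected_votes(judges_familiar: int, approval_share: float) -> int:
    """Votes an iris gets when only judges who know it can vote for it."""
    return round(judges_familiar * approval_share)


# Iris A: widely grown, but few of the judges who know it favor it.
votes_a = expected_votes(judges_familiar=350, approval_share=0.10)

# Iris B: rarely grown, but nearly every judge who knows it favors it.
votes_b = expected_votes(judges_familiar=20, approval_share=0.90)

print(f"Iris A: {votes_a} votes, Iris B: {votes_b} votes")
print(f"Iris B receives {votes_b / votes_a:.0%} of iris A's votes")
```

The vote count is the product of two factors, and in this example the distribution factor (350 versus 20) swamps the quality factor (10% versus 90%).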

Note that this undesirable result is not a consequence of anyone making bad choices, being unethical, or doing anything wrong. The hybridizers, growers, and judges are all doing their best; it’s just the way the numbers play out.

Another way to look at this phenomenon is to consider the meaning of a judge voting for an iris or not voting for an iris. Clearly, a vote for an iris means the judge thought it was the best among those seen. But what does a judge not voting for an iris mean? It can mean two very different things: it can mean the judge has evaluated the iris and found it wanting, or it can simply mean the judge has not seen the iris. Treating these two circumstances the same is a mistake, because the vote count alone cannot tell them apart.

In 2019, 378 judges voted for the Dykes Medal, and the iris that won received only 29 votes. That’s less than 8%. This is nothing new; it is typical of recent years. What does that mean? It is difficult for the public to be confident that this is the best iris of the year, when we don’t know what the other 349 judges thought of it. Did they love it, but just slightly preferred another iris over it? Did they think it was bad? Did they just not see it? Such ambiguous results are a direct consequence of using an election model with a long list of candidates, many of which are not familiar to most of the judges.

There is a way to address this structural bias. If we moved from an election model to a rating model, we could much more accurately identify the worthiest irises. A rating model is what is commonly used for reviews of products, businesses, restaurants, and so on. Everyone who is familiar with the product gives it a rating, and the average of those ratings is what helps future consumers decide whether the product is worthy or not.

How would a rating system for irises work? It would not have to be as elaborate as the 100-point scoring systems presented in the judges’ handbook. A rating from 1-10 would do just fine, or even a scale of 1-5 stars, like you often see in other product ratings.

Consider our two hypothetical irises again. Assume that judges who vote the iris worthy of the award rate it at 5 stars, and those who have seen it but do not vote for it rate at 3 stars. Iris A, which 350 have seen but only 10% vote for, would have an average rating of (315 x 3 + 35 x 5)/350 = 3.2. Iris B, which only 20 judges have seen but 90% vote for, would have an average rating of (2 x 3 + 18 x 5)/20 = 4.8. Iris B is the clear winner, as I think it should be.
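The same comparison can be written as a short Python sketch. It rests on the simplifying assumption stated above (5 stars from judges who would have voted for the iris, 3 stars from those who saw it but would not), and the helper name `average_rating` is purely illustrative.

```python
def average_rating(ratings: list[int]) -> float:
    """Mean of the star ratings entered by judges who evaluated the iris."""
    return sum(ratings) / len(ratings)


# Iris A: seen by 350 judges, 35 rate it 5 stars and 315 rate it 3 stars.
iris_a = [5] * 35 + [3] * 315

# Iris B: seen by 20 judges, 18 rate it 5 stars and 2 rate it 3 stars.
iris_b = [5] * 18 + [3] * 2

print(f"Iris A: {average_rating(iris_a):.1f} stars from {len(iris_a)} judges")
print(f"Iris B: {average_rating(iris_b):.1f} stars from {len(iris_b)} judges")
```

Judges who have not seen an iris simply contribute nothing to its list, so the number of judges appears only in the denominator of that iris's own average.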

In this system, judges would enter a rating for every iris they have evaluated. They would not have to pick the single best one to receive an award. They could rate any number of irises highly, and if they saw some with serious faults, they could give them low ratings, which would bring the average rating down and make it much less likely for these poorer irises to win awards, no matter how widely grown they are.

Judges would not enter a rating for irises they had not evaluated. So their not having seen it would not penalize the iris, since it would not affect its average rating at all. A non-rating (from not having seen the iris) would have a very different consequence from a low rating (the judge evaluated the iris and found it unworthy).

If such a system were implemented, some additional considerations would probably have to come into play. We might want the iris to be rated by some minimum number of judges before we would trust the average and give it an award, for example. We could also use this system to check for consistent performance in geographical areas, if that were deemed desirable. We could also demand a certain minimum average rating (say 4, perhaps), so that if no candidate iris were rated very highly, no award would be given.
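To make these safeguards concrete, here is one way they might be sketched in Python. Everything in it is an assumption for illustration: the function `pick_winner`, the minimum of 10 ratings, and the sample data are invented, and only the minimum average of 4 echoes the figure suggested above. It is not a proposal for actual AIS rules.

```python
from statistics import mean


def pick_winner(ratings_by_iris, min_ratings=10, min_average=4.0):
    """Return the best-rated eligible iris, or None if no iris qualifies.

    min_ratings and min_average are hypothetical thresholds.
    """
    eligible = {}
    for name, ratings in ratings_by_iris.items():
        if len(ratings) < min_ratings:
            continue  # too few judges have evaluated it to trust the average
        average = mean(ratings)
        if average >= min_average:
            eligible[name] = average
    if not eligible:
        return None  # no candidate rated highly enough, so no award is given
    # Ties go to whichever tied iris appears first; a real rule would need a tiebreak.
    return max(eligible, key=eligible.get)


ballot = {
    "Iris A": [5] * 35 + [3] * 315,  # widely seen, average 3.2
    "Iris B": [5] * 18 + [3] * 2,    # seen by 20 judges, average 4.8
    "Iris C": [5, 5, 5],             # perfect average, but only 3 ratings
}

winner = pick_winner(ballot)
without_b = pick_winner({k: v for k, v in ballot.items() if k != "Iris B"})
print(winner)
print(without_b)
```

In this made-up ballot, iris C is set aside for having too few ratings and iris A for falling below the minimum average, so without iris B no award would be given at all.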

Under the current system, I think the training and skill of the judges are largely wasted. They evaluate many irises over the course of the year, and form opinions about each one. That information is lost when they are instructed to simply vote for the best one. Every time a judge rates an iris favorably, its chance of receiving an award should go up; every time a judge rates an iris unfavorably, its chance should go down. Not being seen should not be a penalty.

A rating system would also encourage new hybridizers, as it would give us a way to recognize really exceptional irises that aren’t introduced through the big growers. It would allow hybridizers to build their reputation by receiving awards for quality work, rather than receiving awards because of an established reputation. Established hybridizers would not be much hurt by such a change; they still have the advantage of large, extended breeding programs and experience in recognizing quality seedlings. They don’t need the additional advantage of distribution bias to have a fair chance at awards.

I hope this post stimulates some discussion on the topic of our awards system and the consequences of structuring it as we have. I see the potential to improve the system in a way that makes it more fair to all new irises, more useful and credible with the gardening public, more supportive of new hybridizers, and more conscientious in reflecting the careful evaluation work of our judges.