by Tom Waters
One of the most important functions of the American Iris
Society (AIS) is to carefully evaluate new irises as they grow in gardens and
decide which are worthy of commendation and can be recommended to the gardening
public. This is done through a system of trained garden judges working in all
geographical regions, who evaluate the irises and vote them awards.
I’ve been growing irises on and off since the 1970s, and
served as a judge for many years. There have always been grumblings about the award
system, from simple shaking of the head (“What were the judges thinking?”) to
tales of secret regional cabals working to subvert the process. I’ve not taken
much heed of such complaints, attributing them to a combination of sour grapes
and the ubiquitous human inclination to complain and gossip. Although there are
exceptions, I’m sure, judges I have known personally have all been honest,
conscientious, and reasonably skilled and knowledgeable. They do their very
best to vote for irises they deem truly worthy of recognition.
Nevertheless, I think there is a fundamental structural problem with the process of
voting for AIS awards that keeps some good irises from being recognized and
elevates some mediocre ones to unearned fame.
The awards system asks judges to vote following the model of
a political election: a slate of eligible candidates is placed on the
ballot, and the judges vote for the one(s) they deem best. For this
system to identify the best irises, judges need to be familiar with all or most
of the candidates on the ballot. The rules state that you should not vote for
an iris unless you have seen it growing in a garden (or gardens) over more than
one year. Ideally, the judges should grow the irises themselves. The ideal of
judges intimately familiar with all the candidates is not usually met. Often,
judges have seen only a smattering of the eligible irises (particularly for
early awards, such as honorable mention). They may select the best of those
they are familiar with, but if they are only familiar with 10%, what of the
other 90%?
When there are many names on the ballot, but only a few are
actually seen and evaluated by the judges, the system is very vulnerable to a
particular sort of bias. Not an intentional bias on the part of judges, but a
systemic bias built into the process: the more widely grown an iris is, the
more likely it is to win awards.
Consider this hypothetical. Assume there are about 400
judges voting. Iris A is bred by a famous hybridizer that many iris growers
order from. It is thus widely distributed and widely grown. 350 of those judges
have seen it growing in a garden. It is a nice iris, but only 10% of the judges
who have seen it think it should win the award. 10% is still 35 judges! Now
consider iris B, introduced through a smaller iris garden that sells only a few
irises each year. Maybe only 20 judges grow iris B. But iris B is
extraordinary! It is so good in every way that 90% of the judges who grow it
think it should win the award! But 90% of 20 judges is just 18, so iris B gets
only about half the votes of iris A, although it is clearly a much better iris.
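The arithmetic behind this hypothetical is simple enough to sketch in a few lines of Python. The judge counts and approval percentages are the made-up figures from the example above, not real data:

```python
def votes(judges_who_saw_it, approval_percent):
    """Votes an iris receives under the election model:
    only judges who have seen the iris can vote for it."""
    return judges_who_saw_it * approval_percent // 100

# Iris A: widely grown, but only modestly admired.
iris_a = votes(350, 10)  # 35 votes

# Iris B: rarely grown, but admired by nearly everyone who grows it.
iris_b = votes(20, 90)   # 18 votes
```

Iris A wins almost twice the votes of iris B, purely because more judges have seen it.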
Note that this undesirable result is not a consequence of
anyone making bad choices, being unethical, or doing anything wrong. The hybridizers, growers, and judges are all doing their best; it’s just the way the numbers play
out.
Another way to look at this phenomenon is to consider the
meaning of a judge voting for an iris or not voting for an iris. Clearly, a
vote for an iris means the judge thought it was the best among those seen. But
what does a judge not voting for an
iris mean? It can mean two very different things: it can mean the judge has
evaluated the iris and found it wanting, or it can simply mean the judge has
not seen the iris. These are two very
different circumstances, and treating them the same is a very bad idea.
In 2019, 378 judges voted for the Dykes Medal, and the iris
that won received only 29 votes. That’s less than 8%. This is nothing new; it
is typical of recent years. What does that mean? It is difficult for the public
to be confident that this is the best iris of the year, when we don’t know what
the other 349 judges thought of it. Did they love it, but just slightly
preferred another iris over it? Did they think it was bad? Did they just not
see it? Such ambiguous results are a direct consequence of using an election
model with a long list of candidates, many of which are not familiar to most of
the judges.
There is a way to address this structural bias. If we moved
from an election model to a rating model, we could much more
accurately identify the worthiest irises. A rating model is what is commonly
used for reviews of products, businesses, restaurants, and so on. Everyone who
is familiar with the product gives it a rating, and the average of those
ratings is what helps future consumers decide whether the product is worthy or
not.
How would a rating system for irises work? It would not have
to be as elaborate as the 100-point scoring systems presented in the judges’
handbook. A rating from 1-10 would do just fine, or even a scale of 1-5 stars,
like you often see in other product ratings.
Consider our two hypothetical irises again. Assume that
judges who vote the iris worthy of the award rate it at 5 stars, and those who have seen it but do not vote for it rate it at 3 stars. Iris A, which 350 judges have seen but only 10%
vote for, would have an average rating of (315 x 3 + 35 x 5)/350 = 3.2. Iris B,
which only 20 judges have seen but 90% vote for, would have an average rating
of (2 x 3 + 18 x 5)/20 = 4.8. Iris B is the clear winner, as I think it should
be.
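The same calculation can be written as a small Python sketch, using the assumed 5-star (would vote for it) and 3-star (saw it, would not vote for it) ratings from the example:

```python
def average_rating(judges_who_saw_it, supporters, vote_star=5, seen_star=3):
    """Average star rating under the rating model. Judges who never
    saw the iris contribute nothing to the average."""
    others = judges_who_saw_it - supporters
    return (supporters * vote_star + others * seen_star) / judges_who_saw_it

iris_a = average_rating(350, 35)  # (35*5 + 315*3)/350 = 3.2
iris_b = average_rating(20, 18)   # (18*5 + 2*3)/20 = 4.8
```

Under the rating model the ranking flips: iris B comes out well ahead, even though far fewer judges have seen it.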
In this system, judges would enter a rating for every iris
they have evaluated. They would not have to pick the single best one to receive
an award. They could rate any number of irises highly, and if they saw some
with serious faults, they could give them low ratings, which would bring the
average rating down and make it much less likely for these poorer irises to win
awards, no matter how widely grown they are.
Judges would not enter a rating for irises they had not
evaluated. So their not having seen it would not penalize the iris, since it
would not affect its average rating at all. A non-rating (from not having seen
the iris) would have a very different consequence from a low rating (the judge
evaluated the iris and found it unworthy).
If such a system were implemented, some additional
considerations would probably have to come into play. We might want the iris to
be rated by some minimum number of judges before we would trust the average and
give it an award, for example. We could also use this system to check for
consistent performance in geographical areas, if that were deemed desirable. We
could also demand a certain minimum average rating (say 4, perhaps), so that if
no candidate iris were rated very highly, no award would be given.
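Those safeguards could be expressed as a simple eligibility check. The minimum average of 4 comes from the suggestion above; the minimum of 10 ratings is an illustrative guess of mine, not a figure from the text:

```python
MIN_RATINGS = 10   # assumed: require this many judges before trusting the average
MIN_AVERAGE = 4.0  # suggested in the text as a possible floor for an award

def award_eligible(ratings):
    """ratings: star ratings (1-5) from judges who actually evaluated
    the iris. An iris with too few ratings, or too low an average,
    would receive no award."""
    if len(ratings) < MIN_RATINGS:
        return False
    return sum(ratings) / len(ratings) >= MIN_AVERAGE
```

With these thresholds, iris B from the earlier example (average 4.8 from 20 judges) would qualify, while iris A (average 3.2 from 350 judges) would not, no matter how widely it is grown.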
Under the current system, I think the training and skill of
the judges are largely wasted. They evaluate many irises over the course of the
year, and form opinions about each one. That information is lost when they are
instructed to simply vote for the best one. Every time a judge rates an iris
favorably, its chance of receiving an award should go up; every time a judge
rates an iris unfavorably, its chance should go down. Not being seen should not
be a penalty.
A rating system would also encourage new hybridizers, as it
would give us a way to recognize really exceptional irises that aren’t introduced
through the big growers. It would allow hybridizers to build their reputation
by receiving awards for quality work, rather than receiving awards because of
an established reputation. Established hybridizers would not be much hurt by
such a change; they still have the advantage of large, extended breeding
programs and experience in recognizing quality seedlings. They don’t need the
additional advantage of distribution bias to have a fair chance at awards.
I hope this post stimulates some discussion on the topic of
our awards system and the consequences of structuring it as we have. I see the
potential to improve the system in a way that makes it more fair to all new
irises, more useful and credible with the gardening public, more supportive of
new hybridizers, and more conscientious in reflecting the careful evaluation work
of our judges.