Disclaimer: The opinions expressed on this webpage are not necessarily those of the Department of Mathematical Sciences or the University of Stellenbosch.
 

Suggestions for improving the
rating system of the National Research Foundation (NRF)

 

South Africa is the only country in the world that attempts to rate its researchers individually (as opposed to, e.g., the UK, where whole Departments are rated). It doesn't take a rocket scientist to realize that members of the same Math Department evaluating each other isn't pushing objectivity to its extreme. In what follows I list the traps in more detail and propose a solution.

 In a nutshell, evaluation works like this. The so-called Specialist Committee (SC), usually comprising 4 scientists working in the candidate's field, confers one of these eight ratings on the candidate:

A1 = world leader > A2 > B1 > B2 > B3 > C1 > C2 > C3; here is the "precise" definition (cf. 4b) of the rating categories.

The SC reaches its conclusion after having read the reports of international reviewers. Typically 5 reviewers are taken from the candidate's list and 5 are chosen by the SC. There are many pitfalls in this endeavour. Here are the ones I find most upsetting (based on personal experience):

 

1 The Appeals process. If the candidate feels the rating conferred by the SC is inappropriate, he can address the so-called Appeals Committee (AC). Yet the AC need not contain any scientist (say a mathematician) in the applicant's field, let alone an independent mathematician, preferably one from abroad but at least one not known to the SC. Similar remarks apply to the Executive Evaluation Committee (EEC). For instance, the publication rate in biology is much higher than in mathematics. So what is the likely opinion of a biologist with 124 co-authored articles (almost identical, judging from the minuscule title variations) on mole rats about a mathematician with a mere 40 (rather diverse, single-authored) publications? Sometimes the unsatisfied candidate can go through a full-scale re-evaluation earlier than after the usual 5-year period. In this case he again confronts a specialist committee (SC). Unfortunately, the second SC is not independent of the first SC and is not keen to discredit its rating. In fact, each year at most two of the four people in the SC change.

 The other extreme is that informal objections by our Director of Research have raised the ratings of unhappy candidates within a week and without any administrative hassle. It also happens that, with X and Y having served on the same SC, X upon quitting receives a flattering rating from the successor SC chaired by Y.

 

2 Reviewer feedback. Closely related to a potential appeal is of course the feedback one receives from the reviewers. At present the NRF provides only a few short (often cherry-picked) excerpts from the reviewers' reports, which may not at all reflect the general tone of the reports; the bias can go in either direction. I understand that providing all reviewer reports at once, as complete as possible (merely ensuring the anonymity of the reviewers), may result in offending or even humiliating applicants. But if the applicant for whatever reason insists on an appeal, he will be prepared to take that punch.

 

 

3 The current rating process. The biggest problem is the potential bias of the SC for or against the candidate; this is almost unavoidable when the SC contains members of the same department as the candidate, or of an adjacent one. Let us cite from this recent article, which is interesting reading in many other respects as well:

‘the problem of subjective judgments seeping into the rating process was also raised more generically, affecting the natural sciences along with other disciplines, as a function of the inevitable prejudices and biases which shape judgments of peers, often subliminally—a problem which all rating systems have to confront.'

Another problem is the fact that the Specialist Committee, despite its name, often does not contain a specialist working in X's field. For instance, the SC for the Mathematical Sciences can contain two mathematicians, one physicist and one computer scientist, and even the two mathematicians may work in a field distant from X's speciality. Below we point out three more specific shortcomings 3a, 3b, 3c of the current system and indicate partial cures. A more radical cure is outlined in section 4.


3a. On the NRF webpage "Evaluation and Rating" one finds these guidelines for the assessment of reviewer reports. As to guideline number 6, conveniently stated last, this rubber paragraph gives the panel a free hand to decide whether any report is "biased" and hence can be ignored. Because this paragraph promotes the mutation of research administrators into research administraitors, I propose it only be applicable if one or two reports drastically differ (in either direction) from the rest, since then with some probability they are indeed inappropriate; a toy illustration of such a rule is sketched below. Conversely, if a lot of reports (including several from reviewers picked by the SC) point in one direction, it is unlikely that they are all biased.
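
For concreteness, here is a small sketch in Python of how "one or two reports drastically differing from the rest" could be detected mechanically. The numeric report scores, the use of the median absolute deviation, and the threshold are my own assumptions for illustration; nothing of the sort is current NRF procedure.

    from statistics import median

    def flag_outlier_reports(scores, max_flagged=2, tolerance=3.0):
        """Flag at most `max_flagged` reports whose score deviates drastically
        (in either direction) from the median of all reports.
        `scores` maps an anonymous reviewer id to a numeric summary of the
        report (a hypothetical score; the NRF publishes no such number)."""
        med = median(scores.values())
        # median absolute deviation as a robust yardstick for "drastic"
        mad = median(abs(s - med) for s in scores.values()) or 1.0
        outliers = {r: s for r, s in scores.items() if abs(s - med) > tolerance * mad}
        # Only when one or two reports stand out may the panel consider discarding them.
        return outliers if len(outliers) <= max_flagged else {}

    # Invented example: nine concordant reports and one wildly divergent one.
    reports = {f"R{i}": s for i, s in enumerate([7, 8, 7, 8, 7, 9, 8, 7, 8, 2])}
    print(flag_outlier_reports(reports))   # {'R9': 2}

If many reports diverge from each other, the function flags nothing, in line with the point that a large concordant majority is unlikely to be collectively biased.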

3b As to guideline number 1 in the assessment of reviewer reports (read it),  it seems fair enough, but on second thought two quibbles come to mind:

    (i) How many reviewers Y actually read the research outputs of applicant X in detail?
    (ii) If Y has indeed done his job, what is his likely verdict?

    As to (i), not a lot. Unless, of course, Y is a co-author of X. Presently co-authors are allowed as reviewers of an applicant, and neither are they excluded from the application of guideline 1. If Y is not a co-author and has nevertheless read X's article A in detail, then A was likely easy reading and its scientific content presumably low. As to (ii), if Y happens to be a co-author of A, then of course the verdict is sublime, since A is Y's paper as well; if he is not a co-author but read A in detail, then, as noted, A was likely easy reading. To summarize, in the current system X's scientifically deepest papers A are unlikely to be read in detail, except by his biased co-authors. Bad luck if X single-authored A.

3c This leads us to the most puzzling problem of all: How is X's worth to be assessed if he predominantly produces co-authored articles? The current self-assessment by X of his contribution to each of his (main) co-authored articles A in the evaluation period is inaccurate. Even without bad intentions, X is inclined to overestimate his contribution. I am told that this would "likely" be discovered by some reviewer who happened to be a co-author of that article. Admittedly, such a discovery could be detrimental to X's rating. But how likely is it really to detect a cheating X if he usually publishes with 5 co-authors, and they vary from paper to paper? Isn't it easy for X to fine-tune the amount of exaggeration against the risk of discovery? What if X accurately describes his contribution to A but the SC, not comprising any specialist in X's field (cf. above), overestimates its importance? It could also be that for whatever reason some co-authors do not mind if X exaggerates his own contribution. Obviously, a good single-authored article should be esteemed more highly by the NRF than an equally good co-authored article, but often (plenty of evidence available) the opposite happens: a brilliant single-authored article by X counts less for X than an article of the same quality written by X and some big-daddy co-author.

There is a cure to all of this: For each of X's main co-authored articles A, determine the contribution fraction cf(A) (a number between 0 and 1), which is the average of the estimated contributions of X to A as seen by his co-authors. The co-authors are granted anonymity and in turn commit to being ruthlessly honest. The contributed quality of X to A is defined as the product cq(A) = q(A)·cf(A), where the quality q(A) is as defined in 5a. Finally, it doesn't take a psychologist to predict the direction of bias of a panel of dedicated co-authors when assessing a single author.
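
For concreteness, a minimal Python sketch of this bookkeeping: the co-authors' anonymous estimates of X's share are averaged into cf(A), and the contributed quality cq(A) = q(A)·cf(A) follows. The 0-to-1 scale for q(A) and all numbers are invented purely for illustration.

    def contribution_fraction(co_author_estimates):
        """cf(A): the average of the co-authors' anonymous estimates of X's share
        of article A, each a number between 0 and 1."""
        return sum(co_author_estimates) / len(co_author_estimates)

    def contributed_quality(q, co_author_estimates):
        """cq(A) = q(A) * cf(A), with q(A) the quality of A as in 5a."""
        return q * contribution_fraction(co_author_estimates)

    # Invented example: article A has quality q(A) = 0.8 (on an assumed 0-1 scale)
    # and X's three co-authors estimate his share as 40%, 30% and 35%.
    print(round(contributed_quality(0.8, [0.40, 0.30, 0.35]), 3))   # cf(A) = 0.35, so cq(A) = 0.28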

3d The intended rating for applicant X should be calibrated against the ratings of his departmental peers. True, this is difficult if the peer works in a very different field, but still the "enthusiasm" of the respective reviewer reports can serve as a kind of reality check.

 

 

4a An alternative: Direct rating by the reviewers. The NRF maintains, and Gudrun Schirge also emphasized it several times in her talk on 21 November 2007 at the University of Stellenbosch, that it is the reviewers who rate an applicant and not the Specialist Committee. Yet, as is clear from 3a, 4b and elsewhere, such a claim is naïve. So why not put the money where the mouth is? Why not let ten reviewers directly rate the candidate from C3 to A1 and then take the average (a toy illustration of this averaging is sketched after the procedure below)? Here are just a few thoughts about that great idea; details need to be settled. One first establishes appropriate subfields in each science, for instance within mathematics the subfields number theory, combinatorics, numerical analysis, etc. In each subfield eight "sample scientists" are displayed, carrying ratings from C3 up to A1. Whether or not this can or should be done anonymously is debatable. (Currently A, B, C are public anyway, and as long as the sample scientist agrees, why shouldn't his fine rating be public?) One now proceeds as follows.

(i) Candidate X points out to the NRF his (say) five best articles in the evaluation period.

(ii) Each reviewer thoroughly reads at least one of X's best articles and provides the NRF with its quality q(A) (see 5a).

(iii) The NRF computes the contribution fractions cf(A) of X's best articles A as outlined in 3c. These values, along with all five values q(A),
      are sent to all reviewers of X. Furthermore, the reviewers are provided with some or all of the parameters fcc, fpc, snp defined in 5.

(iv) Each reviewer broadly familiarizes himself with X's output in the evaluation period, in addition to his effort in (ii). Then, taking into account
      the parameters obtained in (iii) and the list of sample scientists, he confers a rating on X, ranging from C3 to A1.
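
As promised above, here is a toy Python illustration of the averaging. The mapping of the eight categories onto the scores 1 to 8 and the rounding to the nearest category are assumptions made purely for illustration; other encodings are conceivable.

    # The eight categories in increasing order, mapped to the scores 1..8 (an assumed encoding).
    CATEGORIES = ["C3", "C2", "C1", "B3", "B2", "B1", "A2", "A1"]

    def average_rating(reviewer_ratings):
        """Average the direct reviewer ratings and round to the nearest category."""
        scores = [CATEGORIES.index(r) + 1 for r in reviewer_ratings]
        mean = sum(scores) / len(scores)
        return CATEGORIES[round(mean) - 1], mean

    # Invented example with ten reviewers:
    ratings = ["B1", "B2", "B1", "A2", "B1", "B2", "B1", "B1", "B3", "B1"]
    print(average_rating(ratings))   # ('B1', 5.7)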

4b Here are some benefits of this system:

(a) Most importantly, the potential bias of the SC is eradicated. There is still the potential bias of the reviewers, but it is likely smaller, simply for the following reason: the reviewers are international leaders in X's field and most of them do not know X personally, whereas the local members of the SC are necessarily of a lesser scientific calibre (not seldom lesser than X's, to the extent that they may never have produced a single-authored paper) and often know X personally.

(b) A potential bias of the SC is further promoted by the fact that the current rating categories A1 to C3 are rather vaguely defined in the first place. For instance, what really is the difference between B3 and B2 here?

B2: Reviewers are firmly convinced that the applicant is an independent researcher enjoying considerable international recognition for the high quality and impact of his/her recent research outputs.

B3: Most of the reviewers are convinced that the applicant is an independent researcher enjoying considerable international recognition for the high quality and impact of his/her recent research outputs.

It seems to be "firmly convinced" as opposed to "convinced". And what is the definition of "convinced"? The mind boggles. It is clear that quite different ratings for X and Y respectively can be distilled by the SC (or one eloquent member of it) from very similar reports. This is possible even for X and Y working in the same field, let alone different fields (cf. (c)). But it can be mended if the SC is replaced by ten independent reviewers and if the rating categories are redefined more precisely and supplemented by a list of sample scientists.

(c) The scientific subfields are now (4a) clearly defined and not too small. That blocks scientists from becoming A2-rated by virtue of being "world leaders" in a small field which may comprise only a few dozen people publishing in it. Furthermore, the calibration 3d becomes superfluous, since calibrating X's rating against carefully rated sample scientists is superior.

(d) The cumbersome writing and reading of reviewer reports (3) now evaporates, and so does

(e) the troublesome feedback question 2.

(f) If the NRF doesn't want to get rid of the panel altogether, one could at least weaken its potential bias by restricting its rating to be in a window determined by the lowest and highest among the ten reviewer ratings.

 

 

5 Numerical parameters. We argue that, apart from screening the reviewer reports (which anyway ceases under the recommended direct evaluation of 4), the performance of candidate X should furthermore be assessed by computing some numerical parameters. Depending on the particular parameter, this can be done in a more or less mechanical way. The often-cited "checks and balances" within the rating system would become more effective and transparent through a wise and systematic use of the four parameters discussed below.

5a The quality of X's scientific output is reasonably well reflected by the totality of reviewer reports; the problem lies more within the SC itself and the vague rating categories. Additionally (or alternatively, if 4 is followed), it is desirable to have X's best articles A assessed individually. That is, the quality q(A) of X's five main articles A should be determined and pinned down on a numerical scale by at least two reviewers who are not co-authors of A yet commit to reading A in detail. In practice that may boil down to paying these reviewers. By definition q(A) is independent of the number of co-authors (that is taken care of separately), but a plethora of other criteria that make up q(A) suggest themselves. To mention just one: if A solves a long-standing open problem that has been stated as such in the literature, this should boost q(A).

5b The fair citation count (fcc) is supposed to measure the impact of X's research outside his circle of friends. I propose that a fixed article A that cites some fixed article B of X should be accounted for as follows. First, in order for A to have any effect at all, the group of authors of A must not contain X (thus discarding self-citations) and, more painfully, must not contain anybody who has ever been a co-author of X. That condition being satisfied, the contribution of A to B's count fcc(B) should be 1/n, where n is the number of authors of B. The fraction 1/n is not a slighting of B's impact; it only takes into account that in the same time in which a hypothetical single author of X's calibre writes 1 article of B's calibre, the co-authoring X produces n such articles (assuming all co-authors contribute equally). By definition fcc is the sum of all fcc(B), where B ranges over some period of time. One could argue that for the fcc the 7-year evaluation period needs to be extended, since citations take longer to kick in. Concerning the controversial impact factor of a journal: what good does it do to publish in a journal with a high impact factor (i.e. on average its articles are cited often) when your own article collects few citations?
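
A minimal Python sketch of the fcc as just defined. The toy data and the way author lists and citation links are represented are invented; how such data would be harvested in practice is left open.

    def fcc(own_articles, citing_articles, x, co_authors_ever):
        """Fair citation count of researcher x.
        own_articles:    maps each article B of x to its set of authors.
        citing_articles: list of (authors_of_A, articles_of_x_cited_by_A) pairs.
        co_authors_ever: everybody who has ever co-authored with x.
        A citation of B by A contributes 1/n (n = number of authors of B), and only
        if no author of A is x or a past co-author of x."""
        excluded = co_authors_ever | {x}
        total = 0.0
        for authors_of_a, cited in citing_articles:
            if authors_of_a & excluded:
                continue                       # self-citation or circle of friends: discarded
            for b in cited:
                total += 1.0 / len(own_articles[b])
        return total

    # Invented toy data: X has a single-authored paper B1 and a three-author paper B2.
    own = {"B1": {"X"}, "B2": {"X", "P", "Q"}}
    citations = [({"R", "S"}, {"B1", "B2"}),   # independent authors: counts 1 + 1/3
                 ({"P"}, {"B2"})]              # P is a past co-author of X: discarded
    print(fcc(own, citations, "X", co_authors_ever={"P", "Q"}))   # 1.333...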

5c In order also to properly assess the quantity of X's scientific output, his publications must not simply be counted, but rather be weighted not only according to the number of co-authors but also by the number of pages. Define fpc to be the resulting fair publication count over the 7-year evaluation period. True, quality is more important than quantity, but still the fpc should not be belittled.
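
The precise weighting is left open above; one plausible reading, namely crediting each article with its number of pages divided by its number of authors, is sketched here in Python purely as an illustrative assumption, not as the intended formula.

    def fpc(articles):
        """Fair publication count over the evaluation period, under the assumed
        weighting pages / number_of_authors; `articles` is a list of
        (number_of_pages, number_of_authors) pairs."""
        return sum(pages / authors for pages, authors in articles)

    # Invented example: a 30-page single-authored paper and a 12-page paper with 4 authors.
    print(fpc([(30, 1), (12, 4)]))   # 33.0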

5d And there is what one could call the social networking parameter snp, which somehow takes into account X's activities in attending or organizing conferences, his invited talks and keynote addresses, his activities as a reviewer for journals, his membership on editorial boards, and the like. This kind of snp usually cannot be coupled to any specific scientific achievement, and it does indeed happen that social networking rather than scientific skill causes a high snp. Of course, things are different if the snp incorporates awards for scientific achievements (like the Wolf Prize). The snp is the vaguest of all parameters. Whether and how the snp can be reflected by a single number remains an open question.

5e Both the fcc and fpc are highly sensitive to the scientific field; an fcc of 20 is low in the medical sciences but can be decent for a mathematician. Even within mathematics the differences are considerable, e.g. more people work in numerical analysis than in number theory, and the average fcc's of a numerical analyst and a number theorist differ accordingly. Once the scientific field has been accounted for, the fcc and fpc should obviously still influence X's rating to a lesser extent than the reviewer reports. It makes sense, however, to use the fcc and fpc as lower and upper bounds for ratings. For instance, a mathematician with an fcc smaller than 30 cannot be A-rated, and a mathematician with an fcc bigger than 50 cannot be C-rated. This would prevent (which has happened!) an A2-rated mathematician (a "world leader"?) from having fewer citations than a C3-rated one. These remarks apply whether the current (3) or the direct (4) rating system is used.
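
To illustrate how such bounds could act as a crude sanity check, here is a small Python sketch. The thresholds 30 and 50 are the illustrative values from the example above and would apply to mathematics only; other fields would need their own thresholds.

    def admissible_bands(fcc_value, lower=30, upper=50):
        """Rating bands (A, B, C) that remain admissible for a mathematician, using
        the illustrative thresholds from the text: fcc < 30 excludes an A rating,
        fcc > 50 excludes a C rating."""
        bands = {"A", "B", "C"}
        if fcc_value < lower:
            bands.discard("A")
        if fcc_value > upper:
            bands.discard("C")
        return sorted(bands)

    print(admissible_bands(20))   # ['B', 'C']  (cannot be A-rated)
    print(admissible_bands(70))   # ['A', 'B']  (cannot be C-rated)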

Marcel Wild, December 2007