
Scientometrics, Vol. 19, Nos 1-2 (1990) 91-106

NEGLECTED CONSIDERATIONS IN THE ANALYSIS OF AGREEMENT AMONG JOURNAL REFEREES

L. L. HARGENS,* J. R. HERTING**

*Department of Sociology, University of Illinois, 702 South Wright Street, Urbana, IL 61801 (USA)

**Department of Sociology, Stanford University, Stanford, CA 94305 (USA)

(Received September 4, 1989)

Studies of representative samples of submissions to scientific journals show statistically significant associations between referees' recommendations. These associations are moderately large given the multidimensional and unstable character of scientists' evaluations of papers, and composites of referees' recommendations can significantly aid editors in selecting manuscripts for publication, especially when there is great variability in the quality of submissions and acceptance rates are low. Assessments of the value of peer-review procedures in journal manuscript evaluation should take into account features of the entire scholarly communications system present in a field.

Introduction

Referees' assessments of manuscripts submitted to scholarly journals play a key role in determining editors' eventual decisions about the fates of those manuscripts. For example, in a study of the determinants of the final dispositions of manuscripts submitted to the American Sociological Review, Bakanic et al. found that the zero-order correlation between the mean of referees' overall recommendations and final manuscript dispositions equalled 0.81.1 When they regressed final manuscript dispositions on mean referees' recommendation and 23 other variables describing features of manuscripts and the manuscript-evaluation process, the standardized partial regression coefficient for the mean referees' recommendation, 0.74, was only slightly smaller than its zero-order correlation.2 Thus, it is not surprising that both researchers and commentators have devoted considerable attention to the topic of inter-referee agreement on assessments of manuscripts.3-7

This literature, and associated work on the selection of papers for presentation at scholarly meetings8,9 and the selection of research proposals for funding,10,11 has four characteristics. First, it is fragmented. Although psychologists tend to be more active researchers in this area than scholars from other disciplines, studies on the topic have been reported by researchers in a wide range of fields. Much of the research on journal referee agreement has been carried out by journal editors, who have privileged access to data, and has been reported in their annual reports or in special editorials. Few researchers have published more than one report on the topic, and the literature largely consists of opportunistic uses of available data rather than sustained efforts to explore the topic. Second, the literature exhibits a high ratio of commentary to data.5 A recent paper on journal referee agreement3 revealed that as of 1983 complete data for a crosstabulation of referees' assessments had been published for only two journals! Third, nearly all of this research takes a "psychometric" perspective on the topic. This perspective views submitted manuscripts as varying in their "merit" or "publishability," and asks whether referees' assessments can be expected to provide editors with reliable information about this latent variation.12 To the extent that referees' assessments index manuscripts' relative merit, referees' independent evaluations of manuscripts should be correlated.13 Finally, it has been the fashion, in recent years at least, to emphasize, if not exaggerate, the low level of agreement between peer-reviewers' evaluations.14

In this paper we discuss several points that researchers should consider in carrying out and interpreting studies of referee agreement. Some of these points are general methodological principles: for example, that variables with restricted variation tend to have low correlations with other variables, and that measures with relatively low validities can still be quite useful in selection procedures. Other points noted below involve structural and statistical characteristics of the manuscript evaluation process itself, such as how editors' practices in choosing referees may limit the observed level of agreement between referees' evaluations, and how the degree of literature concentration in a field may counteract the effects of referee unreliability at an individual journal. Those who report or comment on referee agreement studies typically neglect one or more of the points we discuss. In general, we believe that taking them into account leads one to be more sanguine about the value of peer-review procedures in evaluating manuscripts than many recent discussions suggest.

Is referee agreement low?

We note first that many studies of referee agreement were predestined to yield low levels of agreement because they were based on samples of papers with restricted variation in merit. For example, in an early study of agreement between expert judges' evaluations of papers, Bowen et al. took advantage of data yielded by a competition for "best paper" among papers submitted to the Division of Consumer Psychology for presentation at the 1971 convention of the American Psychological Association.9 Sixteen papers were originally submitted for the competition, and a three-person committee then selected the top eight. These eight papers were then judged by 10 past presidents of the Division of Consumer Psychology. Bowen et al. reported that:

The reliability of the ratings, even with such a distinguished panel, was low. Kendall's coefficient of concordance (W) for eight papers ranked by 10 judges was 0.106 (0.50 > p > 0.30).

Elementary statistics texts commonly discuss the fact that correlational measures are attenuated by restrictions in the variation of variables,18 and Bowen et al. were circumspect about the significance of their findings for that reason. Those who discuss the Bowen et al. study, however, commonly report the low coefficient of concordance and its lack of statistical significance, but fail to mention the restricted variation in paper quality that probably contributed significantly to those results.14

Several data-gathering strategies common to research in this area will almost always lead to samples of papers with a restricted range of variation in scholarly merit. For example, studies of previously published papers,7,19 and studies that overrepresent highly-cited papers, such as Small's study of highly cited papers in chemistry, should yield samples of papers that are nonrepresentative because they are too homogeneous, and this homogeneity will tend to produce low correlations between independent assessments of their merit.20 Of course, even studies based on all submissions to a given journal, for example, will be subject to some degree of truncation in variation because authors do not usually submit papers to journals that are extremely unlikely to accept them (or to low-visibility journals, which may be highly likely to accept them). This truncation is a part of the normal scholarly publishing process, as opposed to the severe truncations often imposed by researchers.21

Studies of referees' overall evaluations of fairly representative samples of submissions to scholarly journals typically report intraclass correlation coefficients in a range between 0.2 and 0.3.22-24 Those who characterize coefficients of this magnitude as "low" rarely specify what they judge to be an acceptable level of agreement, but they apparently are concerned by the fact that these correlations are closer to zero than to unity. Obviously, few would expect them to be in the neighbourhood of 0.9; such correlations would make the use of more than a single referee wasteful in time and expense because referees' opinions would be largely redundant. Is it reasonable, however, to expect intraclass correlations between referees' evaluations of manuscripts to be higher than 0.6? We believe that characteristics of the manuscript review process make even this level of correlation unlikely. For example, editors frequently choose referees whom they expect to judge different aspects of a manuscript, perhaps one to judge its statistical procedures and another to judge its substantive contribution.25 Insofar as these different aspects are not highly correlated, the referees' assessments will show low agreement. A more extreme limitation on the achievable level of agreement between referees is present in fields where scholars are divided into competing "schools" or "camps." Editors in these fields sometimes solicit referees from the members of opposing groups in order to get opinions from both sides of a controversy. If an editor were to follow this strategy exclusively, one would expect referees' overall evaluations of manuscripts to be negatively correlated, and mixing this strategy with a strategy of soliciting the opinions of neutral referees should produce a low correlation between referees' evaluations.

Agreement between referees' assessments is also limited by an inherent element of uncertainty in making evaluations of manuscripts. Griffith noted that a scientific paper is a "complex intellectual product whose meaning and value derive from a complex, highly dynamic intellectual environment."26 Research on scientists' judgments of the relevance of papers to their current work has shown these judgments to be unstable,27 and Griffith summarizes the implications of this circumstance with the rhetorical question:

If a scientist cannot agree with himself on whether the content of a document relates to his own work, is it surprising that he cannot agree with others on the quality of the document?

Given the inherent uncertainty and instability in researchers' evaluations of manuscripts, it seems likely that the highest attainable intraclass correlation coefficient between referees' overall evaluations of manuscripts is far below 0.9; perhaps 0.5 or 0.6 is the upper bound. If so, observed coefficients between 0.2 and 0.3 may indicate reasonable levels of agreement, especially in view of the above-discussed effects of editors' strategies in selecting referees.

Are referees' evaluations too unreliable to be useful?

Regardless of whether observed levels of referee agreement are reasonable given editors' strategies in selecting referees and the inherent uncertainty associated with judging scientific manuscripts, one may question whether referees' judgments are reliable enough to be useful in selecting papers for publication. Those who have expressed doubt about the value of referee evaluations often base their arguments on unstated and incorrect assumptions. A case in point is Lindsey,15,17 who, after suggesting that intraclass correlations between referees' evaluations are unlikely to be higher than 0.25, treats that figure as an estimate of the reliability of the peer-review process as a whole. Although under the assumptions of classical test theory28 the intraclass correlation coefficient can be interpreted as a reliability coefficient, it is the reliability of a single referee's evaluation, not the reliability of a composite formed from all referees' opinions.29 Under the assumptions of classical test theory, one would estimate the reliability of a composite evaluation combining the individual evaluations of N referees according to the formula:

$$R_c = \frac{N R_i}{1 + (N - 1) R_i}$$

where $R_c$ is the reliability of the composite and $R_i$ is the reliability of an individual referee's evaluation. Thus, if the reliability of individual referees' evaluations of manuscripts equals 0.25, the reliability of a composite evaluation based on two independent referees' evaluations equals 0.40 and the reliability of a composite based on three evaluations equals 0.50. Lindsey argues further that when the reliability of journal peer-review systems falls much below 0.25, those systems are almost useless. His argument on this score is based in part on his extension of a journal decision-making model developed by Stinchcombe and Ofshe.30 The S-O model postulates that papers vary on a dimension of "true quality", that the judged quality of papers is imperfectly correlated with their true quality, and that the joint distribution of true quality and judged quality is bivariate normal. Stinchcombe and Ofshe also assume that the composite reliability of judged quality (i.e., that produced by an overall judgment based on referees' and editors' evaluations) equals 0.5, and that this reliability level is due only to covariation with true quality. They then derive the probabilities of acceptance for papers of varying true quality levels, and also the number of papers at varying true quality levels that would be accepted from a hypothetical cohort of 100 submissions. Lindsey recalculated the numbers of papers from various true quality levels that would be accepted if the reliability of judged quality equals 0.25, which he mistakenly took to be the reliability level of composite evaluations implied by existing studies of referee reliability,31 and concluded that:

"If the level of reliability were to slip much further than the assumption of 0.25 (which is optimistic in view of empirical studies), it would almost be the case that all papers have an equal likelihood of being accepted."32

Lindsey does not show the probabilities of acceptance of papers of differing true quality levels when the reliability of judged quality equals 0.25, but Table 1 gives those probabilities as well as the probabilities associated with selected reliability levels between 0.0 and 0.5.

Table 1
Probabilities of acceptance generated by the Stinchcombe-Ofshe model for papers of differing levels of true quality, by composite reliability of judged quality

                              Probabilities of acceptance
True quality                Reliability of judged quality
standard score       0.0      0.1      0.25     0.4      0.5

-3 to -2            0.160    0.029    0.005    0.000    0.000
-2 to -1            0.160    0.061    0.022    0.006    0.002
-1 to  0            0.160    0.111    0.075    0.045    0.027
 0 to  1            0.160    0.187    0.192    0.189    0.176
 1 to  2            0.160    0.291    0.386    0.472    0.528
 2 to  3            0.160    0.544    0.614    0.773    0.858
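The computational details behind Table 1 are not spelled out here, so the minimal sketch below rests on one plausible, purely illustrative reading of the model: true and judged quality are standard bivariate normal with correlation equal to the square root of the composite reliability, a paper is accepted when its judged quality is at least one standard deviation above the mean (roughly the top 16 percent of submissions), and each band's probability is evaluated at the band midpoint. Under these assumed details the sketch reproduces most, though not all, of the Table 1 entries to within rounding.

```python
# Illustrative sketch only: the bivariate-normal reading described above is an
# assumption, not the original computation.
from math import erf, sqrt

CUT = 1.0  # acceptance threshold on judged quality (top ~16 percent of submissions)

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_accept(true_quality, reliability):
    """P(judged quality >= CUT | true quality), under the assumptions stated above."""
    if reliability == 0.0:
        return 1.0 - phi(CUT)                  # judgments carry no information
    rho = sqrt(reliability)                    # assumed corr(true, judged)
    resid_sd = sqrt(1.0 - rho ** 2)            # sd of judged quality given true quality
    return 1.0 - phi((CUT - rho * true_quality) / resid_sd)

for lo, hi in [(-3, -2), (-2, -1), (-1, 0), (0, 1), (1, 2), (2, 3)]:
    mid = (lo + hi) / 2                        # band midpoint
    row = [round(p_accept(mid, r), 3) for r in (0.0, 0.1, 0.25, 0.4, 0.5)]
    print(f"{lo:>2} to {hi:>2}:", row)
```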

Lindsey bases his claim that a manuscript evaluation process with a reliability of 0.25 produces very poor results on a judgment that the probabilities of acceptance do not vary enough across levels of true quality. As a result, papers from high true-quality levels are often judged to be of mediocre quality, and papers of high judged quality are often of mediocre true quality. For example, Table 1 shows that a paper in the highest true-quality category, between two and three standard deviations above the mean, has nearly a 0.4 probability of rejection. From the perspective of authors, this situation is unfair because some authors of high true-quality papers have their work rejected while some authors of low true-quality papers have their work accepted. Journals employ peer-review systems more to help editors select meritorious submissions than to ensure that authors are treated fairly, however. Let us take the perspective of an editor, then, and explore the question of whether a peer-review system of low reliability can be useful.

In the S-O model 16 percent of submissions are "high quality" papers (arbitrarily defined as being at least one standard deviation above the mean of true quality): 14 percent of all submissions fall between one and two standard deviations above the mean and 2 percent fall above two standard deviations. An editor blessed with a manuscript evaluation system with a reliability equal to 0.5 will accept 57 percent of those papers.33 In contrast, an editor with a totally unreliable evaluation system (reliability equals 0.0) will accept only 16 percent of the high quality papers (see Table 1). If the reliability of a single referee's evaluations equals 0.25 and if the editor bases her decisions about individual papers only on a composite formed from two referee evaluations, the composite's reliability will be 0.4 and she will accept 51 percent of the high quality papers. If the individual referee evaluations have a reliability slightly higher than 0.13, the reliability of the composite will be 0.25, and 42 percent of the high quality papers will be accepted. Finally, if the individual referee evaluations have a reliability of only 0.053, the reliability of the composite will be 0.10, and 30 percent of the high quality papers will be accepted. Thus, even with very low individual referee reliabilities, an editor can substantially improve her performance beyond chance in detecting high quality submissions by using the composite evaluations of referees.34 The reader can judge Lindsey's claim that if the (composite) reliability of referees' evaluations falls much below 0.25, it is "almost ... the case that all papers have an equal likelihood of being accepted" by comparing the first and second columns of results in Table 1.
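As a rough check on the figures just cited, the illustrative sketch below first applies the composite-reliability formula given earlier (the Spearman-Brown form) and then combines Table 1's acceptance probabilities with the 14 and 2 percent band shares; the small departures from the 42 and 30 percent figures quoted above presumably reflect rounding in the table entries.

```python
# Illustrative sketch, not the original analysis.

def composite_reliability(r_individual, n_referees):
    """Spearman-Brown form quoted in the text: R_c = N*R_i / (1 + (N-1)*R_i)."""
    return n_referees * r_individual / (1.0 + (n_referees - 1) * r_individual)

print(composite_reliability(0.25, 2))    # 0.40, as stated in the text
print(composite_reliability(0.25, 3))    # 0.50
print(composite_reliability(0.053, 2))   # ~0.10

# Share of "high quality" papers accepted: 14% of submissions lie 1-2 SD above the
# mean of true quality and 2% lie above 2 SD, so high-quality papers make up 16%.
table1 = {   # composite reliability -> (P(accept | 1-2 SD), P(accept | 2-3 SD))
    0.00: (0.160, 0.160),
    0.10: (0.291, 0.544),
    0.25: (0.386, 0.614),
    0.40: (0.472, 0.773),
    0.50: (0.528, 0.858),
}
for rel, (p12, p23) in table1.items():
    accepted = (0.14 * p12 + 0.02 * p23) / 0.16
    print(f"composite reliability {rel:.2f}: {accepted:.1%} of high-quality papers accepted")
```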

Although it enables one to derive certain outcomes of the journal manuscript evaluation process easily, the S-O model and its extension by Lindsey do not represent that process faithfully. Specifically, these models neglect the fact that while being evaluated for publication, manuscripts are sorted into different groups that receive different treatments.35 We believe that the differing treatments will tend to yield fewer "errors" in manuscripts' final dispositions than Lindsey predicts. For example, behavioural-science journals tend to use the opinions of initial referees as a screening mechanism to determine which submissions are to be given additional consideration for possible acceptance. Those manuscripts that receive unanimously negative evaluations from the initial referees are usually rejected without further consideration, while those that receive unanimously positive or split evaluations will usually be given further evaluation. In cases where initial referees give unanimously positive evaluations of a manuscript, editors typically read the manuscript themselves and sometimes even call upon one or more additional referees for evaluations. As a result, some of the mediocre papers that slip by the initial referees will be detected by the editor and additional referees, thereby reducing the likelihood that such papers are accepted for publication. Similarly, papers that receive split decisions from the initial referees are almost always sent out to one or more additional referees for further evaluation, and this should increase the overall evaluations given to the better papers in this category relative to those given to the poorer papers. In general, then, Lindsey's extension of the S-O model probably overestimates the number of "errors" in manuscript dispositions produced by peer review of submissions because it does not adequately reflect the actual peer-review systems used by scholarly journals.36


In summary, we believe that Lindsey presents an overly pessimistic portrayal of scholarly journals' manuscript-evaluation processes. That portrayal is unjustified because it is based on three errors: (1) treating observed reliability coefficients as estimates of the reliability of a composite evaluation of all referees rather than as estimates of the reliability of individual referees' judgments, (2) using a "fairness to authors" standard for judging the utility of existing manuscript-evaluation procedures rather than asking whether those procedures significantly enhance editors' ability to detect superior manuscripts, and (3) using an analytic model which does not adequately reflect the procedures scholarly journals actually follow in evaluating submissions.

Some statistical considerations

Data gathered on referees' evaluations of manuscripts submitted to scholarly journals differ significantly from data obtained by experimental studies of inter-rater reliability. The latter typically involve two or more raters, each of whom evaluates a common set of stimuli. Thus, when there are two raters, their agreement can be described in an R x R table, where R equals the number of categories raters use to rate the stimuli; one rater's ratings are represented in the columns of the table and the other rater's are represented in the rows. In contrast, studies of journal referee agreement are typically based on data for a large number of raters, with different pairs of raters (in the typical case where a journal sends manuscripts to two initial referees) evaluating different manuscripts. Since different pairs of referees judge different manuscripts, the choice of which referee's evaluation to represent in the column categories of an R x R table is arbitrary. Thus, it is inappropriate to represent such data in terms of a full R x R table, as is a common practice in the journal referee literature. A more appropriate way to summarize data on referees' evaluations of manuscripts is to enter the number of manuscripts receiving each possible combination of evaluations in the appropriate cells along the main diagonal and in the lower (or upper) triangle of an R x R table. Table 2 presents such a table for data recently gathered on a cohort of first submissions to the American Sociological Review.23

These data indicate, for example, that only eleven of the 322 manuscripts submitted to the ASR received the rating "accept, conditional on minor editorial changes" from both of the two initial referees, a proportion only slightly greater than 0.03.


Table 2
Number of submissions to the American Sociological Review by combination of referees' recommendations (number of submissions equals 322)

                                                   Referee rating         Percent of referee
Referee rating                                    1     2     3     4       ratings, %

1. (Accept, conditional on minor                 11                             12
   editorial changes)
2. (Accept, conditional on author's              13     4                        9
   substantial revision)
3. (Request to revise and resubmit)              14    18    16                 21
4. (Reject without qualification)                30    22    68   126           58
                                                                                100

It is obviously impossible to derive referee-specific information, such as a particular referee's mean recommendation, from this type of table. Instead, summary statistical data derived from such tables can only be interpreted as averages derived from the evaluations of a sample of referees whose evaluations are solicited by the journal in question. For example, Table 2 also shows that the proportion of all recommendations to the ASR that are "accept, conditional on minor editorial changes" equals 0.12. This figure cannot be interpreted as describing any particular referee's recommendations, only the aggregate recommendations of all the referees consulted by ASR.

Because referee-recommendation data should not be summarized in terms of full R x R tables, it is also inappropriate to use the standard chi-square test of independence to determine if referees' recommendations are statistically independent.37 It is easy to determine the expected frequencies under the model of independence for referee-recommendation tables, however, and to calculate either a Pearsonian or a likelihood-ratio chi-square statistic to test that model. Specifically, let the frequency in the ith row and jth column of a referee-recommendation table be symbolized as $f_{ij}$. Following the convention of Table 2, we stipulate that j cannot be greater than i. Thus, a table summarizing referees' recommendations when referees use R evaluation categories consists of (R² - R)/2 + R cells. The total number of manuscripts, M, represented in such a table is given by:

$$M = \sum_{i=1}^{R} \sum_{j=1}^{i} f_{ij}$$

and since each manuscript receives two initial referee evaluations, the total number of referee recommendations equals 2M. To determine the expected frequencies under the model of independence for these cells, one must calculate the proportion of recommendations that fall into each of the R evaluation categories. Letting $P_i$ symbolize the proportion of referee recommendations that fall in the ith recommendation category, we have:

$$P_i = \Big( 2 f_{ii} + \sum_{j<i} f_{ij} + \sum_{k>i} f_{ki} \Big) \Big/ 2M$$

For example, to determine the proportion of "accept, conditional on author's substantial revision" recommendations in Table 2 ($P_2$), one first adds twice the frequency in cell 22 to the frequencies in cells 21, 32, and 42. This total equals 61, which when divided by 644 (2M = 2 × 322) equals 0.09, as shown in the rightmost column of Table 2. Finally, the expected frequencies under the model of statistical independence, $F_{ij}$, for the cells of a referee-recommendation table equal:

$$F_{ij} = \begin{cases} P_i^{2} M & \text{when } i = j \\ 2 P_i P_j M & \text{when } i \neq j \end{cases}$$

Once these expected frequencies are determined, one can calculate the Pearsonian chi-square statistic as:

$$X^2 = \sum_{i=1}^{R} \sum_{j=1}^{i} \frac{(f_{ij} - F_{ij})^2}{F_{ij}}$$

and the likelihood ratio chi-square as:

$$L^2 = 2 \sum_{i=1}^{R} \sum_{j=1}^{i} f_{ij} \ln(f_{ij}/F_{ij})$$


The total degrees of freedom available in a referee-recommendation table such as Table 2 equals the number of cells in the table minus 1, or (R² - R)/2 + R - 1, and since R - 1 nonredundant proportions are estimated from the data in order to determine the expected frequencies under the model of independence, there are (R² - R)/2 degrees of freedom associated with that model. The data in Table 2 yield a likelihood-ratio chi-square value of 28.1 with 6 d.f., and this result allows one to reject the hypothesis that the ASR referees' recommendations are statistically independent with great confidence (α < .001). Hargens and Herting show that all published tables of journal referees' recommendations yield similarly significant chi-square values when one tests the model of statistical independence, and this indicates that the associations between referees' recommendations in those tables are extremely unlikely to have been produced by chance.38
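A minimal sketch of this computation, using the Table 2 counts, is given below; it is illustrative code rather than the original analysis, and under the formulas above it yields a likelihood-ratio chi-square near the 28.1 reported, with 6 degrees of freedom.

```python
# Illustrative sketch applying the independence test above to the Table 2 counts for
# the American Sociological Review (lower-triangular cells f[(i, j)], with j <= i).
from math import log

f = {
    (1, 1): 11,
    (2, 1): 13, (2, 2): 4,
    (3, 1): 14, (3, 2): 18, (3, 3): 16,
    (4, 1): 30, (4, 2): 22, (4, 3): 68, (4, 4): 126,
}
R = 4
M = sum(f.values())                      # 322 manuscripts, hence 2M = 644 recommendations

# P_i: proportion of all recommendations falling in category i
P = {}
for i in range(1, R + 1):
    off_diag = sum(n for (a, b), n in f.items() if (a == i) != (b == i))
    P[i] = (2 * f[(i, i)] + off_diag) / (2 * M)

# Expected frequencies under independence: P_i^2 * M on the diagonal, 2*P_i*P_j*M off it
F = {(i, j): (P[i] ** 2 * M if i == j else 2 * P[i] * P[j] * M) for (i, j) in f}

X2 = sum((f[c] - F[c]) ** 2 / F[c] for c in f)    # Pearsonian chi-square
L2 = 2 * sum(f[c] * log(f[c] / F[c]) for c in f)  # likelihood-ratio chi-square
df = (R ** 2 - R) // 2                            # (R^2 - R)/2 = 6 degrees of freedom

print({i: round(p, 3) for i, p in P.items()})     # ~0.12, 0.09, 0.21, 0.58, as in Table 2
print(round(X2, 1), round(L2, 1), df)             # L2 comes out near the 28.1 reported
```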

Researchers can estimate a number of other models from the data contained in referee-recommendation tables, one promising class being the "RC association" models developed by Goodman.39 These models enable one to estimate relative distances between the evaluation categories used by referees, thereby overcoming the need to make arbitrary assumptions about the distances between categories before one computes coefficients such as the intraclass correlation. For example, Hargens and Herting23 present evidence that the distance between the two least favourable recommendations (e.g., "reject" and "revise and resubmit") tends to be larger than the distances between any other two adjacent referee recommendation categories.

The RC association models also make it clear that one can determine only ratios of distances between recommendation categories,40 and this limitation has important implications for attempts to compare the degree of agreement exhibited by different journals' referees, different funding programs' proposal reviewers, etc. Researchers have approached this issue in two ways, both of which now appear to be unjustified. First, some have argued that researchers can compare various correlational measures, such as the intraclass correlation coefficient, across journals, funding programs, etc.41 This general approach is inadequate because such measures are influenced not only by the amount of actual disagreement among referees, but also by the total amount of variation exhibited by the manuscripts being evaluated.28 Ceteris paribus, the intraclass correlation between referees' recommendations will be larger if referees evaluate manuscripts of widely varying merit than if they evaluate manuscripts that vary little in merit. Hargens16 noted that one would expect fields with relatively high levels of referee disagreement to also exhibit great variation in the merit of submitted manuscripts, and that these two offsetting forces may produce little variation across fields in correlational measures of agreement. To avoid this problem, both Hargens and Cole et al. suggest that researchers can compare measures of within-manuscript (or proposal) variation in referees' evaluations to obtain evidence about differential agreement.42 This is possible, however, only insofar as the distances between equivalent recommendation categories are equal across journals. Because researchers can statistically determine only the relative sizes of ratios of the distances between categories for a given journal, there is no way to assess whether the condition of equal distances between categories is present across journals. It seems inappropriate to assume this condition to be present a priori because journals in different fields typically use different standards to judge whether submissions deserve publication.43 Thus, we believe that currently available methods do not enable researchers to estimate measures of within-manuscript variation in referees' evaluations that are comparable across scholarly journals.

The broader context of referee unreliability

Discussions of the reliability of referees' evaluations typically focus on the ill effects of low levels of reliability for the assessment of a manuscript by a single journal. If an author's paper is inappropriately rejected by a journal, it is argued, both the author and the scholarly community suffer. This view neglects the role that discipline-wide journal structures play in the process of scientific communication. Several researchers have noted that fields differ in the extent to which their literatures are concentrated in a few journals.44-46 In some fields, a single highly visible journal publishes a major proportion of the literature. Examples include the Astrophysical Journal in astronomy and astrophysics, and the Physical Review in physics. Scholars in these fields who fail to place a paper in the dominant journal usually must settle for placing it in a much less visible outlet. In contrast, other fields have diffuse journal structures characterized by large numbers of journals, each publishing a very small fraction of the field's literature. Journals in these fields tend to exhibit smaller differentials in visibility, with a set of "core" journals serving as high-prestige general outlets and large numbers of more specialized (by topic, method, etc.) journals making up the remainder of the prestige hierarchy. When an author in such a field fails to place a paper in one of the core journals, there is always the option of submitting it to another core journal; and after failing to place it in any of those journals it is still possible to submit it to any of a group of medium-visibility journals.

Studies of field variation in journal concentration suggest that the behavioural sciences have relatively diffuse journal structures, while the physical sciences have relatively concentrated structures.44-46 As a result, even if unreliability in referees' evaluations results in the inappropriate rejection of a behavioural-science paper at one journal, its author(s) often may try again at another journal of roughly equivalent visibility. Garvey et al. report, for example, that compared to core physical-science journals, core behavioural-science journals publish a much greater proportion of papers that were previously rejected by another journal. Thus, the diffuse journal structures that characterize the behavioural sciences tend to ameliorate the errors and injustices that result from unreliable referee evaluations of manuscripts at any one journal.

One might expect referee unreliability to be a much greater threat to the scientific communication systems of fields that have highly concentrated journal structures because inappropriate rejections there are much more likely to relegate worthy papers to very low visibility journals. Core journals in such fields minimize this threat, however, by accepting a substantial majority of all submissions. These high acceptance rates, ranging from roughly 75 to 90 percent, result from the use of decision rules that attempt to minimize the likelihood that worthy papers will be rejected,43 and also from the use of the single-initial referee system rather than the two-initial referee system.47 Thus, in fields with concentrated journal structures the potential impact of unreliable referee evaluations is reduced by other features of the overall system of manuscript evaluation.

In general, we believe that it is important to view the problem of referee unreliability, as well as other potential weaknesses of journals' assessments of manuscripts, in the context of the whole scholarly communication system.48 By failing to do this, critics of the operation of scholarly journals typically exaggerate the deleterious effects of the uncertainty and ambiguity inherent in both the performance and evaluation of research.

Notes and references

1. We thank Professor Bakanic for furnishing this information.
2. V. BAKANIC, C. MCPHAIL, R. J. SIMON, The manuscript review and decision-making process, American Sociological Review, 52 (1987) 631.
3. G. J. WHITEHURST, Interrater agreement for journal manuscript reviews, American Psychologist, 39 (1984) 22.
4. S. LOCK, A Delicate Balance: Editorial Peer Review in Medicine, Philadelphia, Penn., ISI Press, 1986.
5. E. GARFIELD, Refereeing and peer review. Part 1. Opinion and conjecture on the effectiveness of refereeing, Current Contents, (1986) No. 31, 3.
6. E. GARFIELD, Refereeing and peer review. Part 2. The research on refereeing and alternatives to the present system, Current Contents, (1986) No. 32, 3.
7. D. P. PETERS, S. J. CECI, Peer review practices of psychological journals: The fate of published articles, submitted again, Behavioural and Brain Sciences, 5 (1982) 187. Also see the commentary on this paper in the same issue of Behavioural and Brain Sciences.
8. P. MCREYNOLDS, Reliability of ratings of research papers, American Psychologist, 26 (1971) 400.
9. C. M. BOWEN, R. PERLOFF, J. JACOBY, Improving manuscript evaluation procedures, American Psychologist, 27 (1972) 22.
10. S. COLE, J. R. COLE, G. SIMON, Chance and consensus in peer review, Science, 214 (1981) 881.
11. D. KLAHR, Insiders, outsiders and efficiency in a National Science Foundation panel, American Psychologist, 40 (1985) 148.
12. It is worth noting that journals may employ peer-review procedures for reasons beyond their role in helping editors select manuscripts for publication. For example, journals' use of peer-review procedures sometimes symbolizes aspirations to scholarly respectability, and the comments referees make about the papers they review are commonly forwarded to authors for use in modifying their papers. Analyses of peer review commonly neglect these additional functions and tend to focus on measuring referee agreement and determining the effects of low levels of agreement. In general, the psychometric perspective provides concepts suited for analyzing these latter issues, and even though one can question some of its basic assumptions, it is popular because it can show quantitative implications of imperfect referee assessments.
13. Of course, referees' evaluations may show agreement because they reflect other variables besides merit; "reliability is not the same as validity." To our knowledge, researchers have reported only two studies of the association between scientists' evaluations of papers and independent indicators of those papers' merit. Small reported data on original referees' assessments of a sample of highly cited papers in chemistry, which showed a nonsignificant correlation between the referee assessments and the subsequent citation levels (see H. G. SMALL, Characteristics of Frequently Cited Papers in Chemistry, Final Report on NSF Contract NSF-C795, Philadelphia, 1974). In contrast, Gottfredson found more substantial positive correlations between citations to published psychology papers and overall judgments of those papers' quality and impact made by experts nominated by the papers' authors (see S. D. GOTTFREDSON, Evaluating psychological research reports: Dimensions, reliability, and correlates of quality judgments, American Psychologist, 33 (1978) 920).
14. M. J. MAHONEY, Scientist as Subject: The Psychological Imperative, Cambridge, Mass., Ballinger, 1976.
15. D. LINDSEY, The Scientific Publication System in Social Science, San Francisco, Jossey-Bass, 1978.
16. L. L. HARGENS, Scholarly consensus and journal rejection rates, American Sociological Review, 53 (1988) 139.
17. D. LINDSEY, Assessing precision in the manuscript review process: A little better than chance, Scientometrics, 14 (1988) 75.
18. See, for example, H. M. BLALOCK, JR., Social Statistics, New York, McGraw-Hill, 1979.
19. A. W. WARD, B. W. HALL, C. F. SCHRAM, Evaluation of published educational research: A national study, American Educational Research Journal, 12 (1975) 109.
20. An exception to this generalization is Peters and Ceci's study of the reevaluation of previously published papers. Although they apparently studied a very homogeneous sample of papers, referees never disagreed in their evaluations! However, the Peters and Ceci data pertain to only nine papers.
21. Unrepresentatively homogeneous samples of papers are also produced when editors summarily reject a large proportion of submissions. To the extent that editors screen out manuscripts that referees would judge to be of poor quality, studies based on the remaining papers that receive referee evaluations will tend to show low levels of agreement between referees. High-prestige multidisciplinary journals, high-prestige medical journals, and social science journals are most likely to exhibit high summary rejection rates. See M. D. GORDON, A Study of the Evaluation of Research Papers by Primary Journals in the U.K., Leicester, England: Primary Communications Research Center, University of Leicester, 1978.
22. W. A. SCOTT, Interreferee agreement on some characteristics of manuscripts submitted to the Journal of Personality and Social Psychology, American Psychologist, 29 (1974) 698.
23. L. L. HARGENS, J. R. HERTING, A new approach to referees' assessments of manuscripts, Social Science Research (forthcoming).
24. Lindsey, op. cit. reference 17 above, suggests that they are likely to be lower, but seems to base his judgment on results from the numerous studies that have been subject to truncated variation rather than those studies that have been based on more representative samples of manuscripts.
25. H. L. ROEDIGER III, The role of journal editors in the scientific process, in D. N. JACKSON, J. P. RUSHTON (Eds), Scientific Excellence: Origins and Assessment, Beverly Hills, CA: Sage, 1987, 222.
26. B. C. GRIFFITH, Judging document content versus social functions of refereeing: Possible and impossible tasks, Behavioural and Brain Sciences, 5 (1982) 214.
27. T. SARACEVIC, Relevance: A review of and framework for thinking on the notion in information science, Journal of the American Society for Information Science, 26 (1975) 321.
28. J. C. NUNNALLY, Psychometric Theory, New York, McGraw-Hill, 1967.
29. H. E. A. TINSLEY, D. J. WEISS, Interrater reliability and agreement of subjective judgments, Journal of Counselling Psychology, 22 (1975) 358.
30. A. L. STINCHCOMBE, R. OFSHE, On journal editing as a probabilistic process, American Sociologist, 5 (1969) 19.
31. Hargens (footnote 12 in op. cit. in reference 16 above) also made this error.
32. Op. cit. reference 15, p. 37. Op. cit. reference 17, p. 78.
33. This is because the probability of acceptance for the top two percent of manuscripts equals 0.86 and the probability for the next 14 percent equals 0.53. Thus, the overall proportion of papers in these two levels that are accepted is [(0.14 × 0.53) + (0.02 × 0.86)]/0.16 = 0.57.
34. These results illustrate the point that measures with low reliability, and therefore low validity, can be valuable when selection ratios are low and there is substantial variation among cases being evaluated (see L. J. CRONBACH, Essentials of Psychological Testing (3rd Ed.), New York, Harper and Row, 1970). Lindsey and others have argued that referees' evaluations of manuscripts are more likely to be unreliable for behavioural-science journals than for natural-science journals, but the former are also more likely to exhibit the two conditions that enhance the practical value of even fairly unreliable evaluations. Highly selective and prestigious medical journals also exhibit these two conditions, and Lock, op. cit., estimates that the British Medical Journal accepts 80 percent of the top quality papers submitted to it.
35. The S-O model simply specifies that the evaluation of all manuscripts has a certain reliability and proceeds to derive certain outcomes of the evaluation process. Stinchcombe and Ofshe use the model as an "ideal type" to generate hypothetical predictions. In contrast, Lindsey's charges against the peer-review system in behavioural-science journals are based on using the S-O model as an empirically adequate description of the manuscript evaluation process in those journals.
36. Lindsey's recommendation that journals solicit the opinions of at least three initial referees for each submission also exaggerates the benefits of such a policy by neglecting the peer-review system used by journals. For most behavioural-science journals, which are the focus of Lindsey's discussion, a substantial minority of manuscripts (those receiving a split decision from the two initial referees and even some of those receiving two positive evaluations) already receive three referee evaluations. Using three initial referees for all papers will increase the reliability of the composite evaluations of only those papers that would receive two unfavourable evaluations under the current system. Unfortunately, using three initial referees for these papers would also slow down their evaluation, and authors appear to be more concerned about the speed of the journal review process than about the reliability of referees' evaluations (see Y. BRACKBILL, F. KORTEN, Journal reviewing practices: Authors' and APA members' suggestions for revision, American Psychologist, 27 (1972) 22). Using three initial referees might speed up the evaluation of the remaining manuscripts somewhat (because editors would not wait until the first two referees returned recommendations before soliciting the opinion of the third), but the fact that these constitute a minority of submissions would probably not allow the time savings experienced for them to counterbalance the longer lags experienced in evaluating the very large proportion of manuscripts that receive only two evaluations under the current system.
37. See Blalock, op. cit., p. 282-290.
38. Lindsey (op. cit. reference 17) reports a non-significant chi-squared value for a "quasi-independence" model applied to data from one of these journals, Personality and Social Psychology Bulletin. Unfortunately, Lindsey does not specify which model of quasi-independence he tested. We have been able to obtain the chi-squared value he reports only by (1) treating the Personality and Social Psychology Bulletin data in Lindsey's Table 2 as frequencies (they are actually percentages) and (2) constraining the model to reproduce the entries along the diagonal of Lindsey's Table 2 (some of these entries represent disagreement and others represent agreement). Thus, it is doubtful that Lindsey's analysis tested any meaningful hypothesis, much less the null hypothesis that referees' judgments are statistically independent.
39. L. A. GOODMAN, New methods for analyzing the intrinsic character of qualitative variables using cross-classified data, American Journal of Sociology, 93 (1987) 529.
40. C. C. CLOGG, Using association models in sociological research: Some examples, American Journal of Sociology, 88 (1982) 114.
41. See J. R. COLE, S. COLE, Which researcher will get the grant?, Nature, 279 (1979) 575-576, and Gordon, op. cit. These measures include various estimates of the proportion of the total variance in referees' assessments that is between- or within-manuscript (or proposal) variance.
42. See S. COLE, G. SIMON, J. R. COLE, Do journal rejection rates index consensus?, American Sociological Review, 53 (1988) 152, and L. L. HARGENS, Further evidence on field differences in consensus from the NSF peer review studies, American Sociological Review, 53 (1988) 157.
43. H. A. ZUCKERMAN, R. K. MERTON, Patterns of evaluation in science: institutionalization, structure and functions of the referee system, Minerva, 9 (1971) 66.
44. R. E. STEVENS, Characteristics of Subject Literatures, ACRL Monograph No. 6, Chicago, Association of College and Reference Libraries, 1953.
45. C. H. BROWN, Scientific Serials, ACRL Monograph No. 16, Chicago, Association of College and Reference Libraries, 1956.
46. W. D. GARVEY, N. LIN, C. E. NELSON, Some comparisons of communication activities in the physical and social sciences, in: C. E. NELSON, D. K. POLLOCK (Eds), Communication among Scientists and Engineers, Lexington, Mass.: Heath, 1970, p. 61.
47. Op. cit., reference 16. One reason that studies of referee reliability are relatively rare for physical-science journals is that such journals often use the single initial referee system. Thus, data on pairs of referee assessments of all submissions are unavailable for these journals. Those manuscripts that do receive at least two independent referee evaluations under this system are an unrepresentative subset of all manuscripts. Thus, nonexperimental data on referee agreement for these journals, such as the evidence reported by Zuckerman and Merton, should be viewed with caution.
48. W. D. GARVEY, Communication: The Essence of Science, Oxford, Pergamon, 1979.