Guest post by Bruce Reed, Deputy Director of the NIH Center for Scientific Review, originally released on the Review Matters blog
Over the past several years we have heard consistent concerns about the complexity of review criteria and the administrative load of peer review. CSR shares the concern that the current set of criteria has the unintended consequence of dividing reviewer attention among too many questions, thus reducing focus on scientific merit and increasing reviewer burden. Each element was intended to make review better, but we worry that the cumulative whole may in fact distract from the main goal of review: to get input from experts on the scientific and technical merit of the proposed work.
To address these concerns, CSR has convened a working group of our advisory council, charged with recommending changes to research project grant review criteria that will improve review outcomes and reduce reviewer burden. The group is co-chaired by Tonya Palermo and me, and includes some of our council members, other members of the scientific community, and the NIH Review Policy Officer from the Office of Extramural Research.
We would like to hear your thoughts on the issue. How might review criteria be modified to obtain the best evaluations of scientific merit? You can provide feedback directly to me at [email protected], to [email protected], or to any member of the working group. Before you fire off that email, though, read on.
First, be aware that current criteria derive from multiple regulations; changes that conform to them well are more feasible than those that don’t. The Code of Federal Regulations (42 C.F.R. Part 52h.8) requires that research project applications be evaluated based on significance, investigators, innovation, approach, and environment. Protections for humans, animals, and the environment, adequacy of inclusion plans, and budget must be evaluated. The “21st Century Cures” Act (Public Law 114-255) requires attention to rigor and reproducibility and aspects of clinical trials. That said, there is room for improved implementation.
Second, consider how simplified criteria might also help address some of the issues below:
- Multiple studies show that reviewer ratings of Approach carry the most (perhaps too much) weight in determining overall impact scores. Yet, aspects of rigor and reproducibility are too often inadequately evaluated. Can better criteria help?
- Review is often criticized as being risk-averse, as too conservative. If you agree, how might revised criteria help?
- How can criteria be defined to give the applications of all investigators, regardless of their race, ethnicity, gender, career stage, or setting, a fair hearing on a level playing field?
Third, focus on the criteria for R01s. The criteria for training grants (F’s, K’s, T’s) and for SBIR/STTR grants are different. Addressing criteria for R01s would be a great start.
Finally, please be patient. Getting from good ideas to a revised set of criteria is a complex, multi-level process that will include NIH’s Office of Extramural Research, eRA, NIH Institutes and Centers, Office of the General Counsel, and other relevant stakeholders. This is a preliminary effort to get your input on what changes we should think about. Were we to propose regulatory changes, we would ask for additional public input. We are starting a conversation. Share your ideas.
Members of the CSR Advisory Council Working Group
Co-Chairs
Palermo, Tonya M., Ph.D.
Professor, Department of Anesthesiology and Pain Medicine
Principal Investigator and Associate Director, Center for Child Health, Behavior and Development
Seattle Children’s Research Institute
Reed, Bruce, Ph.D.
Deputy Director
Center for Scientific Review
National Institutes of Health
Members
Amero, Sally, Ph.D.
Review Policy Officer
Office of Extramural Research
National Institutes of Health
Corbett, Kevin D., Ph.D.
Associate Professor
Department of Cellular and Molecular Medicine
University of California, San Diego, School of Medicine
Gao, Jinming, Ph.D.
Professor of Oncology, Pharmacology, and Otolaryngology
Co-Leader, Cell Stress and Nanomedicine Program
Simmons Comprehensive Cancer Center
UT Southwestern Medical Center
George, Alfred L., M.D.
Chair, Department of Pharmacology
Director, Center for Pharmacogenomics
Magerstadt Professor of Pharmacology
Northwestern University
Hurd, Yasmin L., Ph.D.
Professor, Departments of Psychiatry,
Neuroscience, and Pharmacological Sciences
Director, Addiction Institute of Mount Sinai
Icahn School of Medicine at Mount Sinai
Janelsins-Benton, Michelle C., Ph.D.
Associate Professor
Departments of Surgery, Neuroscience, and Radiation Oncology
University of Rochester Medical Center
King-Casas, Brooks, Ph.D.
Associate Professor
Fralin Biomedical Research Institute
Department of Psychology, Virginia Tech
Kroetz, Deanna L., Ph.D.
Professor, Department of Bioengineering and Therapeutic Sciences
Director, Pharmaceutical Sciences and Pharmacogenomics Graduate Program
University of California, San Francisco
López, José A., M.D.
Professor of Medicine, Hematology
Member of Bloodworks NW Research Institute
Adjunct Professor, Biochemistry, Mechanical Engineering, and Pathology
University of Washington, School of Medicine
Appreciate this – I’m sure a lot of folks out there have views on the present review process. Looking forward to learning about these views, as well as the outcome: how they end up changing the present review process.
>Multiple studies show that reviewer ratings of Approach carry the most (perhaps too much) weight in determining overall impact scores. Yet, aspects of rigor and reproducibility are too often inadequately evaluated. Can better criteria help?
Rigor must be a part of the approach. The issue here is that rigor is ill-defined, and some funded investigators who sit on study sections do sexy but not necessarily rigorous research. I have personally seen very little discussion of rigor in study sections – but when it is raised, it is hard to argue with when put nicely. I think SROs need to shift the focus of how they select people for a study section: focus on people not just with expertise but with rigor in their research. This will actually require SROs to start understanding not only whether a person has expertise but whether his/her research is rigorous. Does NIH staff know what rigorous research is? Perhaps a test on this could be useful, :-). Start change within as you ask us to change!
For example, SROs could ask the potential reviewers on their study section to provide a description of what they consider rigorous and non-rigorous research, and then perhaps give examples of each. Then evaluate whether these are good examples – perhaps working with some other faculty or (maybe) philosophers!
Some of what I look for as examples of rigorous research: the design of (some) experiments should be better specified (e.g., proper controls with a clear understanding of what the controls will control for); formulating only one hypothesis is low rigor and invites bias.
>Review is often criticized as being risk-averse, as too conservative. If you agree, how might revised criteria help?
It is all about the money. Perhaps increasing the budget for R21s would help. If the R21 budget could be larger (perhaps up to $250k/year) and the duration longer than the current 2 years, perhaps even on par with R01s, more studies of larger magnitude but higher risk could be allowed, and people would apply more to R21s. A study section could recommend the R21 rather than the R01 mechanism for “risky” proposals in the R01 category. And with more R21 applications, perhaps R21s could be percentiled too. Within an R01, when $2M of funding is requested, it is hard not to be conservative about the proposed work. (Even if the budget is not a consideration for review, reviewers know the budget and may have an unconscious bias against some applications. I know that I probably do.)
A colleague of mine suggested having R21a and R21b types of applications: R21a would be “true” R21s with NO preliminary data and a limited budget and duration, while R21b could have a larger budget and longer duration but would require preliminary data (still in the spirit of high potential and high risk for the proposal to move the field forward).
>How can criteria be defined to give the applications of all investigators, regardless of their race, ethnicity, gender, career stage, or setting, fair hearing on a level playing field?
A colleague of mine said that over 45 years study sections have improved dramatically in terms of such biases, so progress has been made!
I do know about the recent “bias” data, but I don’t believe that this bias comes from reviewers knowing the person’s background, race, etc. Rather, it is driven by applicants writing their proposals differently, and we evaluate them as such. Minority applicants are also less likely to get outside help, and hence the quality of the application suffers.
I think we should keep stating that people need to be mindful of their biases and try to control them. Talking openly about it is probably the first step. Second is to help with training – for those who want it – but only training that has been scientifically shown to reduce bias. I am not sure such training exists, though.
I wonder if there should be “diversity” advocate(s) allocated as a position on the study section, similar to the chair. For example, we have this when we hire new faculty: the diversity advocate’s job on the committee is to ask whether we are ignoring high-quality applicants because of race, sex, etc. Raising this question constantly does wonders, sometimes. This would require the diversity advocate to read more applications, perhaps those in the streamlined half – with the help of the SRO – to see if there was a potential for bias.
Improving the review process.
One of the most stochastic things I have noticed in review is which 3-4 reviewers a grant gets. This can be either great or bad for a given grant. How reviewers are assigned to each application is the least transparent part of the process. I wonder if it should be “random”. Or perhaps data need to be collected to understand how different reviewers review, and assignments balanced so that every applicant has an equal chance of being streamlined (triaged). It is so easy to get an 8 and be streamlined (and not rescued), so this must be researched and improved.
Scientific review of competitive, peer-reviewed NIH grants warrants a precise, meticulous, and neutral expert analysis of innovative proposals and grant concepts in the ever-expanding medical research arena!
With my experience in academics, medical research, and teaching spanning both the USA (primarily Texas, New York, and Nebraska) and my home country India (Lucknow, New Delhi, Udaipur), I have had the opportunity to conceptualize, plan, and competitively publish first/senior-authorship articles in leading peer-reviewed journals for the timely and global dissemination of emerging scientific concepts, with the eventual aim of developing, strategically and collaboratively, a cost-effective public health research model for significantly diminishing the growing burden of life-threatening diseases among susceptible, genetically disparate population subsets. In this context, I would like to provide my inputs regarding scientific review trends and grants management at NIH:
#1. Critically review, at the outset, the investigator’s and/or co-investigators’ curricula vitae, including first authorships and senior authorships in leading peer-reviewed journals in broad areas of biomedical sciences, life sciences, translational medicine, and public health. This would ensure stringent inclusion criteria for investigators submitting projects/grants for federal and/or state funding to conduct clinically impactful, innovative research in the USA and/or collaborative satellite centers globally.
#2. The next step should be relatively more stringent in evaluating the novelty of the research study: specific aims/objectives, materials and methods (including ethical human subjects research), the timeline of the proposed study, and a sound, public-health-oriented hypothetical model for drawing definitive conclusions at the end of the study. The broad clinical impact of the findings and/or the study rationale should be clearly specified by the investigator vying for federal grants to conduct scientifically ethical, plagiarism-free, high-quality research in the USA and/or collaborative centers globally.
#3. Scientific review scores should be stringently awarded to investigators who submit competitive grant proposals based on the long-term public health significance of the proposed study and a clear, crisp proposal concept, with cited references to at least 10 first/senior authorships demonstrating the capability, competence, independent thought process, and critical research skills needed to efficiently and strategically conduct the study.
#4. Subsequent telephone interviews and/or presentation/brainstorming sessions should be conducted at NIH grants offices by competitively inviting independent/senior investigators with the requisite critical research skills in medical sciences, as evidenced by at least 10 first/senior authorships; thereafter, letters of intent and reference letters should be scrutinized to stringently assess the authenticity of novel investigator-initiated grant proposals.
#5. Round-table meetings of thought leaders should then be organized, with a tally of scores for independent grant proposals; overlapping research concepts/proposals and unethical practices should be stringently considered before final grant scores are awarded, so that submitting investigators strongly adhere to a code of conduct for good research practice without plagiarizing or stealing concepts or pilot data sets from contemporaries while posing as innovative, independent investigators!
#6. Competitively evaluated grant proposals and concept notes, together with the investigator’s curriculum vitae documenting first/senior authorship on at least 10 peer-reviewed publications, should be scientifically reviewed by NIH grant experts in the final phase, with an initial 6-month allocation of federal funds; once the funded project’s six-month progress report is submitted to the scientific review offices, the next six months of funds should be allocated to the principal investigator’s study site, so as to stringently ensure that federal funds are being successfully and ethically utilized by the investigator to conduct the study, with timely first/senior-authored publications.
#7. I am amazed at the simplicity, yet critical scientific complexity, of the grants-medical research-public health field and constantly endeavor to professionally develop my expertise in the ever-expanding, not-for-profit, public-health-oriented grants management arena!
In a similar survey conducted several years ago, I raised an issue that was welcomed by many who responded, yet was ignored when a summary report appeared several months after the survey. As I do not see it mentioned in the solicited input, I will raise it yet again: the issue of maintaining the continuity of the review process. As we are all aware, many (if not most) proposals require a resubmission that contains an additional Introduction page (aka rebuttal). While there are reviewers who take into account how previous concerns were addressed in the resubmission, not all do, and thus the continuity and uniformity of the review process are not guaranteed. While this may not necessarily reduce the burden, it is essential for the fairness of the process and will eventually streamline it. It is my strong opinion that this should be a specific review criterion for any resubmission.
I have participated in the review process for about 30 years. My view is that the ‘old’ methodology (3 allowed cycles, long descriptions, all applications reviewed, a long discussion of each application, etc.) may have had several drawbacks, in particular from the CSR perspective and in cost. But it provided a much better barrier to reviewer bias, in either direction. With the current mechanisms, to prevent the ‘dumping’ of a potentially good application into the not-reviewed bin, it is essential to involve 4 reviewers and a criterion for dismissing an outlying score. Just to make clear what I mean, consider a 2, 7, 2 scoring. I would be suspicious of the value of the obvious outlier. Yet if this 3.67 average falls below the cutoff, the application goes straight to the not-reviewed bin. True, it may be resurrected by request, but this happens only a fraction of the time. With four scores, say 2, 3, 3, 7, the obvious outlier is not considered for the decision to include/exclude, and the outlier reviewer can explain his/her reason to the full peer session (if he/she so wishes).
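To make the arithmetic concrete, here is a minimal sketch in Python of the kind of rule this commenter proposes. The function name, the drop rule, and the example scores are illustrative assumptions, not NIH policy: with four preliminary scores, the single worst (highest) score is set aside before computing the average used for the discuss/not-discuss decision.

```python
def preliminary_average(scores, drop_worst_outlier=True):
    """Average preliminary scores (NIH scale: 1 = best, 9 = worst).

    If four or more scores are available and drop_worst_outlier is True,
    the single highest (worst) score is set aside before averaging,
    so one outlier cannot by itself push an application into triage.
    """
    kept = sorted(scores)
    if drop_worst_outlier and len(kept) >= 4:
        kept = kept[:-1]  # discard the single worst score
    return sum(kept) / len(kept)

# Three reviewers: the lone 7 drags the mean to ~3.67, likely 'Not Discussed'.
print(round(preliminary_average([2, 7, 2], drop_worst_outlier=False), 2))  # 3.67

# Four reviewers with the worst score dropped: the outlier no longer decides triage.
print(round(preliminary_average([2, 3, 3, 7]), 2))  # 2.67
```

The point of the sketch is simply that the fourth reviewer plus an explicit outlier rule changes which side of the discussion cutoff the application lands on, while the dissenting reviewer can still argue their case at the meeting.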
I read a comment which seems to lament the fact that for most reviewers the primary overall scoring criterion is their scoring of the Approach. Duh! The peer reviewers are scientists. The day other criteria override the approach as the primary criterion for scientific review is the day the review process is no longer scientific in nature. Rigor is good, but it is valuable only for valuable science.
By the way, why not leave the decision about publishing commenters’ email addresses to the commenters themselves?
Re: rigor — this seems to be the latest trend to focus on, and it risks derailing the most important part of the process: improving the quality of peer review, which is WAY too conservative. Was the proposal on PCR written with “rigor”? Probably not, but it changed our lives. It sounds like another term invented by an admin person who does not do real science, and it will make all of us waste tons of time justifying the great rigor of our science just to make someone happy. What really matters is that our science seeks (and has) impact.
Re: conservative reviews — that is the prerogative of the CSR! The CSR needs to change the culture of the reviewers and mandate riskier proposals. A senior investigator once told me, when I started, that he never wrote an R01 until he “had already climbed the mountain” (in his lab) and could propose to reviewers how he intended to climb it. He was very successful indeed at getting R01s, but I don’t find his science very respectable or particularly imaginative. Either I’m wrong, or it’s people like him who dominate the study sections. Most ground-breaking work is done by young professors who have access to riskier funding, but after a few years, that funding pot is no longer accessible to them.
Re: bias/discrimination — I have read that it happens but it is beyond my dignity to think any less or more of anybody because of their gender, race, etc., and I think that’s the norm among my colleagues, so I’m not sure how on Earth this bias enters the system …
If memory serves, the scientific community in Australia performed a study on the repeatability with which study sections fund the best research from a fixed set of proposals. As I recall, several study sections reviewed the same set of proposals and made their funding recommendations. The top 1 or 2 grants in the pile were funded by each study section; however, decisions to fund grants below the top were seemingly random. Thus, study sections can make reproducible decisions only up to a point and no further. This suggests our current selection procedure may be a large expenditure of unnecessary effort.
If study sections are, in fact, unable to reliably select intermediate meritorious proposals, then perhaps a change in their mission is in order. For example, can a study section reliably choose proposals that should not be funded, i.e., those with fatal flaws that should be triaged? Perhaps reviewers would present their arguments for the fatal flaws of the lower 30-50% of proposals. It would then be up to the study section to decide which flaws are fatal. The applicants would benefit because they would receive feedback about specific problems, rather than having to deal with the ambiguities of catch-all phrases like “too descriptive” or reviewers who like to rewrite grant proposals in their own image.
In addition to fatal flaw presentations, the study section would determine which are the top few meritorious proposals. The remainder would be selected at random from the non-triaged proposals; after all, the Australian study concluded that funding of intermediate proposals was a seemingly random process. Actually making the process random would minimize (even if only to a small degree) the possibility of bias for any specific proposal.
My suggestions:
(1) Replace the “significance” and “innovation” subscores with a single “potential for impact” section. The “potential for impact” section is appealing in a few regards. First, it captures significance and innovation together. Second, the word “potential” signals that riskiness is acceptable and not a thing to be avoided. Third, and most important, it recognizes that “innovation” is less important than “impact.” The problem is that many high-impact studies are not necessarily high in innovation (e.g., think nutrition studies or replication studies). If we emphasize innovation above impact, we may overfund studies of new proteins and genes, for example, over less innovative studies, such as more rigorous clinical trials of interventions that may actually save more lives.
(2) Consider lumping together “investigator” and “environment” subscores into a single category. Not sure exactly how to do this, or what to name this. Or make environment not part of the scored criteria.
I strongly agree with the comments by J. Mario Wolosin and others that it is important to have 4 reviewers for each proposal and to drop the highest score when calculating the average used to decide on streamlining. A score of 7-10 from only one reviewer is quite likely to kill a potentially good grant.
Given that certain highly creative proposals will never be ranked high enough for funding and certain so-so proposals will always be funded for various reasons, I suggest NIH change the review and award policy to the following: 1) The only function of the review process is to produce a list of fundable proposals, i.e., a proposal is stamped either “fundable” or “unfundable.” 2) NIH then, according to available funds, randomly selects a number of proposals from the fundable pool for awards. I think this review-and-award model can be backed up by statistical analysis in terms of fairness, effectiveness, and best use of limited resources.
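For concreteness, here is a small Python sketch of this two-stage model. The function name, the application tuples, the budget figure, and the uniform random draw are all invented for illustration; the only idea taken from the comment is that review labels proposals fundable or not, and awards are then drawn at random within the available budget.

```python
import random

def award_by_lottery(applications, budget, seed=None):
    """Two-stage model: review marks each application 'fundable' or not;
    awards are then drawn at random from the fundable pool until the
    available budget is exhausted. Applications are (name, fundable, cost)."""
    rng = random.Random(seed)
    pool = [a for a in applications if a[1]]  # keep only fundable proposals
    rng.shuffle(pool)                         # uniform random ordering
    awards, remaining = [], budget
    for name, _, cost in pool:
        if cost <= remaining:
            awards.append(name)
            remaining -= cost
    return awards

apps = [("A", True, 2.0), ("B", True, 1.5), ("C", False, 2.5), ("D", True, 2.0)]
print(award_by_lottery(apps, budget=3.5, seed=1))  # a random subset of fundable proposals within budget
```

Whether such a lottery is fairer than rank-ordering the middle of the pack is exactly the statistical question the commenter raises.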
The topic is review criteria, and I like the drift of the revisions discussed by the CSR Advisors earlier in 2020 and in September 2019. But the thread has broadened to encompass a more fundamental issue about scoring and ‘not discussed’ applications, with JMW drawing attention to something that often is clear at study section meetings. Namely, the ‘third reviewer’ often assigns an outlier score that drags a meritorious application’s mean into the ‘Not Discussed’ range.
At first blush, one might think the mean of the initial scores should be based solely on Rev1 and Rev2, but it is not always Rev3 who provides the outlier worst score.
The problem is reliance on the arithmetic mean.
Here, CSR can do much better than its current reliance on the arithmetic mean of a score distribution based on as few as 3 initial scores and rarely more than 5. Undergraduates taking statistics courses learn that the arithmetic mean can be a flawed indicator of central tendency even when the number of scores is n>1000, depending on the shape of the distribution — let alone n=3 to 5 scores. (The current CSR approach ignores what we teach in the first statistics course.)
JMW proposes dropping the highest (worst) score. A simple alternative is to derive the variance as well as the mean, set X=mean and Y=variance, and assign units of study section discussion time as a function of zones in that Cartesian space. Clearly, a mean of 1 with zero variance should require very little discussion time before final scoring, possibly just 1 minute. Indeed, the same might be true for any zero-variance summary score, irrespective of the mean. At the meeting, a 1-minute discussion timer might be set so as to allow a reviewer to ‘rescue’ applications that address emergency problems. Consider, for example, applications that propose the SARS-CoV-2 antigen and antibody test solutions we need now, where the approaches have features of “as good as it’s going to get any time soon, but not yet ideal or perfect.” For an initial zero-variance score distribution of 3-3-3 and a typical ‘Not Discussed’ outcome, it would take 1 minute for one articulate reviewer to make a case in its favor, prior to final scoring by all study section members. The current CSR system makes no allowance for quick generation of final scores for _all_ applications, and the end result is that institute program officials are not allowed to put any of the ‘Not Discussed’ applications on their roster of proposed awards (because these applications have no final score).
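As a minimal sketch of this mean-and-variance idea in Python (the zone boundaries and minute allotments below are invented for illustration; they are not the commenter’s or CSR’s numbers): each application’s preliminary scores are reduced to a (mean, variance) pair, and discussion time is allotted as a function of where that pair falls, so that an agreed-upon strong or borderline application gets a short confirmatory slot and a final score rather than no discussion at all.

```python
from statistics import mean, pvariance

def discussion_minutes(scores, strong=3.0, spread=1.0):
    """Map preliminary scores (1 = best, 9 = worst) to discussion time.

    Zones (illustrative thresholds only):
      - low variance, low mean: reviewers agree it is strong -> 1 minute to confirm
      - low variance, higher mean: reviewers agree it is weaker -> short slot, still scored
      - high variance: reviewers disagree -> full discussion regardless of mean
    """
    m, v = mean(scores), pvariance(scores)
    if v > spread:
        return 10  # disagreement: full discussion
    if m <= strong:
        return 1   # agreed strong (or the 3-3-3 'rescue' case): quick confirmation
    return 3       # agreed weaker: brief slot, but every application gets a final score

print(discussion_minutes([1, 1, 1]))  # 1  (zero variance, excellent mean)
print(discussion_minutes([3, 3, 3]))  # 1  (zero variance; the one-minute rescue slot)
print(discussion_minutes([2, 2, 7]))  # 10 (outlier inflates variance -> discuss fully)
```

The design point is that the variance, not just the mean, decides whether the panel needs to spend time resolving disagreement, while every application still ends the meeting with a final score.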
Revising the review criteria is a meritorious activity, but I suspect it is working around the edges of the central problem created when ‘triaging’ and ‘Not Discussed’ outcomes, based on the flawed arithmetic mean, were substituted for the previous, more thoughtful extramural review approach of allowing a very short discussion of every application, including the ‘fatally flawed’ ones, and specifying very, very short discussions when an application received virtually perfect initial scores with zero variance.
(Because this blog post was about simplifying the review criteria, I’m going to reiterate that I appreciate the CSR discussions and presentations for its Board of Advisors in Sept 2019 and January 2020, and I think the new ideas are worth trying. But I am concerned that this initiative is just working around the edges of a more fundamental problem that cannot be fixed by changing the review criteria, if the CSR retains the flawed approach of triaging based entirely on the arithmetic mean of initial scores with no attention given to the issue of the variance of the score distribution. Especially when Rev1 and Rev2 both give scores in the 1-3 range, the process of throwing that application into the ‘Not Discussed’ and unscored bin has a clear flaw of much more central importance than any simplification of review criteria.)