“As the scientific community continues to evolve, it is essential to leverage the latest technologies to improve and streamline the peer-review process. One such technology that shows great promise is artificial intelligence (AI). AI-based peer review has the potential to make the process more efficient, accurate, and impartial, ultimately leading to better quality research.”
We suspect many of you were not fooled into thinking we wrote that statement. A well-known AI tool wrote those words after we prompted it to discuss using AI in the peer review process. More and more, we are hearing stories about researchers using these tools when reviewing others’ applications, and even when writing their own.
Even if AI tools may have “great promise,” do we allow their use?
Reviewers are trusted and required to maintain confidentiality throughout the application review process. Thus, using AI to assist in peer review would involve a breach of confidentiality. In a recently released guide notice, we explain that NIH scientific peer reviewers are prohibited from using natural language processors, large language models, or other generative AI technologies for analyzing and formulating peer review critiques for grant applications and R&D contract proposals.
Reviewers have long been required to certify and sign an agreement that says they will not share applications, proposals, or meeting materials with anyone who has not been officially designated to participate in the peer review process. Yes, this includes websites, apps, and other AI platforms. As part of standard pre-meeting certifications, all NIH peer reviewers and members of NIH advisory councils will be required to certify a modified Security, Confidentiality and Nondisclosure Agreement, signifying that they fully understand and will comply with the confidential nature of the review process, including complete abstention from using artificial intelligence tools in analyzing and critiquing NIH grant applications and contract proposals.
In other words, grant application and contract proposal materials and other privileged information cannot be shared or disseminated through any means or entity. Let’s explore this issue further with some hypothetical examples.
After agreeing to serve, Dr. Hal was assigned several grant applications to review. Hal has plenty of experience writing grant applications and knows how much time and effort they require. Even so, they felt daunted trying to give each of these applications the attention and careful review it deserved.
So, they turned to AI. They rationalized that it would provide an unbiased assessment of the proposed research, because it could pull from numerous references and resources fairly quickly and distill the relevant information. And, to top it off, Hal even found a platform trained on publicly available biomedical research publications and NIH-funded grants.
Not seeing a problem, Hal fed the relevant information from the applications into the AI. Moments later, it gave an assessment, which Hal used as a starting point for their critique.
Here is another scenario:
Dr. Smith just finished reading what seemed like far too many grant applications. Exhausted as they may be, their job as an NIH peer reviewer is not done until the critiques are written. Tired, a bit hungry, and ready to just get home, they wonder whether one of these new AI chatbots might be able to help. They rationalized that it would only be used to create a first draft, and that they would go back and clean up the critique before submitting.
Smith copied the abstract, specific aims, and research strategy sections of the applications and uploaded them to a publicly available AI system that is widely used by many people for many different purposes.
A few minutes later, Ta-Da! There was some narrative that could be used for the first draft. Getting those initial critique drafts going saved hours of time.
To be clear, neither of these situations is allowed. Everybody involved with the NIH peer review process shares responsibility for maintaining and upholding the integrity of review. A breach of confidentiality such as those described above could lead to termination of a peer reviewer’s service, referral for government-wide suspension or debarment, and possibly criminal or civil action, depending on the severity. NIH guide notice NOT-OD-22-044, our Integrity and Confidentiality in NIH Peer Review page, and this NIH All About Grants podcast episode explain more.
Ensuring confidentiality means that scientists will feel comfortable sharing their candid, well-designed, and thorough research ideas with us. Generative AI tools need to be fed substantial, privileged, and detailed information to develop a peer review critique of a specific grant application. Moreover, there is no guarantee about where AI tools send, save, view, or use grant application, contract proposal, or critique data at any time. Thus, using them absolutely violates our peer review confidentiality expectations.
NIH peer reviewers are selected and invited to review applications and proposals specifically for their individual expertise and opinion. The data that generative AI tools are trained on is limited to what already exists, what has been widely published, and what opinions have been written for posterity. Biases are built into this data; the originality of thought that NIH values is lost and homogenized in this process, and relying on it may even constitute plagiarism.
We take this issue seriously. Applicants are trusting us to protect their proprietary, sensitive, and confidential ideas from being given to others who do not have a need to know. In order to maintain this trust and keep the research process moving forward, reviewers are not allowed to share these applications with anybody or any entity.
Circling back to the beginning for a moment, we wanted to say a few words about using AI in writing one’s application. We do not know, or ask, who wrote an application. It could have come from the principal investigator, a postdoc, a professional grant writer, or the wider research team. If you use an AI tool to help write your application, you do so at your own risk.
This is because when we receive a grant application, our understanding is that it is the original idea proposed by the institution and their affiliated research team. Using AI tools may introduce several concerns related to research misconduct, such as plagiarized text from someone else’s work or fabricated citations. If we identify plagiarized, falsified, or fabricated information in a grant write-up, we will take appropriate actions to address the non-compliance. In our example above, we ran the AI-generated text through a well-known online tool, which did not detect any plagiarism. Though we included it here for illustrative purposes, you should always be mindful of these concerns when putting together your application.
How does this apply to using tools like Grammarly, AI tools that integrate into one’s computer operating system to improve clarity of writing? Similarly, does writing a critique in Google Docs or online versions of Word, which many of us use for simplicity of backup and access to our work between home and office, count as sharing confidential information with a third party? What about Dropbox? This policy somehow doesn’t seem well thought-out.
Agree.
Turnitin.
Grammarly.
Plagiarism Checker X.
Scribbr.
Quetext.
CopyLeaks.
Local data is not any more secure than cloud data. There is no such thing as “off network” anymore, unless you’re in a shielded undersea or underground bunker with its own power supply and security with an energetic defense system, and even then the software has been usurped so many times it’s hard to count. Oh… and you can’t have Windows… or iOS… so… lol, no wonder all these “off network” people are successfully targeted.
With AI, we are well positioned to significantly enhance our peer-review systems, creating more comprehensive, objective, fair, and inclusive analysis of science proposals and science outputs. One appreciates the reasons for this early approach taken by the NIH. However, this is no longer required or appropriate. We should look forward to a revised policy. With the CoARA Ethics and Research Integrity Policy in Responsible Research Assessment for Data and Artificial Intelligence (ERIP) Working Group, we are developing a more positive approach to the use of AI in academia and science.
Thanks for sharing these thoughts and developing a clear policy. An in-depth exploration of the impact of using generative AI on the scholarly peer review system, including concerns about confidentiality, was discussed in an article published in May 2023 in the journal Research Integrity and Peer Review: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10191680/
This ignores the fact that many AI models can be run entirely locally, with no online or other non-local component. The blanket statement put forth in this article is inconsistent with the facts. AI language models are not inherently a breach of confidentiality (provided that they are run locally), and a blanket statement that they are is false.
Indeed, the medical center where I work has similarly banned the use of off-the-shelf LLMs for handling protected health information (PHI – medical data involving identifiable individuals) BUT has obtained some sort of commercial package that runs locally and, in the opinion of local IT people and attorneys, should not violate confidentiality. At the same time, this package costs enough that they are restricting access to it for financial reasons, and it might be prudent to wait a while to see if the package is truly un-hackable.
I came here to say this. Self-hosted LLMs are still usable without breaching confidentiality.
I don’t doubt there are services out there that offer an option to submit text for summarization without it being incorporated into the rest of the data used to train the model.
How about using a locally hosted LLM purely for language-related tasks (not content-related), such as summarizing, outlining, correcting grammar errors, etc.?
The policy is clearly based on good intentions, but the rationale for the policy is seriously flawed. If confidentiality is the primary concern, then the policy authors should know that web interfaces running in the cloud are not the only mechanism for using LLMs. A lightweight LLM fine-tuned on QA and summarization tasks can run locally on commodity hardware. There is no breach of confidentiality. Ultimately, it boils down to the ethical standards of the reviewer.
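For illustration only, here is a minimal sketch of what “running locally” can mean in practice. It assumes the Hugging Face transformers library and a small, openly downloadable summarization model; the model name and the input file are placeholders, not anything NIH has endorsed. Once the weights are cached, no text leaves the machine:

# Minimal sketch: summarization with a locally cached model, no cloud service.
# The model name and the input file below are illustrative placeholders.
from transformers import pipeline

# The first run fetches the model weights; afterwards inference is fully local.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

with open("specific_aims.txt") as f:  # hypothetical local file
    text = f.read()

# No network call happens here; the model runs on local hardware.
# truncation=True keeps overly long input within the model's context window.
result = summarizer(text, max_length=200, min_length=60,
                    do_sample=False, truncation=True)
print(result[0]["summary_text"])

Building the pipeline is the only step that touches the network, and only to download the weights the first time; pointing it at an already-downloaded model directory should remove even that.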
We use a variety of tools for formatting text, generating pictures, and analyzing and visualizing data. We use spell checkers and Grammarly-like tools.
What is the rationale for not using advanced AI as a copy-editing tool? There are many scientists with brilliant ideas who don’t have the same skills as others to articulate them in English. Generative AI can significantly help improve our productivity. It would be a regressive policy to prohibit its use entirely.
The ultimate goal should be to advance knowledge and discovery.
I totally agree with you
This important statement focuses on confidentiality as the major ethical breach if a reviewer uses AI to develop their critique. However, just as problematic is their act of criminal fraud, because the reviewer using AI has represented their critique as an evaluation that they themselves alone developed and wrote, when in fact they did not, whether in part or in whole. There is already an existing problem of grant applicants trying to respond to invisible reviewers, when reviewers violate confidentiality and confer with their colleagues while reviewing applications. With AI violations, not only will there be invisible reviewers, but they will also not even be human. All said, AI fraud will be an impossible trespass for the NIH and other funding agencies to police and prevent. Perhaps NIH should run all grant applications through the same AI analysis, followed by a funding lottery for applications that score in the top half of AI ratings. A pilot investigation of such an approach, compared to the same grants reviewed by a cohort of standard NIH reviewers sequestered with no access to AI tools, might show that it works just as well (or just as poorly, for “half-empty” attitudes) as the pre-AI NIH grant review process.
As several others have pointed out, this policy seems poorly thought out and premature. I’ve discussed this very topic in my role on the publications board of the leading professional organization in my discipline. Our conclusion was that confidentiality issues can be avoided if using a private instance of an AI model–in that case, no information is shared with a third party. They could also be avoided in principle by adding a confidentiality clause in a legal agreement with an AI provider; similar agreements allow educational institutions to store and compute grades with Google Sheets, for example.
There are good reasons to be cautious when applying AI tools in a reviewing context: confidentiality needs to be maintained, and the end product must ultimately be based on the expertise of reviewers, who must stand behind the work. But a blanket prohibition is premature. I suggest going back to the drawing board and developing a more informed and nuanced policy.
As others have pointed out, this policy rationale makes little sense, regardless of one’s thoughts on whether AI should or shouldn’t be used in peer review. Its flaw is evident in the sentences: “Reviewers are trusted and required to maintain confidentiality throughout the application review process. Thus, using AI to assist in peer review would involve a breach of confidentiality.” The word “thus” is inappropriate and unjustified. Imagine writing, “Pedestrians are required to cross roads at crosswalks. Thus, using AI for navigation would involve not crossing at crosswalks.” The word “thus” does not actually explain why AI would cause one to cross in the middle of the block; likewise, the word “thus” in the NIH post does not explain why AI necessarily involves a loss of confidentiality. Presumably the authors mean non-local AI services, but if so, that should be stated. Or do the authors mean ideas that come from unsourced others? (If that’s the case, why is confidentiality the issue?)
Your statements such as “reviewers are trusted” and “applicants are trusting us to protect their proprietary, sensitive, and confidential ideas …” seem arbitrary. Many CSR reviewers are biased and have conflicts of interest; on what basis should they be trusted? And applicants trust CSR, or you, to give them fair reviews, not the biased or flawed reviews they receive in many cases. Also, the NIH application package includes a page asking the applicant to indicate whether any proprietary, sensitive, or confidential information is included. The majority of NIH applicants indicate none. So what are you really protecting?
Limited, careful use of AI chatbots can improve the quality of critiques and review discussions. In some cases, I’ve found it helpful to search online for background on a relevant concept or technology to provide a better critique. An AI chatbot can assist as a modern search engine to give the reviewer a head start when necessary, without disclosing much information on the platform. This is especially helpful for multi-disciplinary applications where a single reviewer might not be completely familiar with all disciplines involved. Perhaps some use cases can be allowed, and as part of pre-review meetings, SROs can outline allowable instances.
I would suggest checking the policies of journals such as JAMA, which seem much better thought through than this proposed ban. If the goal is confidentiality, then one should require confidentiality, not ban AI. It is fine to warn reviewers that if they aren’t careful, they may be violating confidentiality by using certain online tools, but a blanket ban is uncalled for. I hope reviewers will be running their reviews through AI to improve style, clarity, and argument strength (while, of course, respecting confidentiality).
Full disclosure: although I wrote this post, it was edited to reflect suggestions made by Grammarly, a widely used AI system.
Peer review is assessment by your competitors, each of whom has a conflict of interest. Chat has no stake in its reviews. Peer review is intrinsically unethical. Chat review is not.
I hope the team can incorporate the feedback here, as the intention is good but the policy and headline, as written, are problematic. They further don’t reflect the changing reality: LLMs can now be run on private laptops, as others have pointed out. But I would say that we are moving as a society toward assistive technology, and statements that bar it outright are going to be shortsighted and limited.
I would recommend that a more thought-out statement might remind people not to disclose confidential content to cloud-based systems (like ChatGPT, which advises users not to share confidential information) and ensure that reviewers do not rely on LLM output to critique material for which they are not qualified or cannot accept full responsibility. Those seeing patients in the clinic may use NLP and voice transcription technology (with proper confidentiality safeguards) to write notes (and our current EHRs are looking further at how to embed advanced LLMs into clinical work), but at the end of the day the clinician must accept responsibility for the accuracy of the work.
It reminds me of an article (https://pubmed.ncbi.nlm.nih.gov/21502653/) that we wrote a while ago about social media, taking the stand that there could be a professional way to use the technology appropriately, rather than either jumping all in or ignoring it completely. The technology is going to radically reshape our work and society, so it might be worth taking some critical looks at how it will help develop science for the better, and where we will be led astray.
I fully agree with the above comments. Confidentiality is easily maintained using local LLM instances, among other methods. And LLMs far better suited than OpenAI products are readily available, with new and better ones being produced almost weekly.
It feels as though the “confidentiality” concern is a pretextual way of not dealing with the inevitable. Why not embrace innovation (if not at the NIH, then where?) and define ways, short of plagiarism, in which LLMs can or even should be used? For example, summarization is one of the original strengths of transformers.
Having reviewed many grants that are sometimes poorly written and opaque to those not directly involved in the same research domain, one wonders if this rule should be flipped on its head: should applicants be REQUIRED to provide an LLM summary (edited by them, of course, for accuracy) of each section? Should at least the lay summary require LLM summarization, with a prompt like “The following is a scientific grant application. Summarize it for a college freshman without a science background”?
I think we can all agree that an autonomous review (or grant application) prepared solely or mostly by an LLM is inappropriate. But there are many uses that seem simply to ease the burden on reviewers (and submitters).
What about grantees or reviewers for whom English is not their first language? If the concern is bias, are we creating bias by NOT allowing them to participate, if LLMs could ease the way?
How about IRB review? In some institutions, that can delay the science by 6 or more months. Should LLM summarization be proscribed? Or encouraged to speed the process?
LLMs are tools like many others. In the end, the goal of NIH should be accelerating science and innovation for the benefit of the community. I recall my first grant being created on an IBM Selectric and carbon paper. Better tools have become available.
There is a distinct lack of rigor in the reasoning of this position. Others have already pointed out that the key argument of being a breach of confidentiality falls apart when you consider locally-hosted LLMs. But the damage is done because the imprecisely-worded commentary has muddied the waters for reviewers, SROs, and other key players in the peer review process. The actual guide notice at least includes the word “online” in the description, but the title still betrays a poor understanding of the issue. It’s not generative AI that’s the issue, it’s the potential for transmission of privileged information to unauthorized third parties, which is unique to *online* AIs. It’s surreal to see a clickbait title in an official NIH notice.
The larger concern I have is that NIH leadership seems to have adopted this poorly reasoned, Luddite-adjacent attitude of “AI bad” as official policy. Certainly these earliest incarnations of AI have limitations that warrant care in their use, but they have already shown tremendous promise for fundamentally improving many of the most time-consuming and inefficient aspects of modern science. Humanity at large will be swallowed in its own problems without new tools to address them. AI is among the most important of these new tools.
Rather than knee-jerk proscriptions on the peer review process, I am eager to see from NIH leadership a *vision* of how they intend to develop *new tools* to help reviewers deliver higher-quality, more helpful, less biased reviews in the face of ever-rising (and unavoidably AI-accelerated) numbers of submissions. NIH should be leading the charge, providing a secure, NIH-hosted LLM that reviewers can use to augment and expedite their reviews. LLMs and other forms of AI could be used in other aspects of the review process, including reviewer assignment and detection of reviewer bias (I’ll end by saying that the implication in the article that AIs are biased but peer reviewers are not was depressingly laughable).
I read all the comments posted so far, and yes, I am accustomed to using AI assistance to create language of greater clarity in a shorter time, thus being more productive. It is important that we stay aware of false information with generative AI and the possibility of confidentiality breaches. Our responsibility remains to minimize the chances of confidentiality breaches to zero whenever possible. But not using AI assistance is not the answer, and a global ban on anything that assists with this new technology would be an overreaction. As with any new technology, there will be adjustments over time, and things will sort out.
After reading the first few sentences, I thought, “This can be done locally, as there are numerous options for safe interfaces.” And doesn’t peer review without the use of AI pose some risk to breach of confidentiality, bias, and conflict of interest?
I can only agree with many of the comments: where there is a strong privacy concern, locally based generative AI and LLMs are absolutely the way to go, and they are advancing rapidly. However, the boundaries of what is private are very blurred. Today, privacy is much more about the security of cloud services than about keeping your sensitive data locally in perfect isolation. This website, company servers with sensitive information, the storage of organisations conducting peer review, healthcare data from hospitals, etc., likely sit on Amazon S3 cloud storage (or Azure or Google cloud storage), located many hundreds or thousands of kilometres from the organisation’s premises. So the concept of privacy in the old sense needs rethinking: it is more about the data being secure across different cloud services and in different locations. Times have changed. Confidential data is already shared across computing services; it needs to be secure. Using AI in a responsible manner, with ‘humans in the loop’, can vastly improve the quality of work for all by ensuring that study designs and other complex designs meet established best practice and guidelines, and that has to be a good thing.
This is discarding significant potential benefits as the technology advances. Who hasn’t had peer review comments along the lines of “the authors do not address X” when X was discussed in a section that the reviewer didn’t see or skimmed through? Or perhaps the language used to discuss X is slightly different in their subfield? Given the density of proposals (and their size, especially with clinical trial attachments), there is likely substantial benefit in allowing reviewers to “chat with” the application to check how something may have been addressed. Due to space constraints, methodological details are often discussed in the publication accompanying the preliminary data rather than at depth in the proposal. AI would, for example, allow attaching materials that reviewers are not mandated to read but that the LLM can be aware of when answering the reviewer’s questions.
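As a purely illustrative sketch of that idea, simple semantic search over the application text can run entirely on the reviewer’s own machine; the sentence-transformers model, the file name, and the sample question below are placeholders, not a sanctioned workflow. Once the model weights are cached, nothing is transmitted anywhere:

# Minimal sketch: find where a topic is addressed in a long application,
# using only a locally cached embedding model (no text generation, no cloud).
from sentence_transformers import SentenceTransformer, util

# Small local embedding model; weights are cached after the first download.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical local copy of the research strategy, split into paragraphs.
with open("research_strategy.txt") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

question = "Where does the applicant address sex as a biological variable?"

# Embed the paragraphs and the question, then rank by cosine similarity.
para_emb = model.encode(paragraphs, convert_to_tensor=True)
q_emb = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(q_emb, para_emb)[0]

# Print the three best-matching passages for the reviewer to inspect.
for i in scores.argsort(descending=True)[:3].tolist():
    print(f"score={scores[i].item():.2f}  {paragraphs[i][:120]}...")

Everything here is retrieval with a stock embedding model rather than generative critique; the reviewer still reads and judges the retrieved passages themselves.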
I’ll make some comments on this topic based on our study documented here: “ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing,” Liu and Shah, arxiv:2306.00622, June 1, 2023.
Current large language models (LLMs) can be useful as assistants to human reviewers:
– to check proposed methodology (in parallel to human checking)
– to verify checklists
Two notes of caution though:
– Even though one can run LLMs locally, the open-source ones that currently provide downloadable models fare very poorly, whereas GPT-4 (which does not have a local option) performs well on the aforementioned tasks.
– Even GPT-4 fails quite miserably at the task of “choosing” the better work, even in simple cases, and is fooled by bombastic language or prompt injection attacks.
We appreciate the thoughtful comments. This is a rapidly evolving area. We will keep adding to our FAQs and will provide additional guidance as we learn more.
“NIH scientific peer reviewers are prohibited from using natural language processors” is incredibly broad and covers an entire universe of methods and tools, most of which are not AI/ML and wouldn’t involve any kind of information sharing or disclosure. It sounds like the drafters of this guide may not fully understand NLP, AI, or generative models overall. It sounds like the actual concern is data sharing, so the policy should probably focus more on API calls, data transfer, tuning/training, etc.: exposing data rather than using a class of tools/methods.
CSR has a reporting avenue for any/all issues related to respectful interactions, bias, or anything else that could affect the fairness of the review process. Contact Dr. Gabriel Fosu, the CSR Associate Director of Diversity & Workforce Development at G.Fosu_AssocDir@csr.nih.gov.