In the past few months, the scientific community has ramped up research in response to the SARS‑CoV‑2 pandemic; dozens of peer-reviewed articles and preprints on this topic are being added to the literature every day (Figure 1). This rapidly expanding effort has created challenges for scientists and the medical community who need to analyze thousands of scholarly articles for insights on the virus.
Recently, the National Library of Medicine at NIH joined the White House and key industry and university leaders to release the COVID-19 Open Research Dataset (CORD-19) and to call on the AI community to develop text mining tools that help analyze and summarize more than 45,000 coronavirus articles. The CORD-19 dataset is the most comprehensive freely available library of machine-readable coronavirus scholarly literature to date, and hundreds of AI tools and technologies have already been created in response.
Building on this effort, the NIH Office of Portfolio Analysis (OPA) has assembled a comprehensive listing of COVID‑19 publications and preprints that is freely available to the public and coupled with a user-friendly portfolio analysis interface for querying the full text and supplemental data. The COVID-19 portfolio is updated daily with new literature selected for inclusion by subject matter experts. It draws upon NLM’s PubMed resource for citations and abstracts of published biomedical literature.
This new resource is designed to provide flexibility and ease of use for researchers. Users can take advantage of a full spectrum of Boolean, proximity, and other search methods to query full text and all available supplemental data, or they can limit their search to specific fields, including abstract, author affiliation, or last author. They can also drill down into the data once a search is completed. Figure 2 shows a simple example: if a user is interested in analyzing only the peer-reviewed subset of search results, a single click on the “Peer Reviewed” option in the Source facet (circled in black, panel A) will return the desired results (panel B). Some facetable fields, including journal, article type, and author affiliation, are derived from the underlying source data; others, including chemicals & drugs, conditions, and targets, were generated by OPA. All data can be downloaded as a CSV or Excel file that includes direct links to the publications and preprints. Every search also generates a stable URL that users can share with one another.
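For readers who prefer to work with a downloaded file, the "Peer Reviewed" filter described above can be approximated offline. The following is a minimal sketch, not the tool's own method: the column names (`Title`, `Source`, `URL`), the sample rows, and the helper `filter_by_source` are all hypothetical, so the actual export headers should be checked before adapting it.

```python
import csv
import io

# Hypothetical stand-in for a downloaded CSV export. The column names and
# rows are assumptions for illustration, not the tool's documented schema.
SAMPLE_EXPORT = """Title,Source,URL
Example article A,Peer Reviewed,https://example.org/a
Example preprint B,Preprint,https://example.org/b
Example article C,Peer Reviewed,https://example.org/c
"""


def filter_by_source(csv_file, source_value, source_column="Source"):
    """Return the rows whose source column equals source_value.

    The comparison ignores case and surrounding whitespace.
    """
    reader = csv.DictReader(csv_file)
    wanted = source_value.strip().lower()
    return [
        row
        for row in reader
        if (row.get(source_column) or "").strip().lower() == wanted
    ]


# With a real download, replace io.StringIO(...) with
# open("export.csv", newline="", encoding="utf-8").
peer_reviewed = filter_by_source(io.StringIO(SAMPLE_EXPORT), "Peer Reviewed")
for row in peer_reviewed:
    print(row["Title"], row["URL"])
```

Only the standard library is used here, so the sketch runs without installing anything; a data-frame library would work equally well for larger exports.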
The tool also includes a visualization feature that groups articles into clusters based on key terms. This lets users see, at a glance, the topic areas returned by their search. The clusters in the visualization are also interactive, so users can narrow the results to a specific topic of interest. As an example, Figure 3 shows the results of a search of titles, abstracts, full text, and supplemental text with the terms “protease inhibitor” AND ritonavir (search run at 5 pm on 4/14/2020); note that a plurality of the results of this search can be found on the ChemRxiv preprint server. Both the visualization image and the results can be exported for downstream applications.
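The logic of the example query, a phrase combined with a second term by AND and then tallied by source, can be sketched in a few lines. This is an illustration only: the records are invented placeholders, the field names `source` and `text` are assumptions, and plain substring matching stands in for the tool's search engine, whose actual matching rules are not described here.

```python
from collections import Counter

# Invented placeholder records; the fields and values are illustrative
# assumptions, not data taken from the portfolio.
RECORDS = [
    {"source": "ChemRxiv", "text": "A protease inhibitor screen including ritonavir."},
    {"source": "ChemRxiv", "text": "Docking of ritonavir, a known protease inhibitor."},
    {"source": "PubMed", "text": "Ritonavir as a protease inhibitor in clinical use."},
    {"source": "bioRxiv", "text": "Ritonavir pharmacokinetics in healthy volunteers."},
]


def matches_all(text, terms):
    """Return True when every term occurs in text (case-insensitive AND)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def count_sources(records, terms):
    """Count matching records per source."""
    return Counter(r["source"] for r in records if matches_all(r["text"], terms))


counts = count_sources(RECORDS, ["protease inhibitor", "ritonavir"])
top_source, top_count = counts.most_common(1)[0]
print(top_source, top_count)
```

On these placeholder records, the bioRxiv entry is excluded because it lacks the phrase, and the source with the most matches is reported first.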
We invite you to explore this resource and are excited to see how the research community will use it to gain insight into the COVID-19 outbreak. OPA will continue to add publication sources and features to support the needs of users. Comments are welcome and can be submitted directly through the Feedback button at the bottom left of the tool's browser window.