
A scoping review of large language model based approaches for information extraction from radiology reports

ABSTRACT Radiological imaging is a globally prevalent diagnostic method, yet the free text contained in radiology reports is not frequently used for secondary purposes. Natural Language
Processing can provide structured data retrieved from these reports. This paper provides a summary of the current state of research on Large Language Model (LLM) based approaches for
information extraction (IE) from radiology reports. We conduct a scoping review that follows the PRISMA-ScR guideline. Queries of five databases were conducted on August 1st 2023. Among the
34 studies that met inclusion criteria, only pre-transformer and encoder-based models are described. External validation shows a general performance decrease, although LLMs might improve
generalizability of IE approaches. Reports related to CT and MRI examinations, as well as thoracic reports, prevail. Most common challenges reported are missing validation on external data
and augmentation of the described methods. Different reporting granularities affect the comparability and transparency of approaches.
INTRODUCTION In contemporary medicine, diagnostic tests, particularly various forms of
radiological imaging, are vital for informed decision-making1. Radiologists typically create semi-structured free-text reports for imaging examinations by dictation, following a personal or institutional schema to organize the information contained. Structured reporting, which is used in only a few institutions and for specific cases, on the other hand offers a possibility to
enhance automatic analysis of reports by defining standardized report layouts and contents. Despite the potential benefits of structured reporting in radiology, its implementation often
encounters resistance due to the possible temporary increase in radiologists’ workload, rendering the integration into clinical practice challenging2. Natural language processing (NLP) can
provide the means to make structured information available while maintaining existing documentation procedures. NLP is defined as a "tract of artificial intelligence and linguistics, devoted to
making computers understand the statements or words written in human languages”3. Applied on radiology reports, methods related to NLP can extract clinically relevant information.
Specifically, information extraction (IE) provides techniques to use this clinical information for secondary purposes, such as prediction, quality assurance or research. IE, a subfield
within NLP, involves extracting pertinent information from free-text. Subtasks include named entity recognition (NER), relation extraction (RE), and template filling. These subtasks are
realized using heuristic-based methods, machine learning-based techniques (e.g., support vector machines or Naïve Bayes), and deep learning-based methods4. Within the field of deep learning,
a new architecture of models has recently emerged - namely large language models (LLMs). LLMs are “deep learning models with a huge number of parameters trained in an unsupervised way on
large volumes of text”5. These models typically exceed one million parameters and have proven highly effective in information extraction tasks. The transformer architecture, introduced in
2017, serves as the foundation for most contemporary LLMs, comprising two distinct architectural blocks: the encoder and the decoder. Both blocks apply an innovative approach to creating contextualized word embeddings called attention6.
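For illustration only (a minimal NumPy sketch with assumed toy dimensions, not the implementation of any model discussed here), the core attention computation can be written as follows:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal illustration of scaled dot-product attention (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                         # weighted sum of values per token

# Toy example: 4 tokens with 8-dimensional embeddings (random stand-ins for learned projections).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)             # (4, 8): one contextualized vector per token
```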
Prior to the current "age of transformers", recurrent neural network (RNN)-based LLMs were regarded as state-of-the-art for creating contextualized word embeddings. ELMo, a language model based on a bidirectional Long Short Term Memory (BiLSTM) network7, is an example thereof. Noteworthy transformer-based LLMs
include encoder-based models like BERT (2018)8, decoder-based models like GPT-3 (2020)9 and GPT-4 (2023)10, as well as models applying both encoder and decoder blocks, e.g., Megatron-LM
(2019)11. Models continue to evolve, being trained on expanding datasets and consistently surpassing the performance benchmarks established by previous state-of-the-art models. The question
arises how these new models shape IE applied to radiology reports. Regarding existing literature concerning IE from radiology reports, several reviews are available, although these sources
either miss current developments or only focus on a specific aspect or clinical domain, see Table 1. The application of NLP to radiology reports for IE has already been subject to two
systematic reviews in 201612 and 202113. While the former is not freely available, the latter searches only Google Scholar and includes only one study based on LLMs. Davidson et al. focused
on comparing the quality of studies applying NLP-related methods to radiology reports14. More recent reviews include a specific scoping review on the application of NLP to reports
specifically related to breast cancer15, the extraction of cancer concepts from clinical notes16, and a systematic review on BERT-based NLP applications in radiology without a specific focus
on information extraction17. As LLMs have only recently gained a strong momentum, a research gap exists as there is no overview of LLM-based approaches for IE from radiology reports
available. With this scoping review, we therefore intend to answer the following research question: _What is the state of research regarding information extraction from free-text radiology reports based on LLMs?_ Specifically, we are interested in the following subquestions:
* RQ.01 - Performance: What is the performance of LLMs for information extraction from radiology reports?
* RQ.02 - Training and Modeling: Which models are used and how is the pre-training and fine-tuning process designed?
* RQ.03 - Use cases: Which modalities and anatomical regions do the analyzed reports correspond to?
* RQ.04 - Data and annotation: How much data was used to train the model, how was the annotation process designed and is the data publicly available?
* RQ.05 - Challenges: What are open challenges and common limitations of existing approaches?
The objective of this scoping review is to answer the
above-mentioned questions, provide an overview of recent developments, identify key trends and highlight future research by identifying outstanding challenges and limitations of current
approaches. RESULTS STUDY SELECTION As shown in Fig. 1, the systematic search yielded 1,237 records, retrieved from five databases. After removing duplicate records and records published
before 2018, 374 records (title, abstract) were screened for eligibility. The screening process resulted in the exclusion of 302 records. The remaining 72 records were sought for
full-text-retrieval, of which 68 could be retrieved. During data extraction, 43 papers were excluded due to not fulfilling inclusion criteria, which was not apparent based on information
provided in the abstract. Within the cited references of included papers, nine additional papers fulfilling all inclusion criteria were identified. Therefore, following the above-mentioned
methodology, 34 records in total were included in this review. STUDY CHARACTERISTICS In the following, we organize the extracted information according to the structure of the extraction
table, which in turn reflects the defined research questions. This review covers studies that were published between 01/01/2018 and 01/08/2023. The earliest study included was published in
2019. After eight included studies published in 2020, the topic reaches its peak with eleven studies published in 2021. Eight studies of 2022 were included. Six included studies were
published in the first half of 2023. Based on the corresponding author's address, 15 out of 34 papers are located in the USA, followed by six in China and three each in the UK and Germany. Other
countries include Austria (_n_ = 1), Canada (_n_ = 2), Japan (_n_ = 2), Spain (_n_ = 1) and The Netherlands (_n_ = 1) (Table 2). EXTRACTED INFORMATION This section describes the NLP task,
the extracted entities, the information model development process and data normalization strategies of the included studies. Extracted concepts encompass various entities, attributes, and
relations. These concepts relate to abnormalities18,19,20, anatomical information21, breast-cancer related concepts22, clinical findings23,24,25, devices26, diagnoses27,28, observations29,
pathological concepts30, protected health information (PHI)31, recommendations32, scores (TI-RADS33, tumor response categories34), spatial expressions35,36,37, staging-related
information38,39, and stroke phenotypes40. Several papers extract various concepts, e.g., ref. 41. Studies solely describing document-level single-label classification were excluded from
this review. Two studies apply document-level multi-class classification. Document-level multi-label classification is described in nine studies (26%), of which three classify more than
two classes for each entity. The majority of the included studies (_n_ = 21, 62%) describes NER methods, ten studies additionally apply RE methods. These studies encompass sequence-labeling
and span-labeling approaches. Question answering (QA)-based methods are described in two studies, see Fig. 2.
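To make the difference between these task types concrete, the following sketch (an assumed setup using the Hugging Face transformers library, not the configuration of any included study) shows how the same encoder checkpoint can be combined with either a document-level multi-label head or a token-level NER head:

```python
from transformers import AutoModelForSequenceClassification, AutoModelForTokenClassification

checkpoint = "bert-base-uncased"  # hypothetical base model

# Document-level multi-label classification: one sigmoid output per concept,
# so several concepts can be assigned to the same report.
doc_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=5, problem_type="multi_label_classification")

# NER as sequence labeling: one label per token (e.g., IOB2 tags).
ner_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=7)
```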
The number of extracted concepts (including entities, attributes, and relations) ranges from one entity in both papers describing multi-class classification33,34 up to 64 entities described in a NER-based study30. Three studies base their information model on clinical
guidelines, namely the _Response evaluation criteria in solid tumors_42 and the _TNM Classification of Malignant Tumors_ (TNM) staging system43. Development by domain experts (_n_ = 2),
references to previous studies (_n_ = 3), regulations of the Health Insurance Portability and Accountability Act44 (_n_ = 1), the Stanza radiology model45 (_n_ = 1) and references to
previously developed schemes (_n_ = 2) are other foundations for information model development. One study provides detailed information about the development process of the information model
as supplementary information19. One study reports development of their information model based on the RadLex terminology46, another based on the National Cancer Institute Thesaurus47. 21
studies (62%) do not report any details regarding the development of the information model. Out of the 34 included studies, only three describe methods to structure and/or normalize
extracted information. While Torres-Lopez et al. apply rule-based methods to structure extracted data based on entity positions and combinations30, Sugimoto et al. additionally apply
rule-based normalization based on a concept table24. Datta et al. describe a hybrid approach to normalize extracted entities by first generating concept candidates with BM25, a ranking
algorithm, and then choosing the best equivalent with a BERT-based classifier48.
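The general pattern of such a two-stage normalization can be sketched as follows (a hedged illustration using the rank_bm25 package and a toy concept list; the re-ranking step is only indicated and this is not the pipeline of Datta et al.):

```python
from rank_bm25 import BM25Okapi

# Toy terminology: candidate concepts an extracted entity could be normalized to.
concepts = ["pleural effusion", "pulmonary edema", "pneumothorax", "cardiomegaly"]
bm25 = BM25Okapi([c.split() for c in concepts])

def normalize(mention: str, top_k: int = 2) -> str:
    # Stage 1: lexical candidate generation with BM25.
    candidates = bm25.get_top_n(mention.split(), concepts, n=top_k)
    # Stage 2: a classifier (e.g., a fine-tuned BERT model scoring mention/candidate
    # pairs) would pick the best equivalent; this placeholder keeps the BM25 ranking.
    return candidates[0]

print(normalize("effusion in the pleural space"))  # -> "pleural effusion"
```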
Regarding the distribution of annotated entities within the datasets, only one study reports having conducted measures to counteract class imbalance19. Another study reports not having used the F1 score as a performance measure, as the F1 score is not suited when class imbalances are
present27. Four studies (12%) report coarse entity distributions and seven studies (21%) describe granular entity distributions. MODEL In the following, details regarding the reported model
architectures and implementations are described, including base models, (further) pre-training and fine-tuning methods, hyperparameters, performance measures, external validation and
hardware details. For an overview of applied model architectures, see Table 3. 28 out of 34 papers (82%) describe at least one transformer-based architecture, while the remaining six studies
apply various adaptions of the Bidirectional Long Short-Term Memory (Bi-LSTM) architecture. Out of the 28 studies that describe transformer-based architectures, 27 are based on the BERT
architecture8 and one is based on the ERNIE architecture49. Eight studies (24%) describe further pre-training of a BERT-based, pre-trained model on in-house data. Eighteen studies (53%) use
a BERT-based, pre-trained model without further pre-training. One study applies pre-training to other layers than the LLM. Two studies do not provide any details regarding the architecture
of the BERT models. One study combines both BERT- and BiLSTM-based architectures28. Out of six studies that describe only BiLSTM-based architectures, two studies apply pre-training of word
vectors based on word2vec50. 31 studies (91%) provide sufficient details about the fine-tuning process. Three studies do not provide details24,39,51. Reported performance measures vary
between included studies, including traditional measures like precision, recall, and accuracy as well as different variations of the F1 score (micro, macro, averaged, weighted, pooled). The
performance of studies reporting an F1-score variation (including micro-, macro-, pooled, generalized, exact match and weighted F1) is compared in Table 4. If a study describes multiple
models, the score of the best model was chosen. If two or more datasets are compared, the higher score was chosen. If applicable, the result of external validation is also presented. 22
studies (65%) report having conducted statistical tests, including cross-validation, McNemar test, Mann-Whitney _U_ test and Tukey-Kramer test. Hyperparameters used to train the models
(e.g., learning rate, batch size, embedding dimensions) are described in 28 studies (82%), however with varying degree of detail. Six studies (18%) do not report any details on
hyperparameters. Seven studies (21%) describe a validation of their algorithm on data from an external institution. Seven studies (21%) include details about hardware and
computational resources spent during the training process. DATA SETS In this section, we describe the study characteristics related to data sets, encompassing number of reports, data splits,
modalities, anatomic regions, origin, language, and ethics approval. Data set size used for fine-tuning ranges from 50 to 10,155 reports. The amount of external validation data ranges from
10% to 31% of the amount of data used for fine-tuning. For further pre-training of transformer-based architectures, 50,000 up to 3.8 million reports are used. Jantscher et al. additionally
use the content of a public clinical knowledge platform (_DocCheck Flexicon_52)53. Zhang et al. only report the amount of data (3 GB)54. Jaiswal et al. performed further pre-training on the
complete MIMIC-CXR corpus29. Two studies that described pre-training of word embeddings for Bi-LSTM-based architectures used 3.3 million and 317,130 reports, respectively24,32. Data splits
vary widely; the majority of studies (_n_ = 23, 68%) divide their data into three sets, namely train-, validation- and test-set, with the most common split being 80/10/10, respectively. This
split variation is reported in eight studies (24%). Seven studies (21%) use two sets only, four studies (12%) apply cross-validation-based methods. 19 studies (56%) describe the timeframe
within which reports had been extracted. Dada et al. report the longest timeframe of 22 years, using reports between 1999 and 2021 for further pre-training41. The shortest timeframe reported
is less than one year (2020–2021)26. Several studies are based on publicly available datasets: MIMIC-CXR55 was used once29 while MIMIC56 was used by two studies40,57. MIMIC-III58 was used
by six studies (18%)37,40,48,57,59,60. The Indiana chest X-ray collection61 was used twice35,36. For external validation, MIMIC-II was applied by Mithun et al.62 and MIMIC-CXR by Lau et
al.23. While some of these studies use the datasets as-is, some perform additional annotation. Other studies use data from hospitals, hospital networks, other tertiary care institutions,
medical big data companies, research centers, care centers or university research repositories. Figures 3 and 4 show the frequencies of modalities and anatomical regions, respectively. Note
that frequencies were counted on study-level and not weighted by the number of reports. Report language was inferred from the location of the institution of the corresponding author: Most
studies use English reports (_n_ = 21, 62%) followed by Chinese (_n_ = 6, 18%), German (_n_ = 4, 12%), Japanese (_n_ = 2, 6%) and Spanish (_n_ = 1). The corresponding author of one study is located in the Netherlands, but the study uses data from an Indian hospital62. 19 studies (56%) explicitly state that the endeavor was approved by either a national committee or agency (_n_ =
3, 9%) or a local institutional or hospital review board or committee (_n_ = 15, 44%). One study reports approval only for in-house data, but not for the external validation set from
another institution33. ANNOTATION PROCESS 28 studies (82%) describe an exclusively manual annotation process. Five studies (15%) explicitly state that each report was annotated by two
persons independently. Lau et al. use annotated data to train a classifier that supports the annotation process by proposing only documents that contain potential annotations32. Two studies
use tools for automated annotation with manual correction and review29,31. Lybarger et al. do not provide details on their augmentation of an existing dataset21, three others do not report
details as they either extract information available in the hospital information system33 or exclusively use existing annotated datasets36,59. Annotation tagging schemes mentioned include
IOB(2), BISO and BIOES (short for beginning, inside, outside, start, end).
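As a purely hypothetical illustration (not a sentence or scheme taken from any included study), an IOB2-tagged radiology sentence assigns one tag per token, with B- opening and I- continuing an entity span:

```python
# Hypothetical sentence with IOB2 tags for a single entity type ("Finding").
tokens = ["No", "evidence", "of", "pneumothorax", "or", "pleural", "effusion", "."]
tags   = ["O",  "O",        "O",  "B-Finding",    "O",  "B-Finding", "I-Finding", "O"]
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```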
The number of involved annotators ranges from one to five; roles include clinical coordinators, radiologists, radiology residents, medical and graduate students, medical informatics engineers, neurologists, neuro-radiologists, surgeons, radiological technologists and internists. Existing annotation
guidelines are reported by three studies, four studies mention that instructions exist but do not provide details. 23 studies (68%) do not mention information regarding annotation
guidelines. Inter-annotator-agreement (IAA) is reported by 23 (68%) studies. Measures include F1 score variants (_n_ = 8, 24%), Cohen kappa (_n_ = 7, 21%), Fleiss kappa (_n_ = 19, 56%) and
the intraclass correlation coefficient (_n_ = 1). IAA results are reported by 16 studies (47%) and range, for Cohen kappa, from 81% to 93.7%.
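For reference, Cohen kappa for two annotators can be computed with scikit-learn; the labels below are invented for illustration only:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary annotations of ten reports by two annotators.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(cohen_kappa_score(annotator_a, annotator_b))  # chance-corrected agreement
```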
Eleven studies (32%) mention the tool used for annotation, including Brat23,37,39,48,53,60, Doccano34, TagEditor30, Talen46 and two self-developed tools19,63. DATA AND SOURCE CODE AVAILABILITY Five studies (15%) state that data is
available upon request. One study claims availability, although there is no data present in the referenced online repository57. One study published its dataset in a GitHub repository35. One
study only uses annotations provided within a dataset with credentialed access59. The remaining 22 studies (65%) do not mention whether data is available or not. Regarding source code
availability, ten studies (29%) claim their code to be available. The remaining 24 studies (71%) do not mention whether the source code is available or not. CHALLENGES AND LIMITATIONS
Various aspects related to limitations and challenges are described. The most commonly mentioned limitation is that studies use only data from a single institution21,22,24,30,36,51,53.
Similarly, multiple studies mention validation on external or multi-institutional data as a future research direction19,26,59. Two studies mention the need for semantic enrichment or
normalization of extracted information48,54. Many studies report intentions to augment their described approaches to other report types21,28,30,37, other report sections22, to include other
or more data sources35,39,54 or entities32,62, body parts46, clinical contexts34 or modalities35,53,59. Additional limitations include the application to only a single modality or clinical
area21,46,53, small dataset size27,32,54, technical limitations27,63, no negation detection35,62, few extracted entities24,28 or result degradation upon evaluation on external data19 or more
recent reports25. Missing interpretability is mentioned by two studies28,41. DISCUSSION Performance measures reported in Table 4 cannot be compared due to differences in datasets, number of
extracted concepts and the heterogeneity of applied performance measures. External validation performed by six studies shows in general lower performance of the algorithm applied to
external data, i.e., data from a source different from the one used for training. The largest performance drop of 35% (overall F1 score) was reported in a Bi-LSTM-based study performing multi-label binary classification of only three entities on the document level62. In contrast, Torres-Lopez et al. extracted a total of 64 entities with a performance drop of only 3.16%
(F1 score), although not providing details on their model architecture. The smallest performance drop amounts to only 0.74% (Micro F1) for extracting seven entities based on a further
pre-trained model46. However, it cannot be assumed that further pre-training increases model generalizability and therefore performance. Upon analysis of performance, several inconsistencies
between included studies impair comparability: First, there is no standardized measure or best practice to assess model performance for information extraction. Although the F1 score is, in general, most often applied and well known, there exist many variations, including micro-, macro-, exact and inexact match scores, weighted F1 score and 1-Margin F1 scores.
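How much these variants can diverge on the same predictions is easy to demonstrate with an invented, imbalanced toy example:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for an imbalanced three-class problem.
y_true = ["normal"] * 8 + ["effusion", "nodule"]
y_pred = ["normal"] * 8 + ["normal",   "nodule"]

for average in ("micro", "macro", "weighted"):
    print(average, round(f1_score(y_true, y_pred, average=average), 2))
# prints roughly 0.9 (micro), 0.65 (macro) and 0.85 (weighted)
```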
In contrast, Zaman et al. argue that macro-averaged F1 score or overall accuracy are not suited as performance measures when class imbalances are present27. For the same reason, F1 score is only used to
assess binary classification and not for multi-class classification by Wood et al.19. While 22 studies apply some variation of cross-validation to assess model performance, 12 studies apply
simple split validation methods. Singh et al. show that if data sets are small, simple split validation shows significant differences of performance measures compared to cross-validation64.
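The difference can be illustrated with a minimal scikit-learn sketch on synthetic data (not data from any included study): a single split yields one point estimate, whereas cross-validation reports a mean and spread over folds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a small labeled report dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)

# Simple split: one point estimate, sensitive to how the split happens to fall.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("single split:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: mean and standard deviation over folds.
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validation:", scores.mean().round(2), "+/-", scores.std().round(2))
```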
Specific statistical tests to compare performance of different models include DeLong’s test to compare Area under the ROC Curves19,27, the Tukey-Kramer method for multiple comparison
analysis46 and the McNemar test to compare the agreement between two models22. However, the appropriateness of each test method remains unclear, as shown by Demner et al.65. In general, equations for how performance metrics are computed should always be included in the manuscript to improve understandability, e.g., as done in refs. 22 and 30. To improve comparability of studies,
scores for each class as well as a reasonable aggregated score over all classes should be reported. This review identified only encoder-based or pre-transformer architectures
and no generative models, such as GPT-4 (released in March 2023). The majority of the described models is based on the encoder-only BERT architecture, first described by Devlin et al.8. We
envision multiple reasons: First, although generative models have been available since 201866, they needed time to become established as a technology to be investigated and applied in the
healthcare sector. Second, early generative models might have demonstrated poor performance due to their relatively small size and lack of domain-specific data for pre-training67. Third,
poor performance might also entail model hallucinations: Farquhar et al. define hallucination as “answering unreliably or without necessary information”68. Hallucinations include, among
others, provision of wrong answers due to erroneous training data, lying in pursuit of a reward or errors related to reasoning and generalization68. In contrast, encoder-only models like
the BERT architecture cannot hallucinate as they provide only context-aware embeddings of input data; the actual NLP task (e.g., sequence labeling, classification or regression) is
performed by a relatively simple, downstream neural network, rendering this architecture more transparent and verifiable than generative models. An advantage of LLMs is their capability to
be customized to a specific language or domain (e.g., medicine). First, a base version of the model is trained using a large amount of unlabeled data; this process is called
pre-training. The concept of transfer-learning enables researchers to further customize a pre-trained model to a more specific domain (e.g., clinical domain, another language or from a
certain hospital). This is also referred to as further pre-training. The process of training the model to perform a particular NLP task (e.g., classification) based on labeled data is called
fine-tuning. These definitions (pre-training, further pre-training, transfer learning and fine-tuning) tend to be confused by authors or replaced by other term variants, e.g., “supervised
learning”. However, it is imperative to use clear and concise language to distinguish between the concepts mentioned above. Seven included studies apply further pre-training as defined
Seven included studies apply further pre-training as defined above. The effect of further pre-training depends on various factors, including the specification of the input model and the amount and quality of the data used for further pre-training.
Interestingly, further pre-training of a pre-trained model for another language was not reported. In contrast to traditional further pre-training as described above, Jaiswal et al. show how
BERT-based models achieve higher performance when little data is available based on contrastive pre-training29. The authors claim that their model achieves better results than conventional
transformers when the number of annotated reports is limited. Only two studies solve the task of information extraction based on extractive question answering41,59. Extractive question
answering was already described in the original BERT paper8: Instead of generating a pooled embedding of the input text or one embedding per input token, a BERT model fine-tuned for question answering takes a question together with the report text as input and outputs the start and end tokens of the text span that contains the answer to the posed question - this is also possible if no answer or multiple answers are contained within the text, as shown by Zhang et al.69.
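A minimal sketch of this setup with the transformers pipeline API is shown below, using a general-domain extractive QA model and an invented report snippet (not the models or data of the two included QA studies):

```python
from transformers import pipeline

# General-domain extractive QA model; the included studies fine-tuned their own models.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

report = "Impression: New 8 mm nodule in the right upper lobe. No pleural effusion."
result = qa(question="Where is the nodule located?", context=report)
print(result["answer"], result["start"], result["end"])  # predicted span and its character offsets
```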
The most common modalities for which reports of findings were used in the included studies are CT (_n_ = 16), MRI (_n_ = 15) and X-Ray (_n_ = 14). CT reports appear to be the most common source when using in-house data. According to data provided by the Organisation for Economic Cooperation and Development
(OECD), the availability of CT scanners and MRI machines has increased steadily during the past decades. Furthermore, there has been a general upward trend in the number of CT and MRI examinations performed worldwide70. CT exams are fast and cheap compared to MRI. The most common anatomical regions studied are thorax (_n_ = 17) and brain (_n_ = 8). There might be different
reasons for this distribution. First, chest X-Ray is one of the most frequently performed imaging examinations. Second, six studies used reports obtained from MIMIC datasets, including
thorax X-Ray, brain MRI and babygram examinations. Two studies used thorax X-Ray reports obtained from publicly available datasets. Furthermore, a report on the annual exposure from medical
imaging in Switzerland shows that the thorax region is the third most common anatomical region of CT procedures (11.8%), preceded by abdomen and thorax (16.4%) and abdomen only (17.7%)71. We
identified several aspects that showed different interpretations in the included studies. One of the major ambiguities discovered is the clear definition of the terms test set and
validation set: Some studies use these two distinct terms interchangeably. However, agreement is needed on which set is used during parameter optimization of a model and which set is used for evaluation of the final model. Furthermore, studies report either the number of sentences or the number of documents, hindering comparability. It also remains unclear whether the stated
dataset size includes documents without annotation or annotated data only. Report language is never explicitly stated. Regarding annotation, it becomes apparent that there is no standard for
IAA calculation, the recommended number of annotators and their backgrounds, number of reports, number of reconciliation rounds and, especially, IAA calculation methods. All these aspects differ
widely in the included papers. Good practices observed in the included papers include reporting of descriptive annotation statistics35 and conducting complexity analysis of the report
corpus29,34: These complexity metrics include, e.g., unique n-gram counts, lexical diversity as measured with the Yule I score, and the type-token ratio, as reported in ref. 46.
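Two of these metrics are simple to compute; the sketch below uses an invented sentence and plain whitespace tokenization for illustration:

```python
def type_token_ratio(tokens):
    """Share of distinct tokens among all tokens: a simple lexical diversity measure."""
    return len(set(tokens)) / len(tokens)

def unique_ngrams(tokens, n):
    """Number of distinct n-grams in the token sequence."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})

tokens = "no acute findings no pleural effusion no pneumothorax".split()
print(type_token_ratio(tokens))  # 6 distinct tokens / 8 tokens = 0.75
print(unique_ngrams(tokens, 2))  # number of distinct bigrams
```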
Wood et al. highlight the importance of splitting data on patient level instead of report level19. Last, we want to highlight interesting approaches: Fine et al. first use structured reports for
fine-tuning and then apply the resulting model on unstructured reports34. Jaiswal et al. introduce three novel data augmentation techniques before fine-tuning their model based on
contrastive learning29. Pérez-Díez et al. developed a randomization algorithm to substitute detected entities with synthetic alternatives to disguise undetected personal information31. The
mentioned challenges and limitations are manifold and diverse. Ten papers in total address the topic of generalizing to data from other institutions. Another challenge is the limited scope of every study, be it a limited number of entities or, usually, a single modality and clinical domain. Every included study is based on a pre-defined information model and fine-tuned on annotated data. This means that, as of August 2023, no truly generalized approach for IE had been described in the identified literature. Upon interpretation of the above-mentioned results,
several limitations of this review can be mentioned. First, the definition of _information extraction_ proved to be challenging. We defined information extraction as a collective term for
the NLP tasks of document-level multi-label classification (including binary or multiple classes for each label), NER (including RE), as well as question answering approaches. We excluded
binary classification on the document level. A narrow definition of IE would possibly include only NER and RE, whereas the widest definition would also include binary document
classification. With our approach, we wanted to ensure a balanced level of task complexity. Furthermore, the definition of an LLM was also unclear. In the protocol for this review, LLMs are
defined as “deep learning models with more than one million parameters, trained on unlabeled text data”72. Although BiLSTM-based architectures are not trained on text, the applied
context-aware word embeddings like fastText and word2vec stipulate the inclusion of these architectures into this review. An additional argument for including BiLSTM-based architectures is
ELMO, a BiLSTM-based architecture with ~ 13M parameters, and referred to as one of the first LLMs. However, we decided not to include BiGRU-based architectures, as information on their
parameter count was usually not available. A more narrow definition would only include transformer-based architectures, having billions of parameters. This definition seems to have recently
reached consensus among researchers and in industry. As of the time of submission in June 2024, LLMs tend to be defined even more narrowly, only including generative models based on
autoregressive sampling73. This might be due to generative models currently being the most common and frequent model architecture. On the contrary, a wider definition would also potentially
include BiGRU-based, CNN-based and other architectures. It also remains subject to discussion whether summarization can be regarded as information extraction—for this study, summarization
was not included, potentially missing studies of interest, e.g., ref. 74. Likewise, image-to-text report generation was excluded. Regarding the search strategy, we decided not to include
numerous model names to keep the complexity of the search term low. Instead, we initially only included the terms _transformers_ and _Bert_. Eventually, only two search dimensions were used
because otherwise, the number of search results would have been too small. To minimize the number of missed studies, the reference lists of included studies were screened,
eventually leading to nine additionally included studies that were not covered by the search strategy. Nevertheless, our search strategy was not exhaustive: Studies that used terms related
to _transformation_ or _structuring_ of reports, e.g., refs. 75,76, were missed as these terms were not part of the search strategy. No generative models and therefore no approaches based on
generative models (including few-, single- or zero-shot learning) are included in the search results. This might be due to the fact that generative models have only started to become widely
accessible with the release of ChatGPT in November 2022. Only later did open-source alternatives become available. However, due to the sensitive nature of patient data, utilization of
publicly serviced models, e.g., GPT-4, is restricted due to data protection rules. Until the cut-off time of this review, state-of-the-art, open-source generative models, e.g., LLama 2
(70B), had still required vast computational resources, restricting the possibilities of on-premise deployment within hospital infrastructures. Furthermore, early studies might so far only
be published without peer-review (e.g., on arXiv), excluding them for this review, e.g., ref. 77. As no search updates were performed for this review, arXiv papers that were later
peer-reviewed were also not included, e.g., ref. 78. Relevant papers published in the ACL Anthology were also not included, potentially missing papers describing generative approaches, e.g., by
Agrawal et al.79 and Kartchner et al.80. Sources that did not mention “information extraction”, “named entity recognition” or “relation extraction” in the title or abstract and were not
referred to by other papers were also not included, e.g., ref. 81. Given the diverse nature of the included studies alongside discrepancies in both the quality and quantity of reported data,
a comprehensive analysis of the extracted information was deemed impossible. Future systematic reviews could enhance this comparison by refining the research question and subquestions to a
more specific scope. However, according to the protocol for this scoping review, a purely descriptive presentation of findings was conducted. Another potential limitation is the fact that
data extraction was performed by one author (DR) only. However, prior to data extraction, two studies were extracted by two authors, and the resulting information compared. This led to the
addition of six further aspects to the original data extraction table, including details on hardware specification, hyperparameters, ethical approval, timeframe of dataset and class
imbalance measures. Last, we want to highlight that this scoping review strictly adheres to the PRISMA-ScR and PRISMA-S guidelines. Our search strategy of five databases resulted in over
1200 primary search results, minimizing the risk of missing relevant studies. This risk was further minimized by carefully choosing a balanced definition of both IE and LLMs. As only
peer-reviewed studies were taken into account, a certain study quality was furthermore ensured. Due to the current rapid technical progress, we summarize the latest developments regarding
LLMs in general, their application in medicine, as well as with regard to this review's topic. We give an overview of studies published outside the scope of our review (published after August
1st 2023) as well as on the application of LLMs in clinical domains and tasks different from IE from radiology reports. As of June 2024, the majority of recently published LLMs, be it
commercial or open-source, are generative models, based on the decoder-block of the original transformer architecture. Two development strategies can be observed to increase model
performance: The first strategy is about simply increasing the amount of model parameters (and therefore, model size), leading also to an increased demand for training data. The second
strategy, on the other hand, is about optimizing existing models based on different strategies, including model pruning, quantization or distillation, as shown by Rohanian et al.82. Recent
models include the Gemini family (2024)83, the T5 family84, LLama 3 (2024)85 and Mixtral (2024)86. Moreover, research has increasingly been focussing on developing domain-specific models,
e.g., Meditron, Med-PaLM 2, or Med-Gemini for the healthcare domain87,88,89. In the broad clinical domain, these recent, generative LLMs show impressive capabilities, partly outperforming
clinicians in test settings regarding, e.g., medical summary generation90, prediction of clinical outcomes91 and answering of clinical questions92. Dagdelen et al. have recently demonstrated
that, in the context of structured information extraction from scientific texts, even generative models require a few hundred training examples to effectively extract and organize
information using the open-source model Llama-293. For the specific topic of structured IE from radiology reports, several papers and pre-prints have been published since August 2023: In
general, it becomes apparent that resource-demanding generative models seem not to show better results compared to encoder-based approaches, as shown by the following studies: When applying
the open-source model Vicuna94 to assign binary document-level labels for 13 concepts in radiology reports, Mukherjee et al. showed only moderate to substantial agreement with existing, less resource-demanding approaches95. Document-level binary labeling was also investigated by Adams et al., who compared GPT-4 to a BERT-based model further pre-trained on German medical
documents75. In this comparison, the smaller, open-source model96 outperformed GPT-4 for five out of nine concepts. The authors also tested GPT-4 on English radiology reports, however not
providing any detailed performance measures. Similarly, Hu et al. used ChatGPT as a commercial platform to extract eleven concepts from radiology reports without further fine-tuning or
provision of examples97. The results show inferiority of ChatGPT upon comparison with a previously described approach (BERT-based multiturn question answering98) as well as a rule-based
approach (averaged F1 scores: 0.88, 0.91, 0.93, respectively).
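The zero-shot prompting pattern used in such comparisons can be sketched as follows; the prompt wording, concept list and the llm_complete helper are hypothetical and not taken from the cited studies:

```python
import json

PROMPT_TEMPLATE = """You are extracting findings from a radiology report.
Return JSON with boolean values for the keys: pleural_effusion, pneumothorax, cardiomegaly.

Report:
{report}

JSON:"""

def extract_concepts(report: str, llm_complete) -> dict:
    """llm_complete is a placeholder for any chat/completion API call."""
    response = llm_complete(PROMPT_TEMPLATE.format(report=report))
    return json.loads(response)  # the model is instructed to answer with JSON only

# Example with a stubbed model response, for illustration only:
stub = lambda prompt: '{"pleural_effusion": true, "pneumothorax": false, "cardiomegaly": false}'
print(extract_concepts("Small left pleural effusion. No pneumothorax.", stub))
```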
Mallio et al. qualitatively compared several closed-source generative LLMs for structured reporting, although without clear results99. Additionally, several key gaps remain with the application of the above-mentioned generative models. For example, closed-source models continue to grow larger, requiring an increasing
extent of scarce hardware resources and training data. Moreover, although large generative models currently show the best performance, they are less explainable than, e.g., encoder-based
architectures prevalent in this review’s results100. Generative models and encoder-based models each offer unique advantages and disadvantages. Yang et al. show that generative models might
excel at generalizing to external data by applying in-context learning101. Generative models are by design able to aggregate information and might therefore be more suitable for extracting more complex concepts. Recently, open-source models have become more efficient and compact, as seen in advancements such as the Phi 3 model family102. However, generative models are
usually computationally intensive and require substantial resources for training and deployment. While still facing issues regarding hallucination, this behavior might be improved by
combining LLMs with knowledge graphs, as introduced by Gilbert et al.103. On the other hand, encoder-based models, such as BERT, are highly effective at producing
bidirectional contextual embeddings of input data, which makes them particularly strong in tasks requiring precise comprehension or annotation of text, such as extractive question answering
or NER. They tend to be more resource-efficient during inference compared to generative models. However, encoder-based models often struggle with generating coherent text, a task where
generative models excel. Additionally, while encoder-based models can be fine-tuned for specific tasks, they may not generalize as well as generative models. Moreover, research and industry
currently focus on the development of generative models, as the last encoder-based architecture was published in 2021104. In summary, while generative models currently offer flexibility and
powerful aggregation capabilities, encoder-based models provide efficiency and precision. In this review, we provide a comprehensive overview of recent studies on LLM-based information
extraction from radiology reports, published between January 2018 and August 2023. No generative model architectures for IE from radiology reports were described in the literature. After August 2023, generative models have become more common, although they tend not to show a performance increase compared to pre-transformer and encoder-based architectures. According to the
included studies, pre-transformer and encoder-based models show promising results, although comparison is hindered by different performance score calculation methods and vastly different
data sets and tasks. LLMs might improve generalizability of IE methods, although external validation is performed in only seven studies. The majority of studies used pre-trained LLMs without
further pre-training on their own data. So far, research has focused on IE from reports related to CT and MRI examinations and most frequently on reports related to the thorax region. We
recognize a lack of publicly available datasets. Furthermore, a lack of standardization of the annotation process results in potential differences regarding data quality. The source code is
made available by only ten studies, limiting reproducibility of the described methods. Most common challenges reported are missing validation on external data and augmentation of the
described method to other clinical domains, report types, concepts, modalities and anatomical regions. We conclude by
highlighting the need to facilitate comparability of studies and to review generative AI-based approaches. We therefore plan to develop a reporting framework for clinical application of NLP
methods. This need is confirmed by Davidson et al. who also state that available guidelines are limited14; journal-specific guidelines already exist105. Considering the periodical
publication of larger, more capable generative models, transparent and verifiable reporting of all aspects described in this review is essential to compare and identify successful
approaches. We furthermore suggest that future research focus on the optimization and standardization of annotation processes as well as the development of few-shot prompts. Currently, the correlation between
annotation quality, quantity and model performance is unknown. Last, we recommend the development and publication of standardized, multilingual datasets to foster external validation of
models. METHODS This scoping review was conducted according to the JBI Manual for evidence synthesis and adheres to the PRISMA extension for scoping reviews (PRISMA-ScR). Regarding
methodological details, we refer to the published protocol for this review72. In this section, we give an overview on the applied methodology and explain the adaptations made to the
protocol. The completed PRISMA-ScR checklist is provided in Supplementary Table 1. SEARCH STRATEGY The search strategy comprised three steps: First, a preliminary search was conducted by
searching two databases (Google Scholar and PubMed), using keywords related to this review’s research question. Based on the results, a list of relevant search and index terms was retrieved,
which in turn served as a basis for the iterative development of the full search query. During search query development, different combinations of terms and dimensions of the research topic
were combined to build query combinations that were run on PubMed. Balancing of search results and relevance showed that the inclusion of only two dimensions, “radiology” and “information
extraction”, showed the best balance regarding the quantity and quality of results and was therefore chosen as the final search query. Second, a systematic search was carried out using the
final version of the search query. The PubMed-based query was adapted to meet syntactical requirements of the other four databases, comprising IEEE Xplore, ACM Digital Library, Web of
Science Core Collection and Embase. The systematic search was conducted on 01/08/2023, and included all sources of evidence (SOE) since database inception. No additional limits,
restrictions, or filters were applied. The full query for each database as well as a completed PRISMA-S extension checklist are shown in Supplementary Table 2 and Supplementary Table 3.
Third, reference lists of included studies were manually checked for additional sources of evidence and included if fulfilling all inclusion criteria. No search updates were performed.
INCLUSION CRITERIA Inclusion criteria were discussed among and agreed on by all three authors. No separation was made between exclusion and inclusion criteria; reports were included upon
fulfillment of all the following six criteria:
* C.01: The full-text SOE is retrievable.
* C.02: The SOE was published after 31/12/2017.
* C.03: The SOE is published in a peer-reviewed journal or conference proceeding.
* C.04: The SOE describes original research, excluding reviews, comments, patents and white papers.
* C.05: The SOE describes the application of NLP methods for the purpose of IE from free-text radiology reports.
* C.06: The described approach is LLM-based (defined as a deep learning model with more than one million parameters, trained on unlabeled text data).
SCREENING AND DATA EXTRACTION Record screening was performed by two authors (KD, DR), using the online platform Rayyan106. To improve alignment regarding inclusion
criteria between reviewers, a first batch of 25 records was screened individually. Two conflicting decisions were discussed and clarified, leading to the consensus that BiLSTM-based
architectures might also classify as LLMs and should therefore be included. In order to validate this change, a second batch of 25 records was screened and compared. Three conflicting
decisions helped to clarify that, when an LLM-based architecture is not explicitly stated in the title or abstract, the record should still be marked as included to maximize the overall recall of
relevant papers. Upon clarification of the inclusion criteria, each remaining record (title, abstract) was screened twice. After completion of the screening process, conflicts (comprising
differing decisions or records marked as "maybe") were resolved by including all records that were marked at least once as "included". After screening, records were sought for full-text
retrieval. Data extraction was performed by one author (DR). During the extraction phase, reports were ex post excluded when a violation of inclusion criteria became apparent from the
full-text. Reference lists of included papers were screened for further reports to include. Changes to the published protocol for this review are documented in Supplementary Table 4,
including its description, reason, and date. DATA AVAILABILITY The complete list of extracted documents for all queried databases as well as the completed data extraction table are available
in the OSF repository, see https://doi.org/10.17605/OSF.IO/RWU5M. CODE AVAILABILITY For data screening, the publicly available online platform rayyan.ai was used (free plan), see
https://www.rayyan.ai. REFERENCES * Müskens, J. L. J. M., Kool, R. B., Van Dulmen, S. A. & Westert, G. P. Overuse of diagnostic testing in healthcare: a systematic review. _BMJ Qual.
Saf._ 31, 54–63 (2022). Article PubMed Google Scholar * Nobel, J. M., Van Geel, K. & Robben, S. G. F. Structured reporting in radiology: a systematic review to explore its potential.
_Eur. Radiol._ 32, 2837–2854 (2022). Article PubMed Google Scholar * Khurana, D., Koli, A., Khatter, K. & Singh, S. Natural language processing: state of the art, current trends and
challenges. _Multimed. Tools Appl._ 82, 3713–3744 (2023). Article PubMed Google Scholar * Jurafsky, D. & Martin, J. H. _Speech and Language Processing. An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition_ (Pearson Education, 2024). * Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large
language models. _Nat. Rev. Phys._ 5, 277–280 (2023). Article Google Scholar * Vaswani, A. et al. Attention is all you need. In _Advances in Neural Information Processing Systems_, Vol. 30
(Curran Associates, Inc., 2017). * Peters, M. E. et al. Deep contextualized word representations 1802.05365 (2018). * Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT:
Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C. & Solorio, T. (eds.) In _Proc. Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)_, 4171–4186 (Association for Computational Linguistics, Minneapolis, Minnesota,
2019). * Brown, T. et al. Language models are few-shot learners. In _Advances in Neural Information Processing Systems_, vol. 33, 1877–1901 (Curran Associates, Inc., 2020). * OpenAI et al.
GPT-4 Technical Report 2303.08774. (2023). * Shoeybi, M. et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism 1909.08053 (2020). * Pons, E., Braun, L.
M. M., Hunink, M. G. M. & Kors, J. A. Natural language processing in radiology: a systematic review. _Radiology_ 279, 329–343 (2016). Article PubMed Google Scholar * Casey, A. et al.
A systematic review of natural language processing applied to radiology reports. _BMC Med. Inform. Decis. Mak._ 21, 179 (2021). Article PubMed PubMed Central Google Scholar * Davidson,
E. M. et al. The reporting quality of natural language processing studies: systematic review of studies of radiology reports. _BMC Med. Imaging_ 21, 142 (2021). Article PubMed PubMed
Central Google Scholar * Saha, A., Burns, L. & Kulkarni, A. M. A scoping review of natural language processing of radiology reports in breast cancer. _Front. Oncol._ 13, 1160167
(2023). Article PubMed PubMed Central Google Scholar * Gholipour, M., Khajouei, R., Amiri, P., Hajesmaeel Gohari, S. & Ahmadian, L. Extracting cancer concepts from clinical notes
using natural language processing: a systematic review. _BMC Bioinform._ 24, 405 (2023). Article Google Scholar * Gorenstein, L., Konen, E., Green, M. & Klang, E. Bidirectional encoder
representations from transformers in radiology: a systematic review of natural language processing applications. _J. Am. Coll. Radiol._ 21, 914–941 (2024). Article PubMed Google Scholar
* Wood, D. A. et al. Automated labelling using an attention model for radiology reports of MRI scans (ALARM). In Arbel, T. et al. (eds.) _Proceedings of the Third Conference on Medical
ACKNOWLEDGEMENTS This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. We thank
Cornelia Zelger for her support during the search query definition process. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Institute for Patient-Centered Digital Health, Bern University of Applied Sciences, Biel/Bienne, Switzerland: Daniel Reichenpfader & Kerstin Denecke * Faculty of Medicine, University of Geneva, Geneva, Switzerland: Daniel Reichenpfader * Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland: Henning Müller * Informatics Institute, HES-SO Valais-Wallis, Sierre, Switzerland: Henning Müller CONTRIBUTIONS D.R. conceptualized the study, defined the methodology (including the search strategy), performed the database searches and managed the screening process. D.R. also performed data extraction and authored the original draft. K.D. reviewed and edited the manuscript and participated in the screening process. H.M. provided supervision and contributed to the review and editing of the manuscript. CORRESPONDING AUTHOR Correspondence to Daniel Reichenpfader. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION SUPPLEMENTAL MATERIAL RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not
have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s
Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/. ABOUT THIS ARTICLE CITE THIS ARTICLE Reichenpfader, D., Müller, H. & Denecke, K. A scoping review of large language model based approaches for information extraction from radiology reports. _npj Digit. Med._ 7, 222 (2024). https://doi.org/10.1038/s41746-024-01219-0 * Received: 21 February 2024 * Accepted: 09 August 2024 * Published: 24 August 2024 * DOI: https://doi.org/10.1038/s41746-024-01219-0