Comprehensive reporting guidelines and checklist for studies developing and utilizing artificial intelligence models
Abstract
Background
The rapid advancement of artificial intelligence (AI) in healthcare necessitates comprehensive and standardized reporting guidelines to ensure transparency, reproducibility, and ethical applications in clinical research. Existing reporting standards are limited by their focus on specific study designs. We aimed to develop a comprehensive set of guidelines and a checklist for reporting studies that develop and utilize AI models in healthcare, covering all essential components of AI research regardless of the study design.
Methods
Two experts in statistics from the Statistical Round of the Korean Journal of Anesthesiology developed these guidelines and checklist. The key elements essential for AI model reporting were identified and organized into structured sections, including study design, data preparation, model training and evaluation, ethical considerations, and clinical implementation. Iterative reviews and feedback from clinicians and researchers were used to finalize the guidelines and checklist.
Results
These guidelines provide a detailed description of each item on the checklist, ensuring comprehensive reporting of AI model research. Full details regarding the AI model specifications and data-handling processes are provided.
Conclusions
These guidelines and checklist are meant to serve as valuable tools for researchers, addressing key aspects of AI reporting, and thereby supporting the reliability, accountability, and ethical use of AI in healthcare research.
Introduction
The use of artificial intelligence (AI) in healthcare research is rapidly transforming clinical practice and decision-making by enhancing diagnostic accuracy, improving treatment strategies, and streamlining patient management [1,2]. Given the increase in the use of AI models across various medical disciplines, ensuring AI-based studies are transparent, reproducible, and ethically sound is crucial [3]. Existing guidelines, such as CONSORT-AI [4], DECIDE-AI [5], TRIPOD-AI [6], and CLAIM [7], provide valuable standards for reporting AI research. However, these guidelines are tailored to specific study designs or applications, such as clinical trials or diagnostic accuracy studies, and may not encompass the full range of AI study methodologies.
Comprehensive reporting of the critical aspects of AI research, such as model training, data handling, evaluation metrics, and safety protocols, is essential to ensure that AI systems are effective and reliable in real-world applications. Furthermore, issues of equity, transparency, and patient safety should be addressed to build public trust and ensure that AI innovations contribute positively to healthcare outcomes [8].
This article presents a set of comprehensive reporting guidelines and a checklist designed to standardize the reporting of AI studies across diverse study designs. The checklist provides a succinct overview, while the guidelines elaborate on each checklist item. Key aspects of AI model development and utilization are addressed, including study design, data handling, model training and evaluation, error management, clinical applicability, and safety and ethical considerations, to support the reproducibility, validity, and ethical integrity of AI research [9]. By standardizing the reporting process, these guidelines and checklist aim to foster trust in AI research and advance the responsible use of AI in clinical practice.
Materials and Methods
Identification of key reporting elements
These guidelines and checklist were developed by two experts in statistics from the Statistical Round of the Korean Journal of Anesthesiology. At the beginning of the development process, all elements essential for the transparent and accurate reporting of AI-based studies were identified. These elements encompass all critical aspects of AI model development and usage in clinical settings and address the practical challenges that researchers face when documenting AI research processes to ultimately enhance reproducibility and accountability.
Drafting of the guidelines and checklist
A preliminary draft was created, with the guidelines structured into sections relevant to AI model reporting. Concurrently, a checklist was designed to include the items that researchers could follow to systematically document their work. Researchers are meant to use this checklist as a guide for comprehensive reporting.
Incorporation of clinical scenarios
Clinical scenarios were incorporated into the guidelines when necessary. These examples supplement the checklist items by illustrating common challenges encountered in AI-based clinical research, such as variability in data sources, potential biases in model predictions, and considerations for patient safety when using AI in clinical decision-making.
Finalization of the draft after iterative review and revision
The guidelines and checklist underwent multiple rounds of review and revision. Each item was carefully evaluated to ensure that the final version was comprehensive and user-friendly. Feedback from clinicians and researchers was sought for diverse AI research applications in medical practice. The draft was finalized once a structured and systematic framework for documenting AI studies in clinical research was reached.
Results
This section presents the guidelines and checklist that were developed by the authors (Table 1). The guidelines provide detailed elaborations of each checklist item to guide researchers in thoroughly documenting each component of their AI model study.

Table 1. Comprehensive Reporting Checklist for Studies Developing and Utilizing Artificial Intelligence Models
Comprehensive reporting guidelines for studies developing and utilizing artificial intelligence models
Title
Indicating the use of AI techniques in the title enhances the searchability of the study. Broader terms such as ‘artificial intelligence’ or ‘machine learning’ in the title are also understandable to a wider audience. However, if a specific type of AI model is well known (e.g., deep learning), it may also be used in the title. More precise terminology on the specific AI models and architectures may be reserved for the Abstract.
Abstract
The abstract should provide a structured summary that includes the study design/setting (e.g., prospective or retrospective), an overview of the study population (number of patients, users, examinations, and/or images and age and sex distributions), the type of underlying AI algorithms, an outline of the statistical analyses performed (e.g., P values for comparisons, 95% CI), primary and secondary outcomes, main results, and conclusions. The abstract should thus be comprehensible without reading the full manuscript. The abstract should also indicate the public availability of the software, data, and/or resulting model.
Introduction
The use or setting of AI techniques and the selection of the target population should be supported by pre-existing published or unpublished evidence that addresses clinically and scientifically important issues and questions requiring the use of the AI system. This evidence may indicate previous development of the AI model, internal and external validation, and/or modifications made prior to the current study. The medical conditions of interest, their related problems (e.g., limitations of the standard medical practice with which the AI-based model is to be compared), and the target population should be clearly described. The role that the AI system will play within the clinical pathway to resolve the clinical problems addressed should also be described. In addition, a specific question that could be answered using the AI model should be established. As the role of the AI model may differ depending on who uses it, its users (e.g., physicians, patients, the public) should also be clearly defined. The study objectives, rationale, hypothesis, and anticipated clinical effects or outcomes should also be described.
Methods
1. Study design, setting, and population
1) Study design
The authors should clearly indicate whether the study was conducted prospectively or retrospectively. Additionally, whether the study goal was to evaluate the feasibility of the AI system or its superiority or non-inferiority to the reference test or model (current standard methodologies or models), to conduct an exploratory analysis, or to build a predictive model should be clearly stated. Details on the study goal should also be described (e.g., screening, diagnosis, or staging of a disease; prediction of disease development; anticipation of the prognosis). A reference (e.g., current clinical standards or model), which is to be compared with the AI system to assess its performance, should be clearly described.
2) Study settings
The authors should provide details on the setting where the study was conducted. This includes the type and size of the study environment (e.g., tertiary university hospital, private clinics, public health centers); its sublocation (e.g., operating room, examination room, immunization unit); and the availability of supporting facilities, services, or technologies relevant to the study (e.g., robotic surgical system, otoscope, COVID-19 vaccines). How the study setting and cohorts represent real-world clinical conditions should also be stated. Technical requirements and configurations specific to each study site should also be described in detail (e.g., software, hardware, specialized computing devices, vendor-specific equipment, site-specific modification of the AI algorithm). Because AI models may perform well only in the environment in which they were developed, providing this information allows limitations in generalizability to be better understood [10].
3) Study population
The process used to recruit the participants should be stated, along with the inclusion and exclusion criteria. A flow diagram indicating the number of participants included at each stage should be included if possible [4]. Alternatively, a flow chart can be presented and explained in the Results section.
2. Description of the AI system
1) Study data
The characteristics and quality of the input data significantly affect the performance of AI systems [11]. In addition, a detailed description of input data handling allows for the replication of AI system use outside the study setting and can be used to determine whether data handling is standardized across study sites. Therefore, the original data source (e.g., electronic medical records, public data registry, prospective data collection) and the process used to obtain it from the study population should be stated along with the time period when the data were obtained. If the data were acquired using a specific device/software (e.g., electrocardiographic waveform from a patient monitor), the product information (e.g., device/software name and model, manufacturer information [name, city, country of origin]) and data acquisition protocols should be described in detail (e.g., the frequency at which the waveform was recorded, the type of filter [low-pass or band-pass filter], and the ranges of the filtered frequency). If the data underwent reformatting, the process should be fully reported with its relevant parameters (e.g., the frequency at which the obtained waveforms were resampled or downsampled). If data collection depended on the investigator’s subjective expertise, the number of investigators, their qualifications, and the instructions and training materials used should be described. Whether the measurements and/or observations were independent among the investigators and how inter- and intra-investigator variability was detected and handled should be specified as well. Additionally, the inclusion and exclusion criteria for the AI system input data should be described.
The authors should also specify whether the data were structured. Structured data have clearly defined features (e.g., name of diagnosis, medical procedure, medication, laboratory test result values, population characteristic variables [age, sex, height, weight]), whereas unstructured data lack features that can be explicitly defined (e.g., images, videos, audio recordings, text data, time-series data).
Data pre-processing converts raw data with different formats from various sources into a uniform and consistent format that can be read and used as input by the AI system. The data format compatible with the intended use of the AI model varies according to the type of AI system used (e.g., radiographic images, hemodynamic parameters, laboratory results). Minimum requirements should be set that determine the eligibility of the data before input to the AI system (e.g., image resolution ranges, number of complete or missing data per participant). The authors should also describe how data that did not meet the minimum requirements were handled and how this impacted the clinical pathway that includes the AI system.
In particular, the definition of missing and poor-quality data (e.g., electrocautery artifacts in the electroencephalogram) and outliers, their quantity, and how they were detected and handled should be described, as they diminish AI system performance [12]. If these data were imputed using specific techniques (e.g., last observation carried forward), the resulting biases should be described. The same information should be provided for the comparator (control or reference intervention).
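As an illustration, the quantification and handling of missing data and outliers described above could be documented in code along the following lines. This is a minimal sketch, not a prescribed method; the file name, the column name "map_mmhg", and the plausibility thresholds are hypothetical.

```python
# Minimal sketch (illustrative, not prescriptive): quantify missing data,
# flag implausible outliers, and impute. File name, "map_mmhg" column, and
# plausibility thresholds are all hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("cohort.csv")  # hypothetical raw dataset

# Report the quantity of missing data per feature, as requested above.
print(df.isna().mean().mul(100).round(1))  # % missing per column

# Flag physiologically implausible values as outliers (thresholds illustrative).
outliers = (df["map_mmhg"] < 20) | (df["map_mmhg"] > 250)
print(f"Outliers flagged: {outliers.sum()}")
df.loc[outliers, "map_mmhg"] = np.nan  # treat outliers as missing

# Impute with the median; the chosen method and its biases must be reported.
df[["map_mmhg"]] = SimpleImputer(strategy="median").fit_transform(df[["map_mmhg"]])
```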
Details on data transformation (e.g., normalization, standardization, rescaling, natural-log transformation, encoding categorical variables), feature engineering, and feature selection should be provided such that other researchers can reproduce the process. In particular, the data should be de-identified, protected health information should be completely removed, and facial images should also be rendered unidentifiable [13]. Accordingly, the processes used to de-identify and protect personal information should be fully described.
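The transformations listed above can likewise be captured in a single reproducible object. The sketch below, with hypothetical feature names, shows one common approach using a scikit-learn ColumnTransformer so that the identical standardization and categorical encoding are applied at training and inference time.

```python
# Minimal sketch of a reproducible transformation pipeline (scikit-learn).
# Feature names are hypothetical; the point is that scaling and encoding
# are captured in one documented, shareable object.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "weight_kg", "baseline_map"]   # hypothetical
categorical_features = ["asa_class", "surgery_type"]      # hypothetical

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),          # standardization
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
# Fitting this on the training set only, then applying it to the validation
# and test sets, also helps prevent the information leakage discussed below.
```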
Whether the data were processed or unprocessed before the analysis should be specified, along with whether the data were acquired before the application of the AI system or were generated with the use of the AI system.
The ground truth, which is used as a reference for comparison in supervised learning and can be clinically measured using the gold standard (e.g., histopathologic diagnosis or consensus agreement from a panel of experts), should be annotated with a precise definition. For example, hypotension may be defined as a clinical condition in which the beat-to-beat systolic blood pressure, measured from a catheter placed in the lumen of the right radial artery with the transducer diaphragm placed at the level of the mid-axillary line intersecting the fourth intercostal space [14], is maintained below 80 mmHg for more than 1 min between anesthesia induction and the end of surgery. Unclear definitions with insufficient information, such as blood pressure < 80 mmHg, should be avoided.
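To show how such a precise definition can be operationalized, the following hedged sketch encodes the hypotension label above as a function. The DataFrame layout (beat-to-beat rows with "time_s" and "sbp_mmhg" columns) is an assumption made for illustration.

```python
# Hedged sketch: the hypotension ground-truth definition above as code.
# The DataFrame layout ("time_s", "sbp_mmhg") is assumed for illustration.
import pandas as pd

def label_hypotension(beats: pd.DataFrame, threshold: float = 80.0,
                      min_duration_s: float = 60.0) -> bool:
    """True if systolic pressure stays below `threshold` for > `min_duration_s`."""
    below = beats["sbp_mmhg"] < threshold
    episode_id = (below != below.shift()).cumsum()  # label consecutive runs
    for _, episode in beats[below].groupby(episode_id[below]):
        duration = episode["time_s"].iloc[-1] - episode["time_s"].iloc[0]
        if duration > min_duration_s:
            return True
    return False
```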
2) Study output
The output of the AI system should be specified. In the context of the clinical problem that the current study intends to address, this can include disease diagnosis, grading of disease severity, probability of a clinical event or disease occurrence, disease treatment options, and prediction of clinical parameter values. A uniform format should be used for both the output and ground truth.
If the AI output is used to determine subsequent clinical management that ultimately affects clinical outcomes, this should be described in detail. If clinical management involves medical practice performed by the researchers, this should be standardized. The output of the AI system should also be fully understood and interpreted by the researchers involved in clinical management guided by the output. For example, an intraoperative hypotension prediction algorithm that calculates the probability of a hypotensive episode requires researchers to interpret the probability and take standardized actions based on the probability threshold (e.g., administering weight-based doses of a vasoactive agent when the probability of a hypotensive episode is > 70%).
If the study design compares the performance of the AI system to that of a reference clinical protocol, the process used in the reference protocol to determine subsequent clinical management should be explained in the same manner that the AI system is used to make clinical decisions. Accordingly, the rationale for using the reference standard and its inherent limitations, including errors and biases, should be described.
3) Data separation
The process used to split up the full dataset at the beginning of the study should be described. The dataset can be split into training and test sets, or into training, validation (tuning), and test sets. The proportion of each set to the full dataset should be reported with the rationale. Using an external test set from an independent study site to test the trained model (external validation) is the ideal standard. Otherwise (in the case of internal validation), explicitly reporting and justifying the decision not to take the test set data from a data source external to that of the training data is essential. If the data structure between the training and test sets differs, the measures used to accommodate the difference should be explained.
To prevent bias, the test set needs to represent the target population. Certain methods (e.g., stratified sampling) can be used to maintain the distribution of the clinical outcome variables in the test set such that it is similar to that of the target population. Accordingly, the distribution of variables (including demographics and clinical parameters) in the training, validation, and test sets should be reported and statistically compared to show that the distributions are similar across sets. If any systematic differences are found, the factors involved should be investigated.
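For instance, an outcome-stratified, patient-level split might look like the following minimal sketch using scikit-learn; the file and column names are hypothetical, and the split is performed on one row per patient before any pre-processing, in line with the leakage precautions discussed next.

```python
# Minimal sketch of an outcome-stratified, patient-level split performed
# before any pre-processing. File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cohort.csv")                    # hypothetical dataset
patients = df.drop_duplicates("patient_id")       # one row per patient

train_ids, test_ids = train_test_split(
    patients["patient_id"],
    test_size=0.2,                                # report the proportion used
    stratify=patients["outcome"],                 # preserve outcome distribution
    random_state=42,                              # report for reproducibility
)
train_set = df[df["patient_id"].isin(train_ids)]
test_set = df[df["patient_id"].isin(test_ids)]
```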
To prevent overfitting of the trained model, in which performance is good on the training set but poor on the validation set, internal k-fold cross-validation can be conducted. This type of validation involves splitting the dataset into k subsets and training and validating the model k times, using each subset once as the validation set and the remaining k−1 subsets as the training set. When splitting a dataset, information leakage should be prevented. Information leakage occurs when the training set is contaminated with information from the validation or test sets, which should remain exclusive to those sets. To ensure that each set is separated at the patient level or higher, the sets should be split from the study population at the beginning of the study, before data pre-processing and model training. Details on the steps taken to prevent overfitting and information leakage, both of which result in poor generalization of the model, should be provided. A sketch of leakage-safe cross-validation is given below.
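```python
# Sketch of leakage-safe k-fold cross-validation: GroupKFold keeps every
# record of a given patient in a single fold. Names are illustrative.
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.read_csv("training_set.csv")              # hypothetical training data
feature_cols = ["age", "baseline_map"]            # hypothetical features
X, y, groups = df[feature_cols], df["outcome"], df["patient_id"]

for fold, (tr, va) in enumerate(GroupKFold(n_splits=5).split(X, y, groups)):
    # No patient appears in both partitions, so no information can leak
    # from the validation fold into training.
    assert set(groups.iloc[tr]).isdisjoint(groups.iloc[va])
    print(f"fold {fold}: {len(tr)} training rows, {len(va)} validation rows")
```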
4) Concise description of the AI system
Scientific rationales should be used to determine the type of model to train. The model task (e.g., classification and regression [numerical prediction]) and its beneficiaries (if any) should be specified. The mathematical algorithm of the AI system; hardware environment; and supporting software, library, or package required for its operation, including the versions, should be described. Relevant information includes the name of the developer and/or manufacturer, their location, and specific configuration settings. If previous development/validation studies of the AI system are available, they should be cited in the manuscript and presented in the same manner as the information about the AI system used in the current study. The provision of AI system information from development/validation studies allows changes in the performance of the AI system to be assessed when the current study population differs from that of the development/validation studies. Using a standardized reporting system also allows concise information about the AI model to be provided [15]. If a new unpublished mathematical model is developed and used in the study, a full description should be provided as an appendix or supplementary material at the end of the manuscript, or the model should be deposited in an established public database along with accession details; such databases either prevent arbitrary revision of the model development records once registered or retain a complete history of revisions.
Several sequential stages, which require considerable time and computing power, are required to construct the final AI model. Throughout this process, several versions of AI models might be created. If the AI system was used in a clinical trial [4] or its ability to make appropriate or optimal clinical decisions is being assessed [5], the version should be clearly indicated with a regulatory marker, such as a unique device identifier. If the version of the AI model was modified, this should be justified by scientific rationale, and the changes made to the original version should be described.
The architecture of the AI model should be fully described such that it can be reconstructed by other researchers. This includes the inputs, outputs, and components specific to the type of AI model. Scientific rationale for the selection of each component should be provided. The architecture of the AI model can be provided in code as supplementary data.
As an example, a convolutional neural network model consists of the following components:
1) An input layer (e.g., a one-dimensional 5-min electrocardiogram waveform collected at 300 Hz) characterized by the number of nodes and the number and size of batches;
2) Convolutional and pooling layers, characterized by the number and order of the layers, the number and size of the kernel(s) in each layer, the type of activation function (e.g., rectifier activation function, softplus), the type of pooling operation (e.g., max, min, average, hybrid), the type of normalization applied to the output of a previous layer (e.g., batch or layer normalization), and the use of dropout layers and their dropout rates;
3) Fully connected layers, characterized by the number of hidden layers, the number of nodes in each layer, and the type of activation function, as for the convolutional and pooling layers;
4) An output layer (e.g., whether hypotension develops);
5) The loss (objective) function (e.g., mean squared error or mean absolute error for regression models; binary or categorical cross-entropy loss for classification models) and the model optimization algorithm (e.g., Adam optimizer, gradient descent) with its hyperparameters (e.g., learning rate [the step size of parameter updates], exponential decay rates, epsilon);
6) Regularization algorithms (e.g., L1 [Lasso] regularization, L2 [Ridge] regularization, or elastic net regularization, which combines the two);
7) The hyperparameter tuning strategy (e.g., grid search, random search);
8) Stopping criteria for training (e.g., the maximum number of epochs after which training stops regardless of whether the trained model converges; the patience parameter [the number of epochs without improvement in validation performance tolerated before training stops]); and
9) The criteria used to select the model with the best performance.
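To make the list above concrete, the following is a hedged PyTorch sketch of such a one-dimensional convolutional network for a 5-min, 300 Hz electrocardiogram input (90,000 samples) with a binary hypotension output. All layer counts, kernel sizes, and hyperparameter values are illustrative choices, not recommendations.

```python
# Hedged sketch of the 1-D CNN components enumerated above (PyTorch).
# All sizes and hyperparameters are illustrative, not prescriptive.
import torch
import torch.nn as nn

class EcgCnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7),   # convolutional layer 1
            nn.BatchNorm1d(16),                # batch normalization
            nn.ReLU(),                         # rectifier activation
            nn.MaxPool1d(4),                   # max pooling
            nn.Conv1d(16, 32, kernel_size=5),  # convolutional layer 2
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Dropout(p=0.3),                 # dropout layer and rate
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(64),                 # fully connected hidden layer
            nn.ReLU(),
            nn.Linear(64, 1),                  # output layer (hypotension yes/no)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = EcgCnn()
loss_fn = nn.BCEWithLogitsLoss()                           # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, learning rate
x = torch.randn(8, 1, 90_000)                              # batch of 8 dummy ECGs
print(model(x).shape)                                      # torch.Size([8, 1])
```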
5) Model training
Details on all the model training processes used should be provided such that other researchers could reproduce them. Providing them in code is strongly encouraged.
If training data augmentation is required (for images, text, audio, etc.), the techniques used should be described (e.g., geometric transformations, paraphrasing, introducing noise).
The initialization of the parameters in the AI model (to prevent issues such as vanishing or exploding gradients and to enhance the convergence speed and model performance) should be described. If the initial parameters are randomly drawn from a specific distribution (e.g., uniform or normal distribution), the distribution should be described with its key parameters (e.g., lower and upper bounds for a uniform distribution and mean and standard deviation for a normal distribution). If the initial parameters are obtained from a model previously trained on a different large dataset (transfer learning), the source of the initial parameters from the pre-trained model should be provided. When using both random initialization and transfer learning, the initialized parameters and the modality used to initialize them should be specified. If some parameters are obtained from transfer learning and cannot be modified, they should be indicated as frozen or restricted. In addition, details on the specific restrictions applied and the portion of training affected by the restrictions should be specified.
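The following minimal sketch illustrates transfer learning with frozen parameters in PyTorch, reusing the hypothetical EcgCnn class from the architecture sketch above; the checkpoint file name is likewise hypothetical.

```python
# Minimal sketch of transfer learning with frozen parameters (PyTorch),
# reusing the hypothetical EcgCnn class; "pretrained.pt" is hypothetical.
import torch

model = EcgCnn()
state = torch.load("pretrained.pt")            # parameters from a prior dataset
model.load_state_dict(state, strict=False)     # transfer the matching weights

# Freeze the transferred feature extractor; only the fully connected head
# remains trainable. Report which parts were frozen and why.
for param in model.features.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```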
The convergence of the model should be monitored by checking whether the pre-defined stopping criteria for training are satisfied by the best hyperparameter combinations. If convergence is not achieved, reviewing the data quality, feature scaling, learning rate, batch size, model architecture, parameter initialization, regularization (if any), optimization algorithm and hyperparameters, and gradient descent issues, such as vanishing or exploding gradients, is mandatory. For example, for a neural network model that fails to converge, the researcher can vary the number of hidden layers and their nodes, apply different activation functions, or adjust the learning rate.
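Convergence monitoring against a pre-defined stopping criterion might be documented as in the sketch below, where training ends once the validation loss has failed to improve for a fixed number of consecutive epochs (the patience parameter); train_one_epoch and evaluate are hypothetical helper functions, and the model and optimizer are those from the earlier sketches.

```python
# Sketch of convergence monitoring with early stopping; train_one_epoch and
# evaluate are hypothetical helpers, model/optimizer come from the sketches above.
best_val, patience, stale_epochs = float("inf"), 5, 0
for epoch in range(200):                                  # maximum epochs
    train_loss = train_one_epoch(model, optimizer)        # hypothetical helper
    val_loss = evaluate(model)                            # hypothetical helper
    if val_loss < best_val:
        best_val, stale_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")         # keep the best model
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            print(f"stopped at epoch {epoch}: no improvement for {patience} epochs")
            break
```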
The metrics used for model validation should also be described in detail (e.g., sensitivity, specificity, positive predictive value [precision], negative predictive value, area under the receiver operating characteristic curve, mean squared error, root mean squared error, accuracy).
6) Model evaluation
If more than one AI model is trained and planned to be evaluated using a test set that is independent of both the training and validation sets, the modality and metrics for assessing model performance must be pre-specified and used to select the most relevant model, typically the one demonstrating the best performance (e.g., area under the receiver operating characteristic curve, accuracy, and precision for classification; mean squared error for regression [numeric prediction]). The metrics used to measure model performance should be presented with statistical uncertainty (e.g., 95% CI) and compared between models using appropriate statistical tests that determine the statistical significance of the metric differences, thereby addressing the clinical problems the current study is meant to address. If CIs cannot be directly calculated owing to unknown error distributions, they can be estimated non-parametrically using bootstrapping.
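As an example of the non-parametric approach mentioned above, a bootstrap 95% CI for the area under the receiver operating characteristic curve can be estimated as in the following sketch; the labels and scores are random placeholders.

```python
# Sketch of a non-parametric bootstrap 95% CI for the AUC, for use when the
# error distribution is unknown. Labels and scores are random placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)       # placeholder ground truth
y_score = rng.random(500)                   # placeholder model scores

aucs = []
for _ in range(2000):                       # number of bootstrap resamples
    idx = rng.integers(0, len(y_true), len(y_true))
    if np.unique(y_true[idx]).size < 2:     # resample needs both classes
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC 95% CI: [{lo:.3f}, {hi:.3f}]")
```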
As no single methodology is perfect for evaluating a model, using more than one technique is strongly recommended. Authors retain the flexibility to choose the evaluation methods most appropriate for their study, but should pre-specify them where possible and provide a rationale for the selection; any deviation from the pre-defined evaluation protocol should be justified.
If multiple models are determined to be the best-performing models, the final model selection should be justified. If the goal of the study is to construct an ensemble of models, descriptions of the three components of the ensemble method should be provided: 1) the allocation function that assigns training data (e.g., via bootstrap sampling) to each model; 2) the combination function that reconciles prediction disagreements among models (e.g., the final prediction is made by a majority vote of the models, by weighting votes from each model based on its performance, or by learning combinations of each model’s predictions [stacking]); and 3) a full description of each model in the ensemble, as mentioned above. If previously published models address the same clinical problem as the current study, the final model should be compared with them. To assess the robustness of the study findings, the sensitivity of the AI model should be analyzed using different assumptions or various initial conditions.
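The three ensemble components can be made explicit in code. The sketch below uses scikit-learn stacking as one illustration: bootstrap allocation occurs inside the random forest member, and a logistic-regression meta-learner serves as the combination function; the member models and settings are arbitrary examples.

```python
# Sketch of an ensemble via stacking (scikit-learn): a logistic-regression
# meta-learner combines out-of-fold predictions of the members; bootstrap
# allocation occurs inside the random forest. Members are illustrative.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # reconciles member predictions
    cv=5,                                  # out-of-fold predictions for stacking
)
# ensemble.fit(X_train, y_train)           # hypothetical training data
```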
Because misinterpretations of model outcomes can lead to biases and inappropriate applications in healthcare settings [8], the intended manner of interpreting or explaining the results of the AI model should be described (e.g., an AI model developed to predict intraoperative hypotension is used to predict the development of hypotension in intensive care unit settings).
3. Miscellaneous aspects of the AI model description
1) Defining features and response variables
If possible, using common data elements that provide standardized, uniform, and consistent names, definitions, formats, and coding of variables across studies that are compatible with different study settings is strongly recommended [16].
2) Sample size estimation
Whenever possible, the sample size required for the study should be calculated from the results of a pilot or previous study to achieve the predetermined statistical power at an acceptable type I error rate.
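As one illustration, a sample-size calculation of this kind can be performed with statsmodels; the standardized effect size below is a hypothetical value that would come from a pilot or previous study.

```python
# Sketch of a sample-size calculation with statsmodels for a two-sample
# comparison; the effect size of 0.3 is a hypothetical pilot-study estimate.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.3,   # from a pilot or previous study (hypothetical)
    alpha=0.05,        # acceptable type I error rate
    power=0.8,         # predetermined statistical power
)
print(f"required participants per group: {n_per_group:.0f}")
```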
Results
1. Study data
Including a flowchart or diagram to show the inclusion or exclusion of participants and/or data at each stage, based on the corresponding inclusion/exclusion criteria, is strongly recommended. The number of participants and data included or excluded at each stage, along with the criteria applied, should be presented. If a flowchart or diagram is provided in the Methods section, presenting it again in the Results section is redundant.
When summarizing the technical characteristics of the dataset, authors should specify whether the dataset was prepared as planned in the Methods section. If the characteristics of each partitioned dataset, with statistical comparisons, are reported in the Methods section, reporting them again in the Results section is redundant.
Even minor differences in the input datasets can have a significant impact on the output of the AI model (i.e., its performance) and, subsequently, on patient safety if the model is used for major clinical decision-making [17,18]. Therefore, caution is advised if the distribution of the data changes (dataset shift) [18] between the training and test sets or between study settings. To address this issue, the population characteristics of the training and test sets should be described and compared.
The baseline characteristics should be selectively reported according to the task of the AI model, factors influencing the study outcomes, or protection of privacy (e.g., age, sex, gender, race, ethnicity, socioeconomic status, geographical location, prevalence, distribution [categorization/severity], risk factors of the medical conditions of interest, features input into the AI model, and coexisting medical conditions relevant to the study).
The presence of missing data significantly affects model performance and contributes to ethical issues [19-21]. The degree of missing data depends on the study settings (e.g., computer simulation modeling vs. clinical settings). Accordingly, the quantity of missing data according to the data features input into the AI model should be clearly reported.
2. Model performance
1) Reporting metrics with statistical uncertainty
Metrics with statistical uncertainty and the significance of model performance on the training, validation, and test sets should be reported as planned in the Methods section. The types of metrics reported are dependent on the data type and models used in the study (e.g., F-score, Dice-Sørensen coefficient).
2) Clinical translation of model performance
In addition to the performance of the model itself, evaluating how its predictive performance translates into clinical outcomes with specific metrics such as sensitivity, specificity, positive predictive value, negative predictive value, area under the receiver operating characteristic curve, and numbers needed to treat is essential. Accordingly, a justification of the metrics selected should be presented with the scientific rationale. The performance of the final model can be statistically compared with that of the standard technique or baseline model.
3) Feature contribution analysis
The contribution of each feature to the predictive performance of the AI model should be clearly described. For example, a SHapley Additive exPlanations (SHAP) plot is useful for showing the impact of every feature from each sample on the output predicted by the model.
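A minimal sketch of producing such a SHAP summary (beeswarm) plot with the shap package is shown below. The model and synthetic data are placeholders, and a regression model is used so that the returned attribution array has a simple shape; the feature names are hypothetical.

```python
# Sketch of a SHAP summary plot; model, data, and feature names are placeholders.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                         # placeholder features
y = X[:, 0] * 2 + X[:, 1] + rng.normal(size=200)      # synthetic outcome

model = RandomForestRegressor(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)                 # per-feature attributions
shap_values = explainer.shap_values(X)                # (n_samples, n_features)
shap.summary_plot(shap_values, X,
                  feature_names=["age", "map", "hr", "spo2"])  # hypothetical
```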
4) Sub-group performance
If a subgroup analysis was performed, the subgroups for which the AI model performed best and worst should be indicated. The performance of any important subgroup should also be reported, as mentioned above. To demonstrate the performance and limitations of a classification model, providing a confusion matrix that shows whether the predicted classification matches the actual classification can be helpful.
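The confusion matrix and the clinical metrics derived from it might be reported as in the following sketch; the ground-truth labels and predictions are placeholders.

```python
# Sketch of a confusion matrix and derived clinical metrics; the labels and
# predictions are placeholders.
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 1, 1, 1, 0, 1, 0]   # placeholder ground truth
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # placeholder model predictions

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"sensitivity (recall): {tp / (tp + fn):.2f}")
print(f"specificity:          {tn / (tn + fp):.2f}")
print(f"PPV (precision):      {tp / (tp + fp):.2f}")
print(f"NPV:                  {tn / (tn + fn):.2f}")
```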
5) Sensitivity analysis
For a sensitivity analysis of the classification models, descriptions of cases that present the highest model confidence with correct and incorrect prediction and the lowest confidence regardless of prediction correctness can be provided. For example, in an 84-year-old male patient with hypertension and congestive heart failure who underwent emergent pneumonectomy, a classification model predicting reintubation in the post-anesthetic care unit calculated a model confidence of 95%, and his trachea was actually reintubated according to the ground truth. This case shows high model confidence and correct prediction, implying that the model correctly identified high-risk patients for reintubation with definite risk factors such as old age, multiple comorbidities, and surgery involving the respiratory tract.
Similarly, for regression models, a sensitivity analysis can be performed by describing cases with the largest difference (error) between a lower predicted value and a higher actual value, cases with the largest difference between a higher predicted value and a lower actual value, and cases with the smallest difference between the predicted and actual values. For example, a case in which the systolic blood pressure predicted by a regression model is 300 mmHg and the actual value is 150 mmHg may be identified as the case with the largest difference between a higher predicted value and a lower actual value.
6) Unsupervised model assessment
The results of unsupervised learning can be assessed by field experts for accuracy and relevance by comparing them with typical patterns. Such assessments, like the sensitivity analyses above, enhance the understanding of model behavior, foster transparency, and guide future improvements.
3. Use of an AI model in clinical practice
If an AI model is planned to be used as part of clinical practice or clinical decision-making, the adherence or non-adherence of investigators to the study protocols for the use of the AI model should be reported, because it not only affects the study outcomes but also provides useful information for the implementation of the same AI model in subsequent studies. If possible, it would be helpful to describe an illustrative case of non-adherence in which the AI model could not be used, either deliberately or accidentally, even though its use had been planned.
Unexpected changes in or impacts on medical practice or the patient experience caused by using an AI model should be reported because they may act as confounding factors. For example, laboratory tests and/or radiographic imaging required for the AI model beyond routine clinical practice, manual input of data into the AI model interface, or manual retrieval of the AI output and its subsequent recording in the medical record can increase patient discomfort and inconvenience, risks to patient safety, medical personnel workload, and/or the time required before clinical decision-making and the relevant clinical practice are performed. If any changes external to the implementation of the AI model are considered to have affected AI model performance or the conduct of the study, they should be reported.
If any modification was made to the AI algorithm during the study, the kind of modification, the stage of the study at which it was made, and its impact on the study outcomes should be fully reported.
If the clinical recommendations provided by the AI model are determined to be erroneous based on the ground truth and have the potential to threaten patient safety, the subsequent steps in the clinical pathway that follow the erroneous recommendations should be interrupted and correctly redirected by the investigators. Conversely, the AI model may make a correct decision that disagrees with the investigators’ decision, highlighting its effectiveness. For an appropriate appraisal of the AI model for clinical decision-making, the frequency of agreement and disagreement between the clinical decisions of the AI model and those of the investigators, as adjudicated against the ground truth, should be reported.
Discussion
A summary of the study results, their contribution to advancing our knowledge, their clinical implications, and their impact on relevant academic fields should be described. Whether the use of the AI model is supported by the study results compared to previous studies or current standards can also be stated. Comparisons can be made by referring to the performance metrics presented in the Results section.
Study limitations should be stated regarding the study materials and methods, unanticipated results, statistical uncertainty, any kind of bias, the generalizability of the study results, any issues and challenges preventing wide application of the AI model in clinical fields, and questions that remain unanswered by the current work. By balancing the strengths and weaknesses (limitations) of the evidence provided, the extent of support for the tested AI model can be determined, rather than relying solely on its potential benefits.
The effects of human factors on model performance should be discussed. Future actions to be taken based on the study results should also be described (e.g., improvements to and/or modifications of the AI model for the next phase [widening its indications in different clinical settings]).
For safety issues, the authors should discuss the following: errors and risks related to the use of the AI model, adverse events and significant changes to the subsequent steps in the clinical pathways as a result, including whether they were attributable to errors, and the contributions of human factors to the errors. To mitigate these aspects for future studies, specific strategies, such as retraining or modification of the models, should be suggested with relevant rationales (e.g., model modification is recommended because it requires less time than retraining the model from scratch and has a higher likelihood of reducing risks compared to retraining). To obtain public trust in new technologies, all safety issues and strategies to mitigate or prevent them should be reported fully and transparently and discussed openly.
Public accessibility of the AI system, source code, and raw data
In the absence of the source code and data used to train the AI model, the model cannot be reproduced. The algorithm, with the relevant source code and training data, as well as the data collected during the study using the AI model, should be shared publicly so that the generalizability of the AI model can be evaluated transparently and unbiased comparisons with different models in different settings can be conducted.
To enable independent researchers to verify the code and replicate the results claimed by the original authors without modifying the code, the code should be provided in well-documented and easily understandable scripts or notebooks with clear and detailed explanations and annotations. Formatted raw data used as the model input, versions of libraries, packages, modules, or software components necessary for the code to function correctly, and any computer system configuration requirements should also be shared. Accessible links to repositories, contact information, and instructions for obtaining access should be provided.
Beyond the final outcome, generating as many intermediate results or outputs as possible at each stage of the model-building process can help independent researchers identify the specific steps at which replication may diverge from the original process. This detailed replication process enables other researchers to validate or adapt the model to their clinical cohorts, thereby accelerating the development of new, similar models for different clinical settings and thus establishing best clinical practices.
Unless the AI system is proprietary to commercial entities or governed by licenses that restrict its use, openly sharing the AI system and/or its code for public access and use is strongly recommended. If access to the AI system is restricted, the reason should be stated. If privacy protection issues limit access to the training data, at least the source code of the AI model should be publicly released.
For reference, an AI modeling checklist can be used, which categorizes the level of sharing on a 4-tier scale from fully open sharing to no sharing [22]. Reproducibility standards with three levels of computational reproducibility are also available [23]. Accordingly, model repositories and academic journals can set appropriate levels of sharing based on their policies, standards, and/or requirements.
Other information
1. Pre-registration of AI research to prevent p-hacking
P-hacking is closely associated with the issues of overfitting and information leakage in AI research. When researchers selectively report results by repeatedly performing analyses until significant outcomes are obtained [24], often by testing multiple model configurations with repetitive parameter tuning, they inadvertently overfit the model, causing it to capture noise or irrelevant patterns in the training data rather than true characteristics that generalize to different data. This behavior undermines the performance of the model on unseen data, resulting in misleading predictive power [25]. Furthermore, p-hacking can increase the risk of information leakage, in which data meant to remain independent for validation or testing inadvertently influence the training process, leading to artificially inflated performance metrics that do not hold up in real-world settings. Pre-registration of AI research addresses these issues by requiring researchers to commit to a specific study design, data-handling procedures, and analysis methods before accessing the data. This commitment limits the flexibility that facilitates p-hacking and enforces strict guidelines for splitting the dataset to prevent information leakage, thereby promoting transparency in the research process and ensuring true model performance and generalizability [26].
2. Safety issues related to errors following the use of AI models in medical practice
If the recommendations provided by the AI system are found to mislead clinical practice and affect patient safety and clinical outcomes, they can be rejected by the researchers. Authors should indicate who makes clinical decisions at each step along the clinical pathway.
Errors that occur as a result of using an AI algorithm are often unforeseen and can cause catastrophic results when the algorithm is used on a large scale. Therefore, AI algorithm (performance) errors, errors external to AI model use, and human errors should be reported along with their occurrence rates, causes, and impacts on the clinical pathway/practice, study outcomes, and patient safety. How the errors were detected and handled should also be described so they can be corrected, along with whether the AI algorithm and/or human errors were detected before patient safety was jeopardized. Accordingly, efforts should be made to reduce the risks caused by these errors. Transparency in reporting and analyzing these errors helps prevent repetition of the same errors in future studies using the same AI model and helps improve and upgrade the AI model.
Direct and indirect, as well as expected and unexpected, adverse events attributable to errors, misuse, and even correct use of the AI model should be reported along with the strategies used to mitigate them. The relevant risks to patient safety should also be identified and assessed. The safety profile of the AI model based on the aforementioned harms and risks can be the cornerstone of preventive measures in future studies and can help determine appropriate target populations and timing for safe AI model use.
To demonstrate how well trained the investigators were, both those who collected the data for AI model development and those who collected the data produced during its use, learning-curve metrics should be presented chronologically. A graphical representation is also encouraged so that other investigators can use the same AI model in different clinical settings.
Unless reporting errors and related safety profiles are planned or performed, the reasons for the omission should be explained.
3. Human errors in AI model use
Human errors in data preparation can significantly affect the performance of an AI system and its outcomes. Whether the data were prepared manually or based on an automatic operating algorithm must be stated. If the process is automated, the tools and algorithms should be fully described along with their parameters. If the input data are selectively acquired by a researcher, the researcher should be fully trained in appropriate data selection according to a standardized data selection protocol; otherwise, ethical issues can arise in the case of adverse events. For example, a histological image from normal tissue rather than from a cancerous region may be input into an AI algorithm, thereby leading to misdiagnosis. In the absence of a standardized data selection protocol, whether the adverse events were caused by human errors in input data selection or algorithmic flaws in the AI system will be unclear. In addition, determining whether real clinical practice can accommodate the standardized data selection protocol is essential. Researchers using AI models in clinical practice should also be fully trained because data collected by poorly trained researchers who are not familiar with AI model use could bias the study results. Effectively presenting the training status could also involve evaluating the learning curves.
4. Errors external to the AI system
If the AI system is used in real clinical settings, the external factors influencing its performance should be considered. For example, any clinical decision made by the AI system and approved by the researchers can be declined by patients for personal reasons unrelated to their clinical conditions and medical treatments (e.g., financial circumstances, religion, life perspective).
5. Ethical considerations regarding equity
Considerable effort should be made to assess and promote fairness and equity when developing an AI model, given that inequity is embedded in the current reference standard practice in the healthcare system. The rationale for these efforts should be described in accordance with the study goals. For example, the American Heart Association Get with the Guidelines-Heart Failure Risk Score systematically gives black patients low risk scores [27] without a rationale for this adjustment [28], making this population less likely to benefit from the cardiology service [29]. To address these issues, the AI model that is intended to predict heart failure risk should be appropriately adjusted by including a sufficient proportion of black patients and features relevant to this population (e.g., high blood pressure, left ventricular hypertrophy) in the input data [30].
Discussion
Unlike existing reporting standards, the guidelines and checklist developed in this study offer a comprehensive and versatile framework for reporting studies involving AI models in healthcare regardless of the specific study design. In particular, efforts have been made to ensure that essential components of AI research and reporting are not overlooked. However, these guidelines and checklist were developed solely by the two authors, both experts in statistics, and may not fully reflect diverse perspectives from other AI experts. Further refinement may thus be necessary to address these limitations and incorporate broader expertise.
In conclusion, these guidelines and checklist provide a valuable tool for reporting AI model research through enhancing transparency, ensuring reproducibility, and promoting the appropriate use of AI models in clinical practice.
Notes
Funding
None.
Conflicts of Interest
Sang Gyu Kwak and Jonghae Kim have been board members of the Statistical Rounds of the Korean Journal of Anesthesiology since 2016. However, they were not involved in any review process for this article, including peer reviewer selection, evaluation, or decision-making. No other potential conflict of interest relevant to this article was reported.
Data Availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Author Contributions
Sang Gyu Kwak (Conceptualization; Methodology; Project administration; Supervision; Validation; Writing – original draft; Writing – review & editing)
Jonghae Kim (Conceptualization; Methodology; Project administration; Resources; Supervision; Validation; Writing – original draft; Writing – review & editing)