Artificial intelligence (AI) must be trained before it can be applied, especially in the medical field. Simulated interviews and natural language processing (NLP) are useful for this task.
Researchers published an article in Scientific Data, a Nature Portfolio journal, detailing the creation of a dataset for AI training built from medical conversations in the Objective Structured Clinical Examination (OSCE) format. The work focused on respiratory cases, and its objective was to provide the medical research community with a complete dataset of medical conversations.
Research on and application of AI using data from medical conversations are generally limited, because training on real conversations can conflict with patient privacy and data-sharing regulations.
To work around these constraints, the authors of the article developed a method for simulating medical conversations that can be used to train AI for health applications. A team of residents in internal medicine, physical medicine, anatomical pathology, and family medicine, together with medical students, created the dataset by simulating medical interviews in the OSCE format.
The interviews were recorded and transcribed. A total of 272 simulated conversations between doctors and patients were recorded and grouped by type of case, although most of them were simulated respiratory cases.
Interview transcripts are useful for training various NLP models and for measuring the accuracy of transcription tools, among other uses. In the dataset presented by this research, common errors in the transcription of the audio recordings were corrected, which makes it suitable for training NLP models.
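As an illustration of how the accuracy of a transcription tool might be measured against corrected transcripts, the sketch below computes the word error rate (WER), a common metric for this purpose. The article does not say which metric the authors or other users of the dataset rely on, so the choice of WER, the function name word_error_rate, and the example sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: `reference` stands for a corrected transcript
    and `hypothesis` for the output of a transcription tool.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences, not taken from the dataset.
reference = "the patient reports a dry cough"
hypothesis = "the patient report a dry cough"
print(round(word_error_rate(reference, hypothesis), 3))  # 0.167 (1 error in 6 words)
```

A lower value means the tool's output is closer to the corrected transcript; a value of 0 means the two match word for word.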
“More importantly, access to data of this caliber is a significant challenge for many researchers due to the sensitive nature of the data, government regulations that limit data sharing in research, and the question of monetization of the data. Therefore, the presented dataset of complete medical conversations in audio and text format is a valuable asset for academia and the medical industry,” the authors explain.
However, one of the main limitations of this dataset is the small number of simulated cases of non-respiratory diseases. Of the 272 simulated conversations, 214 (about 79%) corresponded to respiratory cases, and the remaining 58 to cardiac, dermatological, gastrointestinal, and musculoskeletal cases.
You can read the study in detail at the following link: https://www.nature.com/articles/s41597-022-01423-1