The Catholic University of Korea

Research Results


Department of Data Science Professor Kangmin Kim’s Research Team Presents Two Papers at EMNLP 2025

  • Writer: External Affairs Team
  • Date: 2025.11.24

  • Proposed key techniques to address bias and hallucination issues in large language models

  • Presented the results at EMNLP, which the Korea Information Science Society classifies as a top-tier international conference in computer science


Photo (left): Master’s student Park Sung-jin presenting at EMNLP 2025, held in Suzhou, China, from November 4 to 9
Photo (right): Students Park Su-hyung and Kim Ho-beom

The research team of Professor Kangmin Kim from the Department of Data Science and the Department of Artificial Intelligence at The Catholic University of Korea presented two papers at Empirical Methods in Natural Language Processing 2025 (EMNLP 2025), one of the world’s leading international conferences in natural language processing. The papers proposed techniques to mitigate bias and hallucination issues in large language models (LLMs).

◆ Development of techniques for measuring and mitigating political bias in large language models with respect to news outlet names

(Figure 1) Overview of the methodology and results for measuring political bias in large language models with respect to news outlet names

The first paper, “Measuring and Mitigating Media Outlet Name Bias in Large Language Models,” systematically measured the political bias that large language models such as ChatGPT exhibit when processing news articles based solely on the name of the media outlet, and proposed techniques to mitigate this bias.

The research team confirmed through experiments the presence of an “anchoring effect,” in which the model’s judgment changes depending on the media outlet name given as the source—for example, interpreting the same article more progressively when attributed to CNN and more conservatively when attributed to Fox News. To quantitatively assess this bias, the team proposed a new metric called SIPS (Source-Induced Prediction Shift), which integrates absolute sensitivity, agreement, and consistency, allowing the degree of model bias to be expressed as a value between 0 and 1.
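To make the idea concrete, the minimal Python sketch below quantifies a source-induced prediction shift in the simplest possible way. The real SIPS metric additionally integrates absolute sensitivity, agreement, and consistency as defined in the paper, so the function, data, and label names here are illustrative assumptions rather than the team's implementation.

```python
# Illustrative (hypothetical) sketch of a source-induced prediction shift score.
# The actual SIPS metric integrates absolute sensitivity, agreement, and consistency;
# this simplification only captures the core idea: the same articles are judged
# under different outlet attributions, and the fraction of predictions that flip
# is reported as a value between 0 and 1.

def prediction_shift(preds_by_source: dict[str, list[str]]) -> float:
    """Map each outlet name to the model's predicted leaning per article,
    then return the share of articles whose prediction depends on the outlet."""
    sources = list(preds_by_source)
    n_articles = len(preds_by_source[sources[0]])
    shifted = sum(
        1 for i in range(n_articles)
        if len({preds_by_source[s][i] for s in sources}) > 1
    )
    return shifted / n_articles   # 0 = no shift, 1 = every prediction flips

# Toy usage: three articles judged under two different source attributions.
preds = {
    "CNN":      ["left",  "left", "center"],
    "Fox News": ["right", "left", "right"],
}
print(prediction_shift(preds))   # ≈ 0.67: two of three predictions flip
```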

Experimental results revealed that major language models—including Qwen-2.5, Mistral-Small, Phi-4, Llama-3.3, Gemma-2, and GPT-4.1—all exhibit media-outlet-based bias. Notably, the bias tended to be stronger in larger models and in models that underwent alignment tuning such as RLHF (Reinforcement Learning from Human Feedback).

The team also developed an automated prompt optimization framework to test the potential for bias mitigation. As a result, the SIPS score of Qwen-2.5 decreased from 0.529 to 0.279, and that of GPT-4.1 decreased from 0.421 to 0.293, demonstrating meaningful improvements.
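As a rough illustration of what such a framework might do, the sketch below greedily searches a small pool of candidate system prompts for the one with the lowest bias score. The candidate prompts and the evaluate_sips stub are hypothetical placeholders; the article does not disclose the team's actual optimization procedure.

```python
# Hypothetical sketch of automated prompt optimization for bias mitigation.
# Assumptions (not from the article): a fixed candidate pool of system prompts
# and an evaluate_sips() stub standing in for re-scoring the model on the benchmark.

def evaluate_sips(prompt: str) -> float:
    """Stub: pretend that longer, more explicit instructions reduce bias.
    A real framework would re-run the full benchmark with this prompt."""
    return max(0.1, 0.6 - 0.01 * len(prompt.split()))

def optimize_prompt(candidates):
    """Greedy search: keep the candidate prompt with the lowest SIPS score."""
    best_prompt, best_score = None, float("inf")
    for prompt in candidates:
        score = evaluate_sips(prompt)   # lower SIPS = weaker outlet-name bias
        if score < best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score

candidates = [
    "Classify the article's political leaning.",
    "Classify the article's political leaning based only on its content, "
    "ignoring the name of the outlet it is attributed to.",
]
print(optimize_prompt(candidates))
```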

◆ Development of a knowledge-graph-based, context-aware medical counseling framework

(Figure 2) Overall architecture of the knowledge-graph-based, context-aware medical consultation framework

The second paper, “Leveraging Knowledge Graph-Enhanced LLMs for Context-Aware Medical Consultation,” proposed the ILlama (Informatics Llama) framework, which alleviates hallucination issues in large language models in the medical domain and provides more accurate medical consultations.

Existing medical consultation systems such as ChatDoctor rely on keyword-based search, which often fails to retrieve sufficient relevant medical information and can lead to the generation of inaccurate content. To address this limitation, the research team leveraged a structured knowledge graph built on UMLS (Unified Medical Language System), a standard medical terminology system.

At the core of ILlama is a subgraph-based retrieval method that structurally represents causal and semantic relationships between diseases and symptoms. The team constructed a UMLS knowledge graph consisting of approximately 20,000 medical concepts, 22 types of relations, and 250,000 triples, and stored each subgraph in a vector database. When a patient’s query is entered, the system retrieves the most semantically relevant medical knowledge and uses it to generate answers. For example, if a patient reports “cough, shortness of breath, and fatigue,” ILlama identifies, via the knowledge graph, how these symptoms are associated with specific diseases such as lung cancer or anemia.
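A minimal sketch of this retrieval step, under simplified assumptions, is shown below. The embed function, the toy triples, and the in-memory list standing in for a vector database are all illustrative placeholders, not the team's actual UMLS graph or ILlama code.

```python
# Minimal sketch of subgraph-based retrieval over a UMLS-style knowledge graph,
# in the spirit of the pipeline described above. Everything here is a simplified
# assumption: embed() is a stand-in for a trained text-embedding model, the
# triples are toy examples, and the "vector database" is a plain Python list.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: hash character trigrams into a fixed-size vector."""
    vec = np.zeros(256)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Toy subgraphs: each one verbalizes a few (symptom, relation, disease) triples.
subgraphs = [
    "cough -> associated_with -> lung cancer; shortness of breath -> associated_with -> lung cancer",
    "fatigue -> symptom_of -> anemia; pallor -> symptom_of -> anemia",
    "fever -> symptom_of -> influenza; muscle ache -> symptom_of -> influenza",
]
index = [(sg, embed(sg)) for sg in subgraphs]   # stand-in for a vector database

def retrieve(query: str, k: int = 2):
    """Return the k stored subgraphs most similar to the patient's query."""
    q = embed(query)
    scored = sorted(index, key=lambda item: -float(q @ item[1]))
    return [sg for sg, _ in scored[:k]]

query = "I have a cough, shortness of breath, and fatigue."
context = retrieve(query)
prompt = ("Relevant medical knowledge:\n" + "\n".join(context)
          + f"\n\nPatient: {query}\nAnswer:")
print(prompt)   # this grounded prompt would then be passed to the LLM
```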

In performance evaluations, ILlama achieved an F1 score of 0.884 for semantic similarity on the HealthCareMagic dataset, outperforming all existing baseline models. In real-world testing using the iCliniq dataset, it also recorded a high score of 0.871, demonstrating excellent generalization performance. A qualitative evaluation using the OpenAI o1 model further confirmed that hallucinations were significantly reduced and clinical usefulness improved.

Professor Kangmin Kim of the Department of Data Science and the Department of Artificial Intelligence at The Catholic University of Korea stated, “This research presents practical methods to address two key challenges of large language models: bias and hallucination,” adding, “Our media-outlet bias mitigation technique will enhance the fairness of AI-based news services, while the medical consultation system will contribute to ensuring patient safety.”

This research was conducted by a team led by Professor Kangmin Kim of the Department of Data Science and the Department of Artificial Intelligence at The Catholic University of Korea, together with Master’s student Park Sung-jin and undergraduate students Park Su-hyung and Kim Ho-beom, with support from the Excellent Young Researcher Program of the National Research Foundation of Korea and a project funded by the Institute of Information & Communications Technology Planning & Evaluation (IITP).