Multiple Choice

A research team starts with a large language model pre-trained on a massive, diverse text corpus; the model shows strong performance on a general language understanding benchmark. They then fine-tune it on a small, highly specialized dataset for classifying medical research abstracts. After fine-tuning, the model achieves 99% accuracy on the medical abstract test set, but when re-evaluated on the original general benchmark, its performance has dropped by 20%. What is the most likely explanation for this outcome?
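The outcome described here, where narrow fine-tuning erodes previously learned general capability, is commonly called catastrophic forgetting. A minimal toy sketch of the effect (not an LLM, just a two-feature logistic classifier, with all names and tasks invented for illustration) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, axis):
    # 2-D points; the label depends only on the sign of one coordinate
    X = rng.normal(size=(n, 2))
    y = (X[:, axis] > 0).astype(float)
    return X, y

def train(w, b, X, y, lr=0.5, steps=300):
    # plain batch gradient descent on the logistic loss
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)
        p = 1 / (1 + np.exp(-z))
        w = w - lr * (X.T @ (p - y)) / len(y)
        b = b - lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    return np.mean(((X @ w + b) > 0) == (y > 0.5))

# "pre-training" task: label = sign of x0 (stand-in for the general benchmark)
Xa, ya = make_task(500, axis=0)
# "fine-tuning" task: label = sign of x1 (stand-in for the medical abstracts)
Xb, yb = make_task(500, axis=1)

w, b = np.zeros(2), 0.0
w, b = train(w, b, Xa, ya)
acc_general_before = accuracy(w, b, Xa, ya)

# long, narrow fine-tune on the specialized task only
w, b = train(w, b, Xb, yb, steps=2000)
acc_general_after = accuracy(w, b, Xa, ya)
acc_special = accuracy(w, b, Xb, yb)

print(f"general before: {acc_general_before:.2f}")
print(f"general after:  {acc_general_after:.2f}")
print(f"specialized:    {acc_special:.2f}")
```

Because the fine-tuning labels carry no information about the original task, gradient descent gradually overwrites the weight that encoded it: accuracy on the specialized task rises while accuracy on the original task falls, mirroring the 99%-up / 20%-down pattern in the question.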


Updated 2025-10-03

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science