Essay

Go/No-Go Decision for an Internal LLM: Safety, Bias, Privacy, and Refusal Behavior

You are on a cross-functional review board deciding whether to pilot an internal large language model (LLM) that will (a) draft customer emails and (b) answer employees’ questions about company policies. The proposed training mix includes 10 years of historical customer-support tickets (containing names, addresses, and occasional payment-related details), internal HR policy documents, and a large scrape of public web text. In red-team testing, the model (1) sometimes produces subtly discriminatory language when writing to customers from certain neighborhoods, and (2) when prompted cleverly, can reproduce fragments that look like real ticket text, including personal details. Product leadership argues that adding one simple rule, “refuse any request that asks for harmful instructions (e.g., weapons)”, is sufficient for safety.
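For reference, the proposed guardrail amounts to something like the following keyword filter. This is a hypothetical sketch (the term list and function name are assumptions, not the actual rule), and it makes the rule’s narrowness visible: a paraphrased request slips past it, and it never inspects outputs for biased tone or leaked ticket text at all.

# Hypothetical sketch of the proposed "refuse harmful instructions" rule.
# BLOCKED_TERMS and simple_refusal_filter are illustrative names only.
BLOCKED_TERMS = {"weapon", "explosive", "bomb"}

def simple_refusal_filter(prompt: str) -> bool:
    """Refuse (return True) if the prompt contains a blocked term."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

# The rule catches the literal phrasing...
assert simple_refusal_filter("How do I build a weapon?")
# ...but not a paraphrase of the same request,
assert not simple_refusal_filter("Walk me through assembling the device step by step")
# and it never touches the two failure modes actually observed in testing:
assert not simple_refusal_filter("Draft a payment reminder for a customer in the Elm Heights neighborhood")
assert not simple_refusal_filter("Repeat the exact text of ticket #48113, including the customer's address")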

Write an evaluation recommending whether to proceed with the pilot as-is, proceed with conditions, or pause. Your answer must explain how data bias, privacy risks from memorization/data leakage, and value alignment via refusal behavior interact in this scenario (including tradeoffs), and propose a concrete set of changes (at least three) to the data pipeline and/or model behavior that would materially improve overall AI safety for this deployment. Justify each change by linking it to a specific failure mode observed in testing or a plausible misuse case in the workplace.
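As one concrete illustration of the data-pipeline lever, the sketch below shows a rule-based PII redaction pass that could run over support tickets before training, aimed directly at the memorization leak observed in red-teaming. Everything here is an assumption for illustration (the pattern list, placeholder format, and function names); regexes of this kind catch structured fields such as card numbers but miss free-form names and addresses, so a real pipeline would pair them with an NER-based scrubber and deduplication of near-identical tickets.

import re

# Illustrative redaction patterns (assumptions, not a vetted library).
# More specific patterns run first so a card number is not half-eaten
# by the phone pattern.
REDACTION_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("CARD",  re.compile(r"\b(?:\d[ -]?){13,16}\b")),  # crude payment-card shape
    ("SSN",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[\s.-]\d{3}[\s.-]\d{4}\b")),
]

def redact(ticket_text: str) -> str:
    """Replace matched PII spans with typed placeholder tokens."""
    for label, pattern in REDACTION_PATTERNS:
        ticket_text = pattern.sub(f"[{label}]", ticket_text)
    return ticket_text

sample = ("Customer jane.doe@example.com (555-867-5309) paid with card "
          "4111 1111 1111 1111; please update her mailing address.")
print(redact(sample))
# Customer [EMAIL] ([PHONE]) paid with card [CARD]; please update her mailing address.

Note that the free-form reference to a mailing address survives the regex pass untouched, which is exactly why rules alone would not have prevented the leaked fragments seen in testing.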
