Learn Before
Enhancing LLM Safety through Alignment
The safety of Large Language Models (LLMs) can be significantly improved by aligning their behavior with human expectations. Alignment is typically achieved by guiding the model with human-labeled data and by incorporating continuous feedback from user interactions in real-world deployments.
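A minimal sketch of one ingredient of this process: learning a reward signal from human-labeled preference pairs (a Bradley-Terry-style objective, as used in RLHF reward modeling), then folding in new user feedback collected during deployment. All names and data below are hypothetical, and the linear bag-of-words "reward model" stands in for the neural reward model and LLM policy a real system would use.

```python
import math

# Hypothetical human-labeled data: (preferred response, rejected response).
preference_pairs = [
    ("I can't help with that request.", "Here is how to pick a lock:"),
    ("Let's find a safe alternative.", "Sure, detailed exploit steps:"),
]

weights = {}  # toy linear reward model over word features


def reward(text):
    """Score a response as the sum of learned per-word weights."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())


def train(pairs, lr=0.1, epochs=200):
    """Gradient ascent on log-sigmoid(reward(chosen) - reward(rejected))."""
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(chosen) - reward(rejected)
            grad = 1.0 / (1.0 + math.exp(margin))  # = 1 - sigmoid(margin)
            for w in chosen.lower().split():
                weights[w] = weights.get(w, 0.0) + lr * grad
            for w in rejected.lower().split():
                weights[w] = weights.get(w, 0.0) - lr * grad


train(preference_pairs)

# Deployment-time loop: user feedback becomes new preference data, and the
# reward model (and, in a full system, the LLM policy) is updated over time.
new_feedback = ("Thanks, that was helpful and safe.", "Unsafe answer reported by a user")
preference_pairs.append(new_feedback)
train([new_feedback])

print(reward("I can't help with that request."))   # preferred responses score higher
print(reward("Here is how to pick a lock:"))       # rejected responses score lower
```

The key design point the sketch illustrates is that both alignment signals named above reduce to the same mechanism: human judgments, whether collected up front as labeled data or continuously from live interactions, become training pairs that shift the model toward preferred behavior.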
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Characteristics of Safe AI Systems
Guidelines for Safe and Responsible AI Use
Researcher Calls for Cautious AI Development
LLM Alignment
AI System Development Scenario
A technology company develops a powerful new AI model capable of writing computer code. The model is highly efficient and can generate complex software in minutes. However, it is discovered that the model sometimes generates code with subtle security vulnerabilities that could be exploited by malicious actors. This discovery primarily highlights a failure in which area of AI development?
Unintended Consequences of AI Optimization
Go/No-Go Decision for an Internal LLM: Safety, Bias, Privacy, and Refusal Behavior
Post-Incident Root Cause and Remediation Plan for an LLM Feature Release
Design Review: Training Data and Safety Controls for a Customer-Facing LLM
Triage Plan for a Safety/Bias/Privacy Incident in a Customer-Facing LLM
Vendor LLM Procurement Decision: Balancing Safety, Bias, Privacy, and Refusal Alignment
Pre-Launch Risk Acceptance Memo for a Regulated-Industry LLM Assistant
You lead an internal review board deciding whether...
You are reviewing an internal LLM pilot and need t...
You are the product owner for a customer-support L...
You are the risk lead for a company rolling out an...
Learn After
Evaluating Model Alignment Strategies
A technology company develops a powerful language model for public use. The company discovers that, when asked certain questions, the model occasionally generates detailed, unsafe instructions. To address this safety concern, the company adopts an alignment process guided by human input. Which of the following actions best exemplifies this alignment process?
Critique of Human-Guided LLM Alignment for Safety