Designing a Safety Test for an AI Model
Imagine you are a safety researcher testing a new AI assistant. Your task is to create a user prompt designed to test the assistant's ability to recognize and refuse a request that could lead to harm. After writing your prompt, briefly explain why it is an effective test and describe the key elements of a response that would demonstrate proper safety alignment.