Learn Before
Data Debiasing by Balancing Categories
A common technique for reducing bias in training data is to balance the representation of different linguistic categories, aiming for a more equitable distribution of phenomena such as gender, ethnicity, and dialect within the dataset.
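As a minimal sketch of this idea, the snippet below downsamples every category to the size of the smallest one so each is equally represented. The function names and the category-labeling callback are illustrative assumptions, not part of any standard library; real pipelines might instead oversample minority categories or reweight examples rather than discard data.

```python
import random
from collections import defaultdict

def balance_by_category(examples, get_category, seed=0):
    """Downsample each category to the size of the smallest one.

    `examples` is a list of training items; `get_category` maps an item
    to its category label (e.g. a gender or dialect tag). This is a
    simplified sketch: it discards data from larger categories.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in examples:
        buckets[get_category(ex)].append(ex)
    target = min(len(items) for items in buckets.values())
    balanced = []
    for items in buckets.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced

# Hypothetical usage: 10 male-associated vs. 3 female-associated sentences
data = [("he went home", "male")] * 10 + [("she went home", "female")] * 3
balanced = balance_by_category(data, lambda ex: ex[1])
```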
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Gender Bias in LLMs from Data Imbalance
Cultural Bias from English-Centric LLM Training Data
Mitigating Bias Through Data Diversity
A financial institution develops a language model to automate loan application approvals. The model is trained on the institution's loan approval data from the last 20 years. During testing, it is discovered that the model denies loans to applicants from certain low-income neighborhoods at a significantly higher rate than other applicants, even when their financial profiles (e.g., credit score, income) are identical. What is the most likely cause of this biased outcome?
Analyzing Bias in an AI-Powered Hiring Tool
Analyzing Potential Bias in a Scientific Summarization Model
You are the product owner for a customer-support L...
You are the risk lead for a company rolling out an...
You lead an internal review board deciding whether...
Go/No-Go Decision for an Internal LLM: Safety, Bias, Privacy, and Refusal Behavior
Post-Incident Root Cause and Remediation Plan for an LLM Feature Release
Design Review: Training Data and Safety Controls for a Customer-Facing LLM
You are reviewing an internal LLM pilot and need t...
Triage Plan for a Safety/Bias/Privacy Incident in a Customer-Facing LLM
Vendor LLM Procurement Decision: Balancing Safety, Bias, Privacy, and Refusal Alignment
Pre-Launch Risk Acceptance Memo for a Regulated-Industry LLM Assistant
Learn After
Evaluating a Data Balancing Strategy
A development team is working to mitigate gender bias in a large text dataset. Their sole strategy is to ensure the dataset contains an equal number of sentences mentioning male-associated pronouns (e.g., 'he', 'him') and female-associated pronouns (e.g., 'she', 'her'). Which of the following describes the most significant potential pitfall of relying exclusively on this category balancing method?
An AI development team is building a sentiment analysis model for customer reviews of a global product. They discover their initial training data is composed of 85% reviews from North American English speakers and only 5% from Indian English speakers, resulting in significantly lower accuracy for the latter group. To address this issue by directly modifying the dataset's composition, which of the following actions best exemplifies the technique of balancing data categories?