Evaluating Data Sourcing for a Spam Filter
A machine learning team is building a spam filter for a new global email service that launches next month. To develop and validate the model, the team must create training and test datasets, and it has two options for sourcing the data. Evaluate the two options below and recommend the one more likely to produce a model that performs well on real-world user emails after launch. Justify your recommendation in terms of the relationship between the sourced data and the data the model will encounter in production.
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Training Error and Test Error
Data Sampling Notation from a Distribution
Conditional Probability of Pairwise Preference
A team develops a model to predict customer churn using historical data from 2019-2021. The model performs exceptionally well on a portion of this historical data set aside for testing. However, when deployed to predict churn for customers in 2023, its performance is poor. A major new loyalty program was introduced at the beginning of 2023, altering customer retention patterns. Which of the following statements best analyzes the most likely reason for this discrepancy?
A data scientist is tasked with building a model to predict real estate prices for an entire metropolitan area. To do this, they must create a training set and a test set. Which of the following data collection and splitting strategies presents the most significant risk of violating the fundamental assumption that both datasets are drawn from the same underlying probability distribution?