Case Study

Evaluating auxiliary data for NYC housing price prediction

Case context: You are building a machine learning model to predict housing prices in New York City based on house size (feature x). Your primary dataset is small, so you consider adding a second, larger dataset of housing prices from Detroit, Michigan. Housing prices in Detroit are generally much lower than in NYC for houses of the same size.

Question: Should you include the Detroit housing dataset in your training set? Justify your decision based on the concept of data consistency.

Sample answer: No, you should not include the Detroit housing dataset. It is an inconsistent auxiliary data source for predicting NYC housing prices: the same input feature (house size) corresponds to a very different target label (price) depending on the city. Mixing the two sources pulls the model toward price associations that do not hold in New York City, and because the Detroit dataset is the larger of the two, it would dominate training. The result is worse performance when predicting prices specifically for the New York City market.

Key points:

  • The Detroit dataset should be left out of the training set.
  • The Detroit data is inconsistent with the NYC data.
  • The same input feature maps to different labels (prices) in the two cities.
  • Including the data will hurt predictive performance for the NYC target task.

Rubric: The answer must explicitly recommend excluding the Detroit data and justify the decision by identifying the data as inconsistent because the mapping from house size to price differs significantly between the two cities.
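
Illustration: the effect described in the sample answer can be sketched with a small simulation. Everything below is a hypothetical setup, not real housing data: the per-square-foot prices (1000 for NYC, 100 for Detroit), the noise levels, the dataset sizes, and the helper names make_city, fit_line, and rmse are all assumptions chosen only to make the inconsistency visible. The sketch fits a one-feature linear model by least squares, once on the small NYC set alone and once on NYC plus the larger Detroit set, then compares both on held-out NYC houses.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_city(n, price_per_sqft, noise_std):
    """Synthetic (size, price) pairs for one city. All numbers are made up."""
    size = rng.uniform(500, 3000, size=n)
    price = price_per_sqft * size + rng.normal(0, noise_std, size=n)
    return size, price

def fit_line(size, price):
    """Least-squares fit of price = slope * size + intercept."""
    X = np.column_stack([size, np.ones_like(size)])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return coef

def rmse(coef, size, price):
    """Root-mean-squared error of the fitted line on the given data."""
    pred = coef[0] * size + coef[1]
    return float(np.sqrt(np.mean((pred - price) ** 2)))

# Assumed setup: a small NYC training set, a larger Detroit set,
# and a held-out NYC test set that represents the target task.
nyc_train = make_city(30, 1000.0, 50_000.0)
nyc_test = make_city(200, 1000.0, 50_000.0)
detroit = make_city(1000, 100.0, 10_000.0)

# Model A: trained on NYC data only.
coef_nyc = fit_line(*nyc_train)

# Model B: trained on NYC and Detroit data mixed together.
mixed_size = np.concatenate([nyc_train[0], detroit[0]])
mixed_price = np.concatenate([nyc_train[1], detroit[1]])
coef_mixed = fit_line(mixed_size, mixed_price)

rmse_nyc_only = rmse(coef_nyc, *nyc_test)
rmse_mixed = rmse(coef_mixed, *nyc_test)

print(f"NYC-only model, RMSE on NYC test: {rmse_nyc_only:,.0f}")
print(f"Mixed model,    RMSE on NYC test: {rmse_mixed:,.0f}")
```

Under these assumptions the mixed model's error on NYC houses should come out far larger, because the Detroit examples outnumber the NYC ones and drag the fitted slope toward Detroit's size-to-price mapping. With real data the gap depends on how different the two mappings actually are, so the comparison is best run on a held-out set drawn from the target market.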

Updated 2026-06-13

Tags

Machine Learning

Deep Learning

Supervised Learning

Dive into Deep Learning @ D2L

Data Science

Machine Learning Strategy

Machine Learning Yearning @ DeepLearning.AI