Evaluating auxiliary data for NYC housing price prediction
Case context: You are building a machine learning model to predict housing prices in New York City based on house size (feature x). Your primary dataset is small, so you consider adding a second, larger dataset of housing prices from Detroit, Michigan. Housing prices in Detroit are generally much lower than in NYC for houses of the same size.
Question: Should you include the Detroit housing dataset in your training set? Justify your decision based on the concept of data consistency.
Sample answer: No, you should not include the Detroit housing dataset. The Detroit data is an inconsistent auxiliary data source for predicting NYC housing prices: the same input feature (house size) corresponds to a very different target label (price) depending on the city. Because house size is the only input, the model has no way to tell which city an example comes from, so training on the mixed data pulls its predictions toward the much lower Detroit prices. This hurts the model's performance on the task you actually care about, predicting prices in the New York City market.
Key points:
- The Detroit dataset should be left out of the training set.
- The Detroit data is inconsistent with the NYC data.
- The same input feature maps to different labels (prices) in the two cities.
- Including the data will hurt predictive performance for the NYC target task.
Rubric: The answer must explicitly recommend excluding the Detroit data and justify the decision by identifying the data as inconsistent because the mapping from house size to price differs significantly between the two cities.
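The effect described in the sample answer can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not real housing data: the per-square-foot prices, noise levels, dataset sizes, and the helper names `make_city`, `fit_line`, and `mean_abs_error` are all hypothetical and chosen only to mimic a small NYC dataset and a larger, much cheaper Detroit dataset. It fits a simple linear model on house size alone, once with NYC data only and once with Detroit data mixed in, and compares the error on held-out NYC examples.

```python
import numpy as np


def make_city(rng, n, price_per_sqft, noise):
    """Generate synthetic (size, price) pairs. All numbers are made up."""
    size = rng.uniform(500, 3000, n)  # house size in square feet
    price = price_per_sqft * size + rng.normal(0, noise, n)
    return size, price


def fit_line(x, y):
    """Least-squares fit of price = slope * size + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def mean_abs_error(params, x, y):
    """Mean absolute error of the fitted line on (x, y)."""
    slope, intercept = params
    return float(np.mean(np.abs(slope * x + intercept - y)))


rng = np.random.default_rng(0)

# Hypothetical price levels: NYC far more expensive per square foot.
nyc_train = make_city(rng, n=30, price_per_sqft=1000.0, noise=50_000.0)
nyc_test = make_city(rng, n=200, price_per_sqft=1000.0, noise=50_000.0)
detroit = make_city(rng, n=1000, price_per_sqft=100.0, noise=10_000.0)

# Model A: trained on the small NYC dataset only.
nyc_only = fit_line(*nyc_train)

# Model B: trained on NYC plus Detroit, with no way to tell the cities apart.
mixed = fit_line(
    np.concatenate([nyc_train[0], detroit[0]]),
    np.concatenate([nyc_train[1], detroit[1]]),
)

# Both models are evaluated on the target task: held-out NYC houses.
mae_nyc_only = mean_abs_error(nyc_only, *nyc_test)
mae_mixed = mean_abs_error(mixed, *nyc_test)

print(f"NYC-only model, NYC test MAE: {mae_nyc_only:,.0f}")
print(f"Mixed model,    NYC test MAE: {mae_mixed:,.0f}")
```

In this synthetic setup the larger Detroit dataset dominates the fit, so the mixed model's price-per-size slope is dragged toward Detroit levels and its error on NYC houses is much larger than that of the NYC-only model, despite having seen far more training examples.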
References
Machine Learning Yearning (Deeplearning.ai)
Tags
Machine Learning
Deep Learning
Supervised Learning
Dive into Deep Learning @ D2L
Data Science
Machine Learning Strategy
Machine Learning Yearning @ DeepLearning.AI