Multiple Choice

When is comparing against human performance still useful for improving an ML system that already surpasses average human-level accuracy on the dev/test set?