
Implementing Multi-task Learning in Deep Learning

Instead of a single output, we use a vector of binary outputs in the final layer. For any datapoint, any number of these outputs can be 0 or 1 at the same time. This is in contrast to the softmax activation function, where exactly one label can be true.
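As a minimal sketch (assuming PyTorch, a hypothetical 16-feature input, and 4 independent labels), the final layer simply emits one logit per label, and a sigmoid turns each logit into an independent probability:

```python
import torch
import torch.nn as nn

# Hypothetical multi-label setup: 16 input features, 4 independent binary labels.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),  # one logit per label, no softmax
)

x = torch.randn(8, 16)           # batch of 8 datapoints
probs = torch.sigmoid(model(x))  # shape (8, 4); each entry is an independent probability
preds = (probs > 0.5).int()      # any number of the 4 labels can be 1 simultaneously
```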

Because we have a series of (typically independent) outputs, each output unit uses a sigmoid activation, and the loss is the sum of the per-output binary cross-entropy (logistic) losses, just as in logistic regression, except that y now denotes a vector of binary outputs rather than a single value.
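Written out, the loss for m datapoints with K binary outputs is the summed binary cross-entropy (a standard formulation; the symbols m, K, and \hat{y} for the predicted probabilities are notation introduced here):

\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log \hat{y}_k^{(i)} + \left(1 - y_k^{(i)}\right) \log\left(1 - \hat{y}_k^{(i)}\right) \right]

In PyTorch this corresponds to nn.BCEWithLogitsLoss, which applies the sigmoid and the per-output binary cross-entropy in one numerically stable step (a sketch, reusing the hypothetical 4-label shapes from above):

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy per output

logits = torch.randn(8, 4, requires_grad=True)  # raw outputs for 8 datapoints
y = torch.randint(0, 2, (8, 4)).float()         # y is a vector of binary labels
loss = criterion(logits, y)                     # averaged over outputs and batch
loss.backward()                                 # gradients flow to every output unit
```

Note that the default reduction averages over the outputs as well as the datapoints, which differs from the summed form above only by a constant rescaling of the gradient.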


Updated 2021-04-07

Tags

Data Science