Can Small Heads Help? Understanding and Improving Multi-Task Generalization


Abstract

Multi-task learning aims to solve multiple machine learning tasks at the same time, with good solutions being both generalizable and Pareto optimal. A multi-task deep learning model consists of a shared representation learned to capture task commonalities, and task-specific sub-networks capturing the specificities of each task. In this work, we offer insights on the under-explored trade-off between minimizing task training conflicts in multi-task learning and improving multi-task generalization, i.e., the generalization capability of the shared representation across all tasks. The trade-off can be viewed as the tension between multi-objective optimization and shared representation learning: as a multi-objective optimization problem, sufficient parameterization is needed for mitigating task conflicts in a constrained solution space; however, from a representation learning perspective, over-parameterizing the task-specific sub-networks may give the model too many "degrees of freedom" and impede the generalizability of the shared representation. Specifically, we first present insights on understanding the parameterization effect of multi-task deep learning models and empirically show that larger models are not necessarily better in terms of multi-task generalization. A delicate balance between mitigating task training conflicts and improving the generalizability of shared representation learning is needed to achieve optimal performance across multiple tasks. Motivated by our findings, we then propose the use of an under-parameterized self-auxiliary head alongside each task-specific sub-network during training, which automatically balances the aforementioned trade-off. As the auxiliary heads are small and discarded at inference time, the proposed method incurs minimal training cost and no additional serving cost. We conduct experiments with the proposed self-auxiliaries on two public datasets and live experiments on one of the largest industrial recommendation platforms serving billions of users. The results demonstrate the effectiveness of the proposed method in improving predictive performance across multiple tasks in multi-task models.
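To make the described mechanism concrete, below is a minimal sketch of a shared-bottom multi-task model with training-only self-auxiliary heads. This is an illustrative reading of the abstract, not the paper's implementation: the class name, layer sizes, loss function, and the fixed auxiliary loss weight `aux_weight` are assumptions.

```python
# Sketch: shared representation + full task heads + small auxiliary heads used
# only during training; at inference the auxiliary heads are discarded.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedBottomWithAuxHeads(nn.Module):
    def __init__(self, input_dim, shared_dim, task_hidden, aux_hidden,
                 num_tasks, aux_weight=0.1):
        super().__init__()
        self.aux_weight = aux_weight  # assumed fixed weight on auxiliary losses
        # Shared representation capturing task commonalities.
        self.shared = nn.Sequential(nn.Linear(input_dim, shared_dim), nn.ReLU())
        # Full-size task-specific sub-networks (kept at serving time).
        self.task_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(shared_dim, task_hidden), nn.ReLU(),
                          nn.Linear(task_hidden, 1))
            for _ in range(num_tasks)
        ])
        # Under-parameterized self-auxiliary heads (training only).
        self.aux_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(shared_dim, aux_hidden), nn.ReLU(),
                          nn.Linear(aux_hidden, 1))
            for _ in range(num_tasks)
        ])

    def forward(self, x):
        h = self.shared(x)
        main = [head(h).squeeze(-1) for head in self.task_heads]
        # Auxiliary predictions are only computed in training mode.
        aux = [head(h).squeeze(-1) for head in self.aux_heads] if self.training else None
        return main, aux

    def loss(self, main, aux, targets):
        # Each task contributes its main loss plus a down-weighted auxiliary loss;
        # the small auxiliary heads pressure the shared layer to remain predictive
        # on its own rather than offloading all fitting to large task heads.
        total = sum(F.binary_cross_entropy_with_logits(m, t)
                    for m, t in zip(main, targets))
        if aux is not None:
            total = total + self.aux_weight * sum(
                F.binary_cross_entropy_with_logits(a, t)
                for a, t in zip(aux, targets))
        return total
```

At serving time the model is run in eval mode, so only the shared layer and the original task-specific heads are executed, which is consistent with the abstract's claim of no additional serving cost.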

Citation (APA)

Wang, Y., Zhao, Z., Dai, B., Fifty, C., Lin, D., Hong, L., … Chi, E. H. (2022). Can Small Heads Help? Understanding and Improving Multi-Task Generalization. In WWW 2022 - Proceedings of the ACM Web Conference 2022 (pp. 3009–3019). Association for Computing Machinery, Inc. https://doi.org/10.1145/3485447.3512021
