Large-Scale Differentially Private BERT


Abstract

In this work, we study the large-scale pretraining of BERT-Large (Devlin et al., 2019) with differentially private SGD (DP-SGD). We show that, combined with a careful implementation, scaling up the batch size to millions (i.e., to mega-batches) improves the utility of the DP-SGD step for BERT; we also improve training efficiency by using an increasing batch-size schedule. Our implementation builds on the recent work of Subramani et al. (2020), who demonstrated that the overhead of a DP-SGD step is minimized by effective use of JAX (Bradbury et al., 2018; Frostig et al., 2018) primitives in conjunction with the XLA compiler (XLA team and collaborators, 2017). Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for ε = 5, which is a reasonable privacy setting. To put this number in perspective, non-private BERT models achieve an accuracy of ∼70%.
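The abstract does not include code, so the following is a minimal JAX sketch of the kind of DP-SGD step it describes: per-example gradients via jax.vmap, per-example L2 clipping, Gaussian noising, and an averaged update. The toy loss, the clip norm l2_clip, the noise multiplier noise_mult, and the plain SGD update are illustrative assumptions, not the paper's actual model, optimizer, or hyperparameters.

```python
# Minimal sketch of a DP-SGD step in JAX (not the authors' released code).
# The toy loss, clip norm, and noise multiplier are illustrative assumptions.
import jax
import jax.numpy as jnp


def dp_sgd_step(params, batch, key, lr=0.1, l2_clip=1.0, noise_mult=1.0):
    """One DP-SGD update on a single (possibly very large) batch."""

    def per_example_loss(p, x, y):
        # Placeholder regression loss; the paper instead pretrains BERT-Large
        # with a masked language model objective.
        pred = x @ p["w"] + p["b"]
        return jnp.mean((pred - y) ** 2)

    # Per-example gradients: vmap over grad, which XLA compiles efficiently
    # (the source of the low per-step overhead noted in the abstract).
    per_ex_grads = jax.vmap(jax.grad(per_example_loss), in_axes=(None, 0, 0))(
        params, batch["x"], batch["y"]
    )

    # Clip each example's gradient to global L2 norm <= l2_clip.
    def clip_one(g):
        sq = sum(jnp.sum(leaf ** 2) for leaf in jax.tree_util.tree_leaves(g))
        scale = jnp.minimum(1.0, l2_clip / (jnp.sqrt(sq) + 1e-12))
        return jax.tree_util.tree_map(lambda leaf: leaf * scale, g)

    clipped = jax.vmap(clip_one)(per_ex_grads)

    # Sum over the batch and add Gaussian noise with std noise_mult * l2_clip;
    # averaging over a mega-batch is what keeps the relative noise small.
    summed = jax.tree_util.tree_map(lambda g: jnp.sum(g, axis=0), clipped)
    leaves, treedef = jax.tree_util.tree_flatten(summed)
    noise_keys = jax.random.split(key, len(leaves))
    noisy = jax.tree_util.tree_unflatten(
        treedef,
        [
            g + noise_mult * l2_clip * jax.random.normal(k, g.shape)
            for g, k in zip(leaves, noise_keys)
        ],
    )
    batch_size = batch["x"].shape[0]
    noisy_mean = jax.tree_util.tree_map(lambda g: g / batch_size, noisy)

    # Plain SGD update, used here only to keep the sketch short.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, noisy_mean)


# Toy usage with a linear "model" and a tiny batch.
key = jax.random.PRNGKey(0)
params = {"w": jnp.zeros((16, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((8, 16)), "y": jnp.ones((8, 1))}
params = dp_sgd_step(params, batch, key)
```

Under the increasing batch-size schedule the abstract mentions, this sketch would simply be fed progressively larger batch arrays as training proceeds, with the privacy accounting adjusted accordingly.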

Citation (APA)

Anil, R., Ghazi, B., Gupta, V., Kumar, R., & Manurangsi, P. (2022). Large-Scale Differentially Private BERT. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 6510–6520). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-emnlp.484
