Large-scale image-text pair datasets have greatly contributed to the development of vision-language pre-training (VLP) models, which enable zero-shot or few-shot classification without costly annotation. In the medical domain, however, data scarcity remains a significant challenge for developing a powerful VLP model. In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pairs into image-text pairs via general prompts and by utilizing multiple images and multiple sections in a radiologic report. We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports, respectively. Our model outperforms state-of-the-art models trained under the same conditions. Moreover, the enlarged dataset improves the discriminative power of our pre-trained model for classification while sacrificing only marginal retrieval performance. Code is available at https://github.com/kakaobrain/cxr-clip.
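The label-to-text expansion mentioned in the abstract can be illustrated with a short sketch. The template strings and the helper `labels_to_text` below are hypothetical illustrations of the idea, not the actual prompts used by CXR-CLIP:

```python
# Minimal sketch: turn an image-label pair into an image-text pair by
# filling prompt templates with the positive class labels.
# PROMPT_TEMPLATES and labels_to_text are illustrative, not the paper's exact prompts.
import random

PROMPT_TEMPLATES = [
    "there is {label}.",
    "findings suggesting {label}.",
    "the chest x-ray shows {label}.",
]

def labels_to_text(labels: list[str]) -> str:
    """Generate a report-like sentence for each positive label."""
    sentences = [
        random.choice(PROMPT_TEMPLATES).format(label=label.lower())
        for label in labels
    ]
    return " ".join(sentences)

print(labels_to_text(["Cardiomegaly", "Pleural Effusion"]))
```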
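The study-level contrastive losses (ICL over images, TCL over report sections) can be approximated by a symmetric InfoNCE objective between two views drawn from the same study. The sketch below assumes pre-computed embeddings and a standard temperature-scaled formulation; it is an approximation of the idea, not the paper's exact loss definitions:

```python
# Minimal PyTorch sketch of a symmetric InfoNCE loss, used here as a
# stand-in for ICL (two images of one study) and TCL (two sections of
# one report). Embedding sizes and the temperature are assumptions.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE where a[i] and b[i] form a positive pair."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = positives
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# ICL: two images from the same study; TCL: two sections (e.g., findings
# and impression) from the same report. Random tensors stand in for
# encoder outputs in this sketch.
img1, img2 = torch.randn(8, 512), torch.randn(8, 512)
txt1, txt2 = torch.randn(8, 512), torch.randn(8, 512)
loss = info_nce(img1, img2) + info_nce(txt1, txt2)
```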
Citation:
You, K., Gu, J., Ham, J., Park, B., Kim, J., Hong, E. K., … Roh, B. (2023). CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 14221 LNCS, pp. 101–111). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-43895-0_10