ing/blob/master/ We present a novel approach to image captioning that requires no additional annotation, which makes it applicable to any dataset. Our model needs only images and their captions, yet it still produces state-of-the-art results, even on the Conceptual Captions dataset, which contains over 3M images. We use the CLIP model, which was pre-trained on an extremely large number of images, to generate semantic encodings for arbitrary images without further supervision. We then fine-tune a pre-trained language model to produce meaningful sentences from these encodings. The key idea is to use the CLIP encoding as a prefix to the textual caption: a mapping network is applied to the raw encoding, and the language model is fine-tuned on the resulting prefix. We also propose a variant that uses a transformer architecture for the mapping network and avoids fine-tuning GPT-2 altogether; this lightweight model still achieves performance comparable to existing methods on the nocaps dataset.
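
The prefix idea can be illustrated with a minimal sketch. It is not the authors' implementation: the class `MlpMapper`, the helper `build_lm_input`, the two-layer MLP shape, and all dimensions are hypothetical, and it is written in plain Python with toy sizes so that it runs without dependencies. It shows only the data flow: a single CLIP vector is mapped to a fixed number of prefix embeddings, which are placed in front of the caption token embeddings before the sequence is passed to the language model.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, n_in, n_out):
    """Random weights and zero bias for an n_in -> n_out layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MlpMapper:
    """Hypothetical two-layer MLP: one CLIP vector -> prefix_len embeddings."""

    def __init__(self, clip_dim, embed_dim, prefix_len, seed=0):
        rng = random.Random(seed)
        hidden = (embed_dim * prefix_len) // 2  # assumed hidden width
        self.embed_dim = embed_dim
        self.prefix_len = prefix_len
        self.layer1 = init_linear(rng, clip_dim, hidden)
        self.layer2 = init_linear(rng, hidden, embed_dim * prefix_len)

    def __call__(self, clip_vec):
        h = [math.tanh(v) for v in linear(clip_vec, *self.layer1)]
        flat = linear(h, *self.layer2)
        d = self.embed_dim
        # Split the flat output into prefix_len vectors of size embed_dim.
        return [flat[i * d:(i + 1) * d] for i in range(self.prefix_len)]


def build_lm_input(prefix, caption_embeddings):
    """Prefix vectors come first; caption token embeddings follow."""
    return prefix + caption_embeddings


# Toy sizes (assumptions); real CLIP and GPT-2 dimensions are far larger.
CLIP_DIM, EMBED_DIM, PREFIX_LEN = 8, 4, 3

mapper = MlpMapper(CLIP_DIM, EMBED_DIM, PREFIX_LEN)
clip_vec = [0.1 * i for i in range(CLIP_DIM)]                # stand-in for a CLIP encoding
caption_embeddings = [[0.0] * EMBED_DIM for _ in range(5)]   # stand-in for 5 caption tokens

prefix = mapper(clip_vec)
lm_input = build_lm_input(prefix, caption_embeddings)
print(len(prefix), len(lm_input), len(lm_input[0]))  # prints: 3 8 4
```

In the first variant described above, both the mapper and the language model would be trained on such sequences. In the transformer variant, only the mapping network would be trained and GPT-2 would stay frozen.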
