When we travel, we often encounter scenarios we have never experienced before, with new sights and new words that describe them. Our language-learning ability lets us quickly pick up these new words and relate them to the visual world. In contrast, language models often fail to generalize robustly to novel words and compositions.
We propose a framework that learns how to learn text representations from visual context. Experiments show that our approach significantly outperforms the state of the art in visual language modeling at acquiring new words and predicting new compositions. Model ablations and visualizations suggest that the visual modality helps our approach generalize more robustly on these tasks.
We propose a meta-learning approach that learns to learn a visual language model that generalizes to new words. We construct training episodes, each containing a reference set of text-image scenes and a target example. To train the model, we mask input elements of the target and ask the model to reconstruct them by pointing to matching elements in the reference set. At test time, the model can describe scenes with words not seen during training by pointing to those words in the reference set.
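To make the episodic pointing objective concrete, here is a minimal PyTorch sketch of one training step. It is an illustration under our own simplifying assumptions, not the paper's implementation: `PointerEpisodeModel`, the shallow transformer encoder, the pre-extracted image features, the single planted reference match, and all tensor shapes are hypothetical stand-ins for the actual architecture and data pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerEpisodeModel(nn.Module):
    """Toy episodic pointer model: a shared transformer encodes the reference
    scenes and the masked target; the masked position is then reconstructed
    by pointing to a word in the reference set (softmax over dot products)."""

    def __init__(self, vocab_size, dim=128, image_dim=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(image_dim, dim)  # map image features into token space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def encode(self, tokens, image_feats):
        # Prepend the (pre-extracted) image feature as a visual token.
        visual = self.img_proj(image_feats).unsqueeze(1)
        return self.encoder(torch.cat([visual, self.tok_emb(tokens)], dim=1))

    def forward(self, ref_tokens, ref_imgs, tgt_tokens, tgt_imgs, mask_pos):
        B, R, L = ref_tokens.shape
        # Encode every reference scene, then flatten its word positions into one pool.
        ref = self.encode(ref_tokens.reshape(B * R, L), ref_imgs.reshape(B * R, -1))
        ref_words = ref[:, 1:, :].reshape(B, R * L, -1)        # drop the visual token
        tgt = self.encode(tgt_tokens, tgt_imgs)
        query = tgt[torch.arange(B), mask_pos + 1]             # +1 skips the visual token
        return torch.einsum('bd,bnd->bn', query, ref_words)    # pointer logits over the pool

# One training step on a synthetic episode (all sizes are illustrative).
vocab, B, R, L, MASK = 100, 2, 3, 5, 0
model = PointerEpisodeModel(vocab)
ref_tokens = torch.randint(1, vocab, (B, R, L))
ref_imgs = torch.randn(B, R, 64)
tgt_tokens = torch.randint(1, vocab, (B, L))
tgt_imgs = torch.randn(B, 64)
mask_pos = torch.randint(0, L, (B,))

# Plant the masked word somewhere in the reference set so pointing can succeed,
# then replace it in the target with the MASK token.
masked_word = tgt_tokens[torch.arange(B), mask_pos].clone()
ref_tokens[:, 0, 0] = masked_word
tgt_tokens[torch.arange(B), mask_pos] = MASK
target_idx = torch.zeros(B, dtype=torch.long)  # flat index: word 0 of reference scene 0

logits = model(ref_tokens, ref_imgs, tgt_tokens, tgt_imgs, mask_pos)
loss = F.cross_entropy(logits, target_idx)
loss.backward()
```

Because the output distribution ranges over reference positions rather than a fixed vocabulary, the same mechanism can produce a word at test time that never appeared during training, provided it occurs somewhere in the reference set.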
Funding for this research was provided by DARPA GAILA HR00111990058. We thank Nvidia for GPU donations. The webpage template was inspired by this project page.