Object Recognition as Next Token Prediction (CVPR 2024 Highlight)

Object Recognition as Next Token Prediction

arXiv | Colab | Documentation | Hugging Face

Top 30 predictions with probabilities from our model on the image of "The Legend of Zelda: Tears of the Kingdom" (footnote 1).

Introduction

This is the official PyTorch implementation of the paper Object Recognition as Next Token Prediction, accepted at CVPR 2024 (Highlight).

@inproceedings{nxtp,
  title     = {{Object Recognition as Next Token Prediction}},
  author    = {Kaiyu Yue and Bor-Chun Chen and Jonas Geiping and Hengduo Li and Tom Goldstein and Ser-Nam Lim},
  booktitle = {Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2024}
}

Updates

May 26, 2024

Mar 17, 2024

Mar 03, 2024

Method

This project addresses a fundamental problem in computer vision, object recognition: translating an image into object labels.



Linear models (such as ResNet) and contrastive models (such as CLIP) require predefined labels before inference, limiting their flexibility in real-world applications.

We extend W, the set of label embeddings that those models fix in advance, to cover the entire textual space by using the token embeddings of a language model such as LLaMA (32K tokens). Our model then predicts labels in a real-open manner through auto-regressive processing.

Additionally, our one-shot sampling technique enables efficient large-scale discriminative predictions, such as the top-100 labels.
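As a loose illustration of predicting several labels in parallel, the sketch below picks the top-k first tokens from a single initial distribution, extends each one independently, and ranks the finished labels by the product of their token probabilities. It is a toy under stated assumptions, not the repository's implementation: the vocabulary, the probabilities, the `<end>` delimiter, the greedy continuation, and the names `TOY_MODEL`, `next_token_probs`, and `one_shot_sample` are all made up for the example.

```python
# Toy stand-in for a trained decoder: maps a token prefix to a next-token
# distribution. All tokens and probabilities here are invented.
END = "<end>"  # hypothetical delimiter that closes a label

TOY_MODEL = {
    (): {"co": 0.5, "sh": 0.3, "bar": 0.2},
    ("co",): {"ffee": 0.8, "unter": 0.2},
    ("sh",): {"op": 0.9, "elf": 0.1},
    ("bar",): {END: 1.0},
    ("co", "ffee"): {END: 1.0},
    ("co", "unter"): {END: 1.0},
    ("sh", "op"): {END: 1.0},
    ("sh", "elf"): {END: 1.0},
}


def next_token_probs(prefix):
    """Return the (toy) next-token distribution for a token prefix."""
    return TOY_MODEL[tuple(prefix)]


def one_shot_sample(num_labels, max_len=4):
    """Pick the top-k first tokens once, then extend each label on its own."""
    first = next_token_probs([])
    starts = sorted(first, key=first.get, reverse=True)[:num_labels]

    results = []
    for tok in starts:
        tokens, prob = [tok], first[tok]
        # Assumption: continue greedily until the delimiter or a length cap.
        while tokens[-1] != END and len(tokens) < max_len:
            dist = next_token_probs(tokens)
            best = max(dist, key=dist.get)
            tokens.append(best)
            prob *= dist[best]
        label = "".join(t for t in tokens if t != END)
        results.append((label, prob))

    # Rank labels by the product of their token probabilities.
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for label, prob in one_shot_sample(num_labels=3):
        print(f"prob: {prob:.5f} - {label}")
```

Because each label starts from a different first token, the labels never compete for the same continuation, which is what makes large top-k requests cheap in this toy setting.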



The released models have 1.78B parameters. A truncated 0.77B-parameter variant, whose decoder keeps only one transformer block, still achieves competitive performance (Table 3 in the paper).

Examples

Each example lists the top-20 predictions for an image, followed by the decoder layer and head of its attention map.


Top-20 predictions (image credit: footnote 1):
prob: 0.13949 - legend
prob: 0.12399 - sky
prob: 0.04723 - cloud
prob: 0.04642 - game
prob: 0.04500 - screenshot
prob: 0.03189 - top
prob: 0.03024 - mountain
prob: 0.02262 - cliff
prob: 0.01790 - world
prob: 0.01483 - wii
prob: 0.01440 - video
prob: 0.01310 - breath
prob: 0.01087 - zeo
prob: 0.00982 - zelda
prob: 0.00959 - character
prob: 0.00865 - rock
prob: 0.00816 - link
prob: 0.00788 - island
prob: 0.00624 - adventure
prob: 0.00591 - woman


Attention map: decoder, layer 0, head 25

Top-20 predictions (image credit: footnote 2):
prob: 0.23237 - rocket
prob: 0.10435 - launch
prob: 0.06144 - soyuz
prob: 0.04314 - space
prob: 0.03541 - smoke
prob: 0.03249 - sky
prob: 0.01971 - shuttle
prob: 0.01566 - tower
prob: 0.01551 - paris
prob: 0.01229 - cloud
prob: 0.01067 - pad
prob: 0.01050 - cape
prob: 0.00983 - falcon
prob: 0.00956 - photo
prob: 0.00834 - lift
prob: 0.00814 - air
prob: 0.00779 - mission
prob: 0.00710 - station
prob: 0.00688 - july
prob: 0.00647 - satellite


Attention map: decoder, layer 0, head 0

Top-20 predictions (image credit: footnote 3):
prob: 0.30731 - dog
prob: 0.13647 - sweater
prob: 0.11870 - hat
prob: 0.06812 - scarf
prob: 0.04131 - brick
prob: 0.03114 - wall
prob: 0.01796 - shirt
prob: 0.01471 - cute
prob: 0.01156 - cap
prob: 0.00982 - neck
prob: 0.00929 - top
prob: 0.00797 - head
prob: 0.00777 - beanie
prob: 0.00658 - man
prob: 0.00588 - sits
prob: 0.00582 - coat
prob: 0.00524 - jacket
prob: 0.00476 - collar
prob: 0.00460 - face
prob: 0.00119 - bone


Attention map: decoder, layer 0, head 25

Top-20 predictions (image credit: footnote 4):
prob: 0.14861 - coffee
prob: 0.10409 - shop
prob: 0.08065 - counter
prob: 0.04603 - bar
prob: 0.04055 - restaurant
prob: 0.03691 - inside
prob: 0.03468 - area
prob: 0.02638 - store
prob: 0.02219 - table
prob: 0.01930 - interior
prob: 0.01347 - lot
prob: 0.01156 - food
prob: 0.01058 - customer
prob: 0.01001 - room
prob: 0.00923 - starbucks
prob: 0.00853 - bakery
prob: 0.00738 - view
prob: 0.00738 - floor
prob: 0.00733 - cafe
prob: 0.00633 - shelf


Attention map: decoder, layer 0, head 8

Top-20 predictions (image credit: footnote 3):
prob: 0.47652 - monster
prob: 0.09664 - cartoon
prob: 0.03812 - character
prob: 0.03724 - group
prob: 0.03312 - creature
prob: 0.02111 - cute
prob: 0.01929 - vector
prob: 0.01481 - animal
prob: 0.00955 - art
prob: 0.00924 - alien
prob: 0.00837 - pose
prob: 0.00604 - bubble
prob: 0.00553 - eye
prob: 0.00533 - color
prob: 0.00528 - hand
prob: 0.00477 - design
prob: 0.00474 - wallpaper
prob: 0.00462 - child
prob: 0.00445 - people
prob: 0.00445 - family


Attention map: decoder, layer 2, head 7

Top-20 predictions (image credit: footnote 3):
prob: 0.54375 - cloud
prob: 0.09932 - word
prob: 0.07571 - sky
prob: 0.03153 - letter
prob: 0.01862 - sora
prob: 0.01380 - logo
prob: 0.00995 - text
prob: 0.00715 - top
prob: 0.00715 - blue
prob: 0.00677 - title
prob: 0.00608 - photo
prob: 0.00427 - picture
prob: 0.00288 - sonora
prob: 0.00269 - middle
prob: 0.00257 - storm
prob: 0.00202 - cloudscape
prob: 0.00190 - sun
prob: 0.00189 - art
prob: 0.00156 - soar
prob: 0.00041 - icy


Attention map: decoder, layer 1, head 13

Top-20 predictions (image credit: footnote 3):
prob: 0.15317 - building
prob: 0.13619 - wave
prob: 0.04782 - room
prob: 0.03498 - middle
prob: 0.03188 - hall
prob: 0.02367 - people
prob: 0.02135 - ocean
prob: 0.02087 - floor
prob: 0.01867 - world
prob: 0.01773 - inside
prob: 0.01548 - man
prob: 0.01380 - water
prob: 0.01205 - view
prob: 0.01200 - surfer
prob: 0.01109 - photo
prob: 0.00798 - hotel
prob: 0.00734 - city
prob: 0.00662 - pool
prob: 0.00566 - art
prob: 0.00319 - mural


Attention map: decoder, layer 1, head 16

Top-20 predictions (image credit: footnote 3):
prob: 0.25673 - bird
prob: 0.21676 - feather
prob: 0.18550 - peacock
prob: 0.04251 - head
prob: 0.03240 - blue
prob: 0.02507 - pigeon
prob: 0.02183 - tail
prob: 0.01339 - hair
prob: 0.01187 - top
prob: 0.00677 - face
prob: 0.00631 - camera
prob: 0.00463 - beak
prob: 0.00451 - eye
prob: 0.00419 - fence
prob: 0.00370 - sits
prob: 0.00333 - perch
prob: 0.00330 - photo
prob: 0.00318 - wall
prob: 0.00269 - animal
prob: 0.00106 - jay


Attention map: decoder, layer 1, head 25

Top-20 predictions (image credit: footnote 5):
prob: 0.07247 - tablet
prob: 0.06770 - coffee
prob: 0.06562 - window
prob: 0.05829 - controller
prob: 0.05668 - game
prob: 0.04802 - switch
prob: 0.04043 - wii
prob: 0.03798 - console
prob: 0.03563 - cup
prob: 0.02570 - top
prob: 0.02067 - mug
prob: 0.01808 - screen
prob: 0.01344 - video
prob: 0.01105 - star
prob: 0.01092 - nintendo
prob: 0.01055 - computer
prob: 0.00819 - mario
prob: 0.00815 - remote
prob: 0.00736 - control
prob: 0.00393 - sill


Attention map: decoder, layer 0, head 12

Top-20 predictions (image credit: footnote 6):
prob: 0.36523 - airplane
prob: 0.09151 - cargo
prob: 0.07531 - plane
prob: 0.05538 - ship
prob: 0.04223 - container
prob: 0.03105 - water
prob: 0.03040 - view
prob: 0.02277 - dock
prob: 0.01685 - port
prob: 0.01434 - sky
prob: 0.01328 - shipping
prob: 0.00788 - middle
prob: 0.00751 - body
prob: 0.00717 - photo
prob: 0.00715 - jet
prob: 0.00714 - city
prob: 0.00621 - ocean
prob: 0.00615 - freight
prob: 0.00609 - boat
prob: 0.00320 - transportation


Attention map: decoder, layer 2, head 14

Top-20 predictions (image credit: footnote 6):
prob: 0.15236 - candy
prob: 0.12271 - sweater
prob: 0.11457 - glass
prob: 0.10593 - dog
prob: 0.08311 - chair
prob: 0.07111 - cane
prob: 0.04701 - sunglass
prob: 0.04589 - christmas
prob: 0.02361 - costume
prob: 0.02085 - wearing
prob: 0.01870 - hat
prob: 0.00734 - head
prob: 0.00636 - top
prob: 0.00577 - outfit
prob: 0.00520 - chocolate
prob: 0.00437 - holi
prob: 0.00362 - suit
prob: 0.00344 - shirt
prob: 0.00322 - strawberry
prob: 0.00211 - wig


Attention map: decoder, layer 1, head 16

Top-20 predictions (image credit: footnote 6):
prob: 0.19960 - living
prob: 0.16291 - room
prob: 0.11353 - sofa
prob: 0.06036 - couch
prob: 0.04741 - rug
prob: 0.04704 - coffee
prob: 0.03795 - dog
prob: 0.03659 - wall
prob: 0.02980 - table
prob: 0.01611 - floor
prob: 0.01594 - grey
prob: 0.01472 - wood
prob: 0.01353 - furniture
prob: 0.01314 - plant
prob: 0.01274 - fireplace
prob: 0.01161 - pillow
prob: 0.00941 - chair
prob: 0.00512 - home
prob: 0.00434 - blanket
prob: 0.00351 - art


Attention map: decoder, layer 1, head 16

Models

The following table shows the reproduced results of recall (R column in Table 1 of the paper) on the validation splits with top-10 predictions.

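| # params | training group | checkpoint   | md5    | CC3M  | COCO  | OpenImages |
|----------|----------------|--------------|--------|-------|-------|------------|
| 1.78B    | G3M            | Hugging Face | b2a69b | 0.740 | 0.703 | 0.616      |
| 1.78B    | G70M           | Hugging Face | e177c7 | 0.721 | 0.765 | 0.662      |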

Downloading

The checkpoints can be downloaded from the links in the table above. For downloading from Hugging Face, one option is to use git-lfs:

# install git lfs
git lfs install

# download the checkpoint in terminal
git clone https://huggingface.co/kaiyuyue/nxtp

Also, the checkpoint can be downloaded from the model page in the web browser.
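After downloading, it can be useful to check the file against the md5 column of the table above. The sketch below assumes that the short values in that column (for example b2a69b) are the leading hex characters of the checkpoint file's MD5 digest; the README does not say so explicitly. The function names are hypothetical helpers, not part of this repository.

```python
import hashlib


def md5_hexdigest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large checkpoints do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_prefix(path, expected_prefix):
    """True if the file's MD5 digest starts with the given hex prefix.

    Assumption: the table lists a truncated digest, so only the prefix
    is compared.
    """
    return md5_hexdigest(path).startswith(expected_prefix.lower())
```

For example, `matches_md5_prefix("path/to/model/checkpoint", "b2a69b")` would be expected to return True for the G3M checkpoint if the assumption about the md5 column holds.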

Inference

The image assets/starbux.jpg is provided for a quick test. First, follow the instructions in Dependencies to prepare the environment.

To run inference on an image:

python src/infer.py \
  --ckpt-path path/to/model/checkpoint \
  --img-path assets/starbux.jpg \
  --num-labels 20

The output from the model trained on G3M is

top-20 predictions:
| prob: 0.05742 - coffee
| prob: 0.05525 - restaurant
| prob: 0.04402 - shop
| prob: 0.02528 - room
| prob: 0.02468 - store
| prob: 0.02381 - interior
| prob: 0.01732 - area
| prob: 0.01640 - building
| prob: 0.01616 - food
| prob: 0.01408 - bar
| prob: 0.01247 - customer
| prob: 0.01134 - view
| prob: 0.01059 - floor
| prob: 0.01045 - table
| prob: 0.00933 - kitchen
| prob: 0.00926 - home
| prob: 0.00872 - look
| prob: 0.00841 - people
| prob: 0.00693 - cup
| prob: 0.00665 - counter

The output from the model trained on G70M is

top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
| prob: 0.03848 - interior
| prob: 0.03389 - bar
| prob: 0.03215 - restaurant
| prob: 0.02440 - table
| prob: 0.02245 - store
| prob: 0.01950 - area
| prob: 0.01905 - inside
| prob: 0.01590 - starbucks
| prob: 0.01313 - cafe
| prob: 0.01220 - chair
| prob: 0.01172 - floor
| prob: 0.01020 - cup
| prob: 0.00879 - drink
| prob: 0.00794 - room
| prob: 0.00746 - customer
| prob: 0.00635 - wood
| prob: 0.00345 - bakery
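To use these predictions from another script, the printed lines can be parsed into (label, probability) pairs. The sketch below is a minimal example that assumes src/infer.py writes its predictions to standard output in exactly the format shown above; `build_command` and `parse_predictions` are hypothetical helpers, not functions provided by this repository. Only the three command-line flags shown in the inference command are used.

```python
import re

# Matches lines such as "| prob: 0.05742 - coffee".
PRED_LINE = re.compile(r"^\|\s*prob:\s*([0-9.]+)\s*-\s*(.+?)\s*$")


def build_command(ckpt_path, img_path, num_labels=20):
    """Build the inference command shown in this README as an argument list."""
    return [
        "python", "src/infer.py",
        "--ckpt-path", str(ckpt_path),
        "--img-path", str(img_path),
        "--num-labels", str(num_labels),
    ]


def parse_predictions(text):
    """Extract (label, probability) pairs from the printed predictions.

    Lines that do not look like "| prob: <float> - <label>" are skipped.
    """
    preds = []
    for line in text.splitlines():
        match = PRED_LINE.match(line.strip())
        if match:
            preds.append((match.group(2), float(match.group(1))))
    return preds
```

One way to combine the two helpers is to pass `build_command(...)` to `subprocess.run(..., capture_output=True, text=True)` and feed the captured `stdout` to `parse_predictions`, again assuming the output format above.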

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Footnotes

  1. Image credit: ゼルダの伝説 The Legend of Zelda: Tears of the Kingdom.

  2. Image credit: Space-X.

  3. Image credit: OpenAI Sora.

  4. Image credit: Photo taken by the author at a Starbucks store.

  5. Image credit: Super Mario Bros Wonder.

  6. Image credit: Demo in Segment Anything | Meta AI.
