Vision Transformer (ViT) - Using Transformers for Image Classification


Farry July 20, 2021

The HuggingFace Vision Transformer (ViT) model used here is pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224 and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at the same resolution. ViT was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. and first released in the authors' GitHub repository.

The Vision Transformer (ViT) is a BERT-like transformer encoder pre-trained in a supervised fashion on a large collection of images, namely ImageNet-21k, at a resolution of 224x224 pixels. The model was then fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset of 1 million images and 1,000 classes, also at 224x224. Each image is presented to the model as a sequence of fixed-size 16x16 patches, and each patch is linearly embedded. At 224x224 this gives (224 / 16)^2 = 14 x 14 = 196 patch tokens. A [CLS] token is prepended to the sequence for classification, and absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
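To make the input pipeline described above concrete, here is a minimal NumPy sketch (not the HuggingFace implementation) of how a 224x224 RGB image becomes the encoder's input sequence: split into 16x16 patches, linearly projected, a [CLS] token prepended, and absolute position embeddings added. The helper names `patchify` and `embed`, the hidden size of 768 (the ViT-Base width), and the randomly initialised projection, [CLS] and position parameters are illustrative assumptions; in the real model these parameters are learned during pre-training.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image size must be divisible by patch size"
    n_h, n_w = h // patch_size, w // patch_size
    # (H, W, C) -> (n_h, p, n_w, p, C) -> (n_h, n_w, p, p, C) -> (n_h * n_w, p * p * C)
    patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(n_h * n_w, patch_size * patch_size * c)

def embed(image, hidden_dim=768, patch_size=16, seed=0):
    """Build the token sequence fed to the encoder (illustrative, random weights)."""
    rng = np.random.default_rng(seed)
    patches = patchify(image, patch_size)          # (196, 768) for a 224x224x3 image
    num_patches, patch_dim = patches.shape
    # Hypothetical random parameters; a trained ViT learns all three of these.
    w_proj = rng.normal(0.0, 0.02, size=(patch_dim, hidden_dim))
    cls_token = rng.normal(0.0, 0.02, size=(1, hidden_dim))
    pos_embed = rng.normal(0.0, 0.02, size=(num_patches + 1, hidden_dim))
    tokens = patches @ w_proj                      # linear embedding of each patch
    tokens = np.concatenate([cls_token, tokens])   # prepend the [CLS] token
    return tokens + pos_embed                      # add absolute position embeddings

image = np.random.default_rng(1).random((224, 224, 3))
seq = embed(image)
print(seq.shape)  # (197, 768)
```

With a 224x224 input and 16x16 patches, the sequence holds 196 patch tokens plus the [CLS] token, hence its length of 197; the encoder's output at the [CLS] position is what the classification head reads.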

Link to the notebook: https://github.com/bhattbhavesh91/hugging-face-vision-transformer-tutorial/blob/main/hugging-face-vit-notebook.ipynb

You can find me on:
Blog - https://bhattbhavesh91.github.io
Twitter - https://twitter.com/_bhaveshbhatt
GitHub - https://github.com/bhattbhavesh91
Medium - https://medium.com/@bhattbhavesh91
About.me - https://about.me/bhattbhavesh91
Linktree - https://linktr.ee/bhattbhavesh91
DEV Community - https://dev.to/bhattbhavesh91