---
library_name: keras-hub
---
### Model Overview

Vision Transformer (ViT) adapts the Transformer architecture, originally designed for natural language processing, to computer vision. It treats an image as a sequence of fixed-size patches, much as Transformers treat a sentence as a sequence of words. The architecture was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929).
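
For a concrete sense of the patch-sequence framing, here is a small illustrative calculation (the numbers follow the 16x16-patch, 224x224 configuration from the paper title; this is plain arithmetic, not KerasHub API):

```python
# Illustrative only: how an image becomes a token sequence in ViT.
image_size = 224
patch_size = 16
num_patches = (image_size // patch_size) ** 2  # (224 / 16)^2 = 196 patch tokens
sequence_length = num_patches + 1              # +1 for the learnable class token
print(num_patches, sequence_length)            # 196 197
```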

## Links

* [ViT Quickstart Notebook](https://www.kaggle.com/code/sineeli/vit-quickstart)
* [ViT API Documentation](https://keras.io/keras_hub/api/models/vit/)
* [ViT Model Card](https://huggingface.co/google/vit-base-patch16-224)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```shell
pip install -U -q keras-hub
pip install -U -q keras
```
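
Keras 3 runs on JAX, TensorFlow, or PyTorch. If you want a specific backend, one option (per the KerasHub getting started guide linked above) is to set the `KERAS_BACKEND` environment variable before importing Keras; the choice of `"jax"` below is just an example:

```python
import os

# Select the Keras backend before importing keras / keras_hub.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow", "torch"

import keras
import keras_hub

print(keras.__version__)
```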

## Presets

| Model ID | Image Size | Top-1 Acc | Top-5 Acc | Parameters |
| :-- | :--: | :--: | :--: | :--: |
| **Base** | | | | |
| vit_base_patch16_224_imagenet | 224 | - | - | 85798656 |
| vit_base_patch16_224_imagenet21k | 224 | - | - | 85798656 |
| vit_base_patch16_384_imagenet | 384 | - | - | 86090496 |
| vit_base_patch32_224_imagenet21k | 224 | - | - | 87455232 |
| vit_base_patch32_384_imagenet | 384 | - | - | 87528192 |
| **Large** | | | | |
| vit_large_patch16_224_imagenet | 224 | - | - | 303301632 |
| vit_large_patch16_224_imagenet21k | 224 | - | - | 303301632 |
| vit_large_patch16_384_imagenet | 384 | - | - | 303690752 |
| vit_large_patch32_224_imagenet21k | 224 | - | - | 305510400 |
| vit_large_patch32_384_imagenet | 384 | - | - | 305607680 |
| **Huge** | | | | |
| vit_huge_patch14_224_imagenet21k | 224 | - | - | 630764800 |
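
The parameter counts above can be checked directly from a loaded backbone; a small sketch (downloading the preset weights requires network access):

```python
import keras_hub

backbone = keras_hub.models.Backbone.from_preset("vit_base_patch16_224_imagenet")
print(backbone.count_params())  # expected to match the table: 85798656
```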

## Example Usage

### Pretrained ViT model

```python
import numpy as np
import keras_hub

# Load an ImageNet-pretrained ViT classifier from a KerasHub preset.
image_classifier = keras_hub.models.ImageClassifier.from_preset(
    "vit_base_patch32_384_imagenet"
)

# Calling the model directly bypasses the preprocessor, so this preset
# expects 384x384 RGB inputs here.
input_data = np.random.uniform(0, 1, size=(2, 384, 384, 3))
image_classifier(input_data)
```
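
For real images, `predict` is usually more convenient than calling the model directly, since it routes inputs through the attached preprocessor (which handles resizing). A minimal sketch, assuming the head outputs one score per ImageNet class:

```python
# `predict` applies the preprocessor, so images need not already be 384x384.
images = np.random.uniform(0, 1, size=(2, 224, 224, 3))
scores = image_classifier.predict(images)
print(scores.argmax(axis=-1))  # ImageNet class indices, one per image
```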

### Load backbone weights and fine-tune on a custom dataset

```python
# Load only the pretrained backbone and its matching preprocessor,
# then build a classifier head sized for your own dataset.
backbone = keras_hub.models.Backbone.from_preset(
    "vit_base_patch32_384_imagenet"
)
preprocessor = keras_hub.models.ViTImageClassifierPreprocessor.from_preset(
    "vit_base_patch32_384_imagenet"
)
model = keras_hub.models.ViTImageClassifier(
    backbone=backbone,
    num_classes=len(CLASSES),  # CLASSES: the list of labels in your dataset
    preprocessor=preprocessor,
)
```
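
From here the model trains like any other Keras model. A minimal fine-tuning sketch, assuming `train_ds` and `val_ds` are your own datasets of (image, integer label) batches and that the classifier head outputs logits (the default when no activation is set):

```python
import keras

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=3)
```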

## Example Usage with Hugging Face URI

### Pretrained ViT model

```python
# The same presets can be loaded directly from the Hugging Face Hub via hf:// URIs.
image_classifier = keras_hub.models.ImageClassifier.from_preset(
    "hf://keras/vit_base_patch32_384_imagenet"
)

input_data = np.random.uniform(0, 1, size=(2, 384, 384, 3))
image_classifier(input_data)
```

### Load backbone weights and fine-tune on a custom dataset

```python
backbone = keras_hub.models.Backbone.from_preset(
    "hf://keras/vit_base_patch32_384_imagenet"
)
preprocessor = keras_hub.models.ViTImageClassifierPreprocessor.from_preset(
    "hf://keras/vit_base_patch32_384_imagenet"
)
model = keras_hub.models.ViTImageClassifier(
    backbone=backbone,
    num_classes=len(CLASSES),  # CLASSES: the list of labels in your dataset
    preprocessor=preprocessor,
)
```
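
A fine-tuned model can also be published back to the Hugging Face Hub. This is a rough sketch following the KerasHub model publishing guide linked above; the username and local directory are placeholders:

```python
# Save the task model as a local preset directory, then upload it.
model.save_to_preset("./vit_finetuned")
keras_hub.upload_preset("hf://your-username/vit_finetuned", "./vit_finetuned")
```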