Rotated Positional Embedding for Object Detection in Latent Space
The initial positional embeddings are rotated to align with the latent coordinates of the tagged objects, which positions them in proximity to the corresponding object in the image.
The model is built on a multimodal model; the image is encoded with Wan2.1.
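To make the idea of rotating a positional embedding toward a latent coordinate concrete, here is a minimal sketch. It assumes a RoPE-style scheme in which consecutive pairs of embedding values are rotated by angles derived from a 2D latent coordinate; the actual rotation used by this model is not documented here and may differ. The function name `rotate_embedding`, the `base` parameter, and the split of pairs between the x and y axes are all hypothetical choices made for illustration.

```python
import math


def rotate_embedding(embedding, x, y, base=10000.0):
    """Rotate consecutive value pairs of an embedding by coordinate-dependent angles.

    Hypothetical helper (not the model's code): the first half of the pairs is
    rotated according to the x coordinate, the second half according to y.
    """
    if len(embedding) % 4 != 0:
        raise ValueError("embedding length must be a multiple of 4")

    n_pairs = len(embedding) // 2
    half = n_pairs // 2
    rotated = []
    for i in range(n_pairs):
        # Pick the axis and the index of this pair within its axis group.
        coord = x if i < half else y
        k = i if i < half else i - half
        # Assumed frequency schedule: lower frequencies for later pairs.
        freq = base ** (-k / half)
        angle = coord * freq
        a, b = embedding[2 * i], embedding[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        # Standard 2D rotation of the pair (a, b) by `angle`.
        rotated.extend([a * c - b * s, a * s + b * c])
    return rotated


# Usage: rotate a toy 4-value embedding toward the latent coordinate (0.5, 1.25).
aligned = rotate_embedding([1.0, 0.0, 1.0, 0.0], x=0.5, y=1.25)
```

Because each pair undergoes a pure rotation, the norm of the embedding is unchanged; only its orientation encodes the coordinate.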
Categories:
- [1] hat
- [2] hair
- [3] sunglasses
- [4] shirt
- [5] skirt
- [6] pants
- [7] dress
- [8] belt
- [9] shoes
- [11] face
- [12] legs
- [14] arms
- [16] bag
- [17] scarf
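
The category list above can be kept as a lookup table when post-processing predictions. The sketch below assumes the model reports categories as the integer IDs shown in brackets; the names `CATEGORIES` and `category_name` are hypothetical. Note that the IDs are not contiguous (10, 13, and 15 do not appear in the list), so a dictionary is safer than a positional list.

```python
# ID-to-label mapping copied from the category list; missing IDs are left out on purpose.
CATEGORIES = {
    1: "hat",
    2: "hair",
    3: "sunglasses",
    4: "shirt",
    5: "skirt",
    6: "pants",
    7: "dress",
    8: "belt",
    9: "shoes",
    11: "face",
    12: "legs",
    14: "arms",
    16: "bag",
    17: "scarf",
}


def category_name(category_id):
    """Return the label for a category ID, or None if the ID is not listed."""
    return CATEGORIES.get(category_id)
```
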
Disclaimer
The documentation and the model require citation and attribution to the author via a link to their Hugging Face profile.

