Add library_name, update paper link and enhance model table

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +23 -14
README.md CHANGED
@@ -1,31 +1,33 @@
 ---
-license: apache-2.0
-language:
-- vi
 base_model:
 - vinai/phobert-base
+language:
+- vi
+license: apache-2.0
 pipeline_tag: feature-extraction
 tags:
 - bert
 - wsd
 - vietnamese
 - semantic_similarity
+library_name: transformers
 ---
+
 # ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

-[Paper](https://huggingface.co/tkhangg0910/viconbert-base)
+[Paper](https://huggingface.co/papers/2511.12249) - [Code](https://github.com/tkhangg0910/ViConBERT)

 This repository is the official implementation of the paper: ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

 ![](https://github.com/tkhangg0910/ViConBERT/blob/main/figs/architecture.jpg?raw=true)
 <p align="center"><em>Main architecture</em></p>

-* **Abstract:**
-Recent progress in contextualized word embeddings has significantly advanced tasks involving word semantics, such as Word Sense Disambiguation (WSD) and contextual semantic similarity. However, these developments have largely focused on high-resource languages like English, while low-resource languages such as Vietnamese remain underexplored. This paper introduces a novel training framework for Vietnamese contextualized word embeddings, which integrates contrastive learning (SimCLR) and distillation with the gloss embedding space to better model word meaning. Additionally, we introduce a new dataset specifically designed to evaluate semantic understanding tasks in Vietnamese, which we constructed as part of this work. Experimental results demonstrate that ViConBERT outperforms strong baselines on the WSD task (F1 = 0.87) and achieves competitive results on ViCon (AP = 0.88) and ViSim-400 (Spearman’s $\rho$ = 0.60), effectively modeling both binary and graded semantic relations in Vietnamese.
+* **Abstract:**
+Recent progress in contextualized word embeddings has significantly advanced tasks involving word semantics, such as Word Sense Disambiguation (WSD) and contextual semantic similarity. However, these developments have largely focused on high-resource languages like English, while low-resource languages such as Vietnamese remain underexplored. This paper introduces a novel training framework for Vietnamese contextualized word embeddings, which integrates contrastive learning (SimCLR) and distillation with the gloss embedding space to better model word meaning. Additionally, we introduce a new dataset specifically designed to evaluate semantic understanding tasks in Vietnamese, which we constructed as part of this work. Experimental results demonstrate that ViConBERT outperforms strong baselines on the WSD task (F1 = 0.87) and achieves competitive results on ViCon (AP = 0.88) and ViSim-400 (Spearman’s $\rho$ = 0.60), effectively modeling both binary and graded semantic relations in Vietnamese.

 ### Installation <a name="install2"></a>
-- Install `transformers` with pip: `pip install transformers`, or [install `transformers` from source](https://huggingface.co/docs/transformers/installation#installing-from-source). <br />
-Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. The process of merging a fast tokenizer for PhoBERT is in the discussion, as mentioned in [this pull request](https://github.com/huggingface/transformers/pull/17254#issuecomment-1133932067). If users would like to utilize the fast tokenizer, the users might install `transformers` as follows:
+- Install `transformers` with pip: `pip install transformers`, or [install `transformers` from source](https://huggingface.co/docs/transformers/installation#installing-from-source). <br />
+Note that a slow tokenizer for PhoBERT has been merged into the main `transformers` branch, while merging a fast tokenizer is still under discussion, as mentioned in [this pull request](https://github.com/huggingface/transformers/pull/17254#issuecomment-1133932067). Users who want the fast tokenizer can install `transformers` as follows:

 ```
 git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
@@ -33,7 +35,7 @@ cd transformers
 pip3 install -e .
 ```

-- Install others dependencies [`requirements`](https://github.com/tkhangg0910/ViConBERT/blob/main/requirements.txt) :
+- Install the other dependencies from [`requirements.txt`](https://github.com/tkhangg0910/ViConBERT/blob/main/requirements.txt):
 ```
 pip3 install -r requirements.txt
 ```
@@ -42,14 +44,14 @@ pip3 install -r requirements.txt
 ### ViConBERT models <a name="models2"></a>


-Model | #params | Arch. | Max length | Training data
----|---|---|---|---
-[`tkhangg0910/viconbert-base`](https://huggingface.co/tkhangg0910/viconbert-base) | 135M | base | 256 | [ViConWSD](https://huggingface.co/datasets/tkhangg0910/ViConWSD)
-[`tkhangg0910/viconbert-large`](https://huggingface.co/tkhangg0910/viconbert-large) | 370M | large | 256 | [ViConWSD](https://huggingface.co/datasets/tkhangg0910/ViConWSD)
+| Model | #params | Arch. | Max length | Backbone | Training data |
+|---|---|---|---|---|---|
+| [`tkhangg0910/viconbert-base`](https://huggingface.co/tkhangg0910/viconbert-base) | 135M | base | 256 | [PhoBERT-base](https://huggingface.co/vinai/phobert-base) | [ViConWSD](https://huggingface.co/datasets/tkhangg0910/ViConWSD) |
+| [`tkhangg0910/viconbert-large`](https://huggingface.co/tkhangg0910/viconbert-large) | 370M | large | 256 | [PhoBERT-large](https://huggingface.co/vinai/phobert-large) | [ViConWSD](https://huggingface.co/datasets/tkhangg0910/ViConWSD) |


 ### Example usage <a name="usage2"></a>
-SpanExtractor and text_normalize are implemented in [`code`](https://github.com/tkhangg0910/ViConBERT/tree/main/utils)
+`SpanExtractor` and `text_normalize` are implemented in the repository's [`utils`](https://github.com/tkhangg0910/ViConBERT/tree/main/utils) module.
 ```python
 import logging
 from typing import Optional, Tuple
@@ -127,3 +129,10 @@ print(f"Similarity between 2: {target_2} and 3:{target_3}: {sim_2:.4f}")
 <em>Contextual separation of "Khoan", "chạy", and zero-shot ability for unseen words</em>
 </p>

+
+
+## Citation
+If you find ViConBERT useful for your research and applications, please cite it using this BibTeX:
+
+## Acknowledgement
+[PhoBERT](https://github.com/VinAIResearch/PhoBERT): ViConBERT uses PhoBERT as its backbone model.
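Since the card now declares `library_name: transformers`, the checkpoint should be loadable through the standard Auto classes. Below is a minimal, unofficial sketch of that path. It assumes the checkpoint exposes a plain encoder via `AutoModel` and mean-pools sentence embeddings; the README's full example instead builds target-word embeddings with the repository's `SpanExtractor` and `text_normalize` utilities, which are elided from this diff.

```python
# Minimal feature-extraction sketch (assumption: the checkpoint loads as a
# plain encoder via AutoModel; the official example instead uses the repo's
# SpanExtractor and text_normalize utilities, not shown in this diff).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "tkhangg0910/viconbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# PhoBERT-based models expect word-segmented Vietnamese input (e.g. via
# VnCoreNLP), so these pre-segmented strings are illustrative only.
sentences = [
    "Tôi chạy bộ mỗi sáng .",           # "chạy" = to run (physically)
    "Chương_trình chạy rất ổn_định .",  # "chạy" = to run (software)
]

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=256, return_tensors="pt")
    hidden = model(**enc).last_hidden_state       # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)    # (batch, seq_len, 1)
    # Mean-pool over non-padding tokens to get one vector per sentence.
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"Cosine similarity between the two contexts of 'chạy': {sim.item():.4f}")
```

For sense-level rather than sentence-level vectors, pool only the target word's span, as the README's full example does with `SpanExtractor`.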