Qwen2.5-7B-Instruct Vietnamese Text-to-Cypher

This model is a fine-tuned version of Qwen/Qwen2.5-7B-Instruct for English-to-Vietnamese translation, focused on the natural-language question descriptions used in text-to-Cypher tasks.

Model Details

  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Fine-tuning Method: LoRA (Low-Rank Adaptation) with RSLoRA
  • Training Framework: TRL SFTTrainer
  • Task: English to Vietnamese translation for database query descriptions
  • Language: English → Vietnamese
  • Model Type: Causal Language Model

Training Details

Training Data

  • Training Samples: 4,559 samples
  • Validation Samples: 1,140 samples
  • Data Format: English database queries with Vietnamese translations
  • Domain: Database query descriptions and Cypher query language
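
For illustration, a single training record in chat format might look like the sketch below. The actual dataset schema is not published, so the field names are assumptions; the system prompt and the EN/VI pair are taken from elsewhere in this card.

# Hypothetical training record (schema assumed; pair taken from the Training Examples section)
sample = {
    "messages": [
        {
            "role": "system",
            "content": "You are a professional translator. Only return the Vietnamese translation of the following question. Keep technical keywords and proper names unchanged."
        },
        {"role": "user", "content": "How many companies are there in total?"},
        {"role": "assistant", "content": "Tổng số công ty là bao nhiêu?"}
    ]
}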

Training Configuration

  • LoRA Rank: 16
  • LoRA Alpha: 32
  • LoRA Dropout: 0.1
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Use RSLoRA: True (Rank-Stabilized LoRA)
  • Quantization: 4-bit with BitsAndBytesConfig
  • Epochs: 3
  • Batch Size: 2 per device
  • Gradient Accumulation: 4 steps
  • Learning Rate: 3e-4
  • Optimizer: AdamW with weight decay 0.01
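
For reference, the configuration listed above maps roughly onto the following peft / bitsandbytes / TRL setup. This is a sketch rather than the exact training script: the 4-bit quantization type, the compute dtype, and the use of SFTConfig are assumptions.

from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig
import torch

# 4-bit quantization of the base model (quant type and compute dtype are assumptions)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings matching the values listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,
    task_type="CAUSAL_LM",
)

# Trainer hyperparameters matching the values listed above
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=3e-4,
    weight_decay=0.01,
    optim="adamw_torch",
)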

Training Results

  • Final Training Loss: 0.027
  • Final Validation Loss: 0.045
  • Token Accuracy: 98.7%
  • Translation Accuracy: 100% on test samples

Usage

Installation

pip install transformers torch peft accelerate

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model = "Qwen/Qwen2.5-7B-Instruct"
model_path = "hoadm-lab/qwen2.5-7b-instruct-vitext2cypher"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load LoRA weights
model = PeftModel.from_pretrained(model, model_path)
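
If you run many translations, you can optionally merge the LoRA weights into the base model to remove the adapter overhead. This uses standard peft functionality and works with the bfloat16 loading shown above:

# Optional: fold the adapter into the base weights for slightly faster inference
model = model.merge_and_unload()
model.eval()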

Translation Example

def translate_text(text: str) -> str:
    messages = [
        {
            "role": "system", 
            "content": "You are a professional translator. Only return the Vietnamese translation of the following question. Keep technical keywords and proper names unchanged."
        },
        {
            "role": "user", 
            "content": text
        }
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.3,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1,
        )
    
    # Decode only the newly generated tokens (everything after the prompt)
    generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

# Example usage
english_text = "Find the top 5 suppliers with the highest average unit price."
vietnamese_translation = translate_text(english_text)
print(vietnamese_translation)
# Output: "Tìm 5 nhà cung cấp hàng đầu có giá đơn vị trung bình cao nhất."

Training Examples

Example 1

  • EN: "Identify the 5 suppliers with the highest average unit price of products."
  • VI: "Xác định 5 nhà cung cấp có giá đơn vị trung bình của sản phẩm cao nhất."

Example 2

  • EN: "What are the names of technicians who have not been assigned machine repair tasks?"
  • VI: "Tên của những kỹ thuật viên chưa được giao nhiệm vụ sửa máy là gì?"

Example 3

  • EN: "How many companies are there in total?"
  • VI: "Tổng số công ty là bao nhiêu?"

Model Performance

The model achieves excellent translation quality with:

  • 100% accuracy on test samples
  • Natural Vietnamese output
  • Preservation of technical terms
  • Consistent formatting

Technical Specifications

  • Parameters: ~7B (base) + LoRA adapters
  • Memory Usage: ~13GB VRAM (4-bit quantization)
  • Inference Speed: ~2-3 seconds per translation on RTX 3090
  • Max Context Length: 2048 tokens
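
The VRAM figure above assumes loading the base model in 4-bit. Below is a sketch of 4-bit loading with bitsandbytes before attaching the adapter (requires pip install bitsandbytes; the quantization type and compute dtype are assumptions):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, "hoadm-lab/qwen2.5-7b-instruct-vitext2cypher")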

Limitations

  • Specialized for database query translation domain
  • May not generalize well to other translation tasks
  • Requires sufficient GPU memory for inference
  • Vietnamese translations may vary in style

Citation

If you use this model in your research, please cite:

@misc{qwen2.5-vitext2cypher,
  title={Qwen2.5-7B-Instruct Vietnamese Text-to-Cypher Fine-tuned Model},
  author={hoadm-lab},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/hoadm-lab/qwen2.5-7b-instruct-vitext2cypher}
}

License

Apache 2.0 (same as base model)
