Nemotron VL 8B - Fine-tuned for Section OCR (English & Kannada)

This model is a fine-tuned version of nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1 on the SectionOCR-SFT-augment dataset for multilingual section-level OCR tasks.

Model Description

Base Model: Llama-3.1-Nemotron-Nano-VL-8B-V1
Architecture: Llama-3.1-8B-Instruct + C-RADIOv2-H vision encoder
Fine-tuning Method: LoRA (Low-Rank Adaptation)
Task: Section-level Optical Character Recognition (OCR)
Dataset: Nayana-cognitivelab/SectionOCR-SFT-augment (en-en and kn-kn subsets)
Languages: English and Kannada

Capabilities

This model specializes in accurate text extraction from document images, with particular strengths in:

Multilingual OCR: Text extraction for both English and Kannada documents
Section-level extraction: Extracting text from specific document sections
Complex layouts: Handling documents with mixed text, tables, and formatting
Document understanding: Preserving document structure during extraction

Usage

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import torch

# Repository ID, defined once and reused for every component
MODEL_ID = "Nayana-cognitivelab/Llama_Nemotron_SectionOCR_SFT_En_Kn_15000"

# Load model and processors (trust_remote_code runs the repository's custom model code)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
image_processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Prepare input (convert to RGB so grayscale or RGBA scans have three channels)
image = Image.open("document.jpg").convert("RGB")
system_prompt = """You are Nayana, an advanced AI assistant developed by CognitiveLab.
You specialize in vision-based tasks, particularly Optical Character Recognition (OCR)
and Document Visual Question Answering (Document VQA). You are highly accurate, fast,
and reliable when working with complex visual documents. Most importantly, you are
multilingual, capable of understanding and processing documents in a wide range of
languages with precision."""

prompt = "Accurately extract all text and output it"

# Process
messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt}
    ]}
]

# Generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ... (continue with generation as in examples.py from the base model)

Training Details

Training Hyperparameters

LoRA Configuration:
- r: 8
- alpha: 16
- dropout: 0.05
- target_modules: all-linear
- bias: none

Training Configuration:
- Optimizer: AdamW (fused)
- Learning rate: 3e-4
- LR scheduler: cosine (with warmup)
- Warmup ratio: 0.03
- Batch size: 1 (per device)
- Gradient accumulation: 1
- Epochs: 1
- Precision: bfloat16
- Max sequence length: 4096

Memory Efficiency:
- Vision encoder: Frozen
- Vision projector: Trainable
- Language model: LoRA fine-tuned
- Trainable parameters: < 5% of total
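
To make the LoRA settings above concrete, the sketch below counts the parameters that LoRA adds to one linear layer and computes the scaling applied to the low-rank update. The 4096 x 4096 layer shape is an assumption chosen for illustration, not a figure from this card, and both function names are hypothetical.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the update B @ A in standard LoRA (alpha / r)."""
    return alpha / r


# Values taken from the LoRA configuration above.
R, ALPHA = 8, 16

# Assumed layer shape, for illustration only.
D_IN, D_OUT = 4096, 4096

added = lora_added_params(D_IN, D_OUT, R)
frozen = D_IN * D_OUT  # bias: none, so only the weight matrix is counted

print(f"added parameters per layer: {added}")
print(f"fraction of the frozen weight: {added / frozen:.4%}")
print(f"scaling alpha / r: {lora_scaling(ALPHA, R)}")
```

Under this assumed shape the adapter adds roughly 0.4% of the layer's weights. The overall trainable fraction also depends on the trainable vision projector, which this sketch does not count.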

Dataset

Fine-tuned on the Nayana-cognitivelab/SectionOCR-SFT-augment dataset:

en-en subset: 30,000 samples (English documents)
kn-kn subset: 100,000 samples (Kannada documents)
Total: 130,000 training samples

The dataset contains document images paired with accurate OCR text extractions for training the model to perform section-level OCR.
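
Combining the sample counts above with the training configuration gives a rough picture of the learning-rate schedule. The sketch below is a minimal cosine schedule with linear warmup. It assumes a single device, that every listed sample is seen exactly once, and that the learning rate decays to zero; the card states none of these. `lr_at_step` is a hypothetical helper, not the trainer's actual implementation.

```python
import math

# Values from the dataset and training sections of this card.
EN_SAMPLES, KN_SAMPLES = 30_000, 100_000
PEAK_LR, WARMUP_RATIO = 3e-4, 0.03
BATCH_SIZE, GRAD_ACCUM, EPOCHS = 1, 1, 1

# Assumption: one device, so each sample is one optimizer step.
NUM_DEVICES = 1

total_samples = EN_SAMPLES + KN_SAMPLES
total_steps = total_samples * EPOCHS // (BATCH_SIZE * GRAD_ACCUM * NUM_DEVICES)
warmup_steps = round(WARMUP_RATIO * total_steps)


def lr_at_step(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at_step(step))
```

With more devices or a subset of the data, `total_steps` and `warmup_steps` shrink proportionally, so the numbers printed here should be read as an upper bound under the stated assumptions.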

Evaluation

The model is evaluated using standard OCR metrics:

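Character Error Rate (CER): Character-level edit distance between prediction and reference, divided by the reference length (lower is better)
Word Error Rate (WER): The same ratio computed over words instead of characters (lower is better)
Exact Match Accuracy: Percentage of predictions identical to the reference text (higher is better)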
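
The three metrics above can be sketched in plain Python as follows. This is a minimal illustration, not the evaluation script used for this model: the function names are hypothetical, and the NFC normalization and whitespace collapsing are assumed preprocessing choices.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace (assumed preprocessing)."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def cer(reference: str, hypothesis: str) -> float:
    """Character edits divided by the reference length in code points."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    return edit_distance(ref, hyp) / max(1, len(ref))


def wer(reference: str, hypothesis: str) -> float:
    """Word edits divided by the number of reference words."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    return edit_distance(ref, hyp) / max(1, len(ref))


def exact_match(references, hypotheses) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    pairs = list(zip(references, hypotheses))
    hits = sum(normalize(r) == normalize(h) for r, h in pairs)
    return hits / max(1, len(pairs))


print(cer("kitten", "sitting"))
print(wer("the quick brown fox", "the quick fox"))
print(exact_match(["a b", "c"], ["a  b", "d"]))
```

Note that `cer` counts Unicode code points, so a Kannada syllable written with several code points contributes more than one unit; a grapheme-cluster segmentation would give different values.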

Limitations

Inherits limitations from the base Nemotron VL model
Optimized for Section OCR tasks; may not perform as well on other vision-language tasks
Performance depends on document image quality and clarity
Currently trained on English and Kannada; performance on other languages may vary

Citation

Base Model:

@software{nemotron-vl,
  title = {Llama-3.1-Nemotron-Nano-VL-8B-V1},
  author = {NVIDIA},
  year = {2025},
  url = {https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1}
}

Dataset:

@dataset{sectionocr2024,
  title = {SectionOCR-SFT-augment},
  author = {Nayana CognitiveLab},
  year = {2024},
  url = {https://huggingface.co/datasets/Nayana-cognitivelab/SectionOCR-SFT-augment}
}

License

This model inherits the license from the base model: NVIDIA Open Model License

Acknowledgments

Base model by NVIDIA
Fine-tuning framework: Hugging Face Transformers + PEFT
Training infrastructure: Modal Labs
