Nemotron VL 8B - Fine-tuned for Section OCR (English & Kannada)

This model is a fine-tuned version of nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1 on the SectionOCR-SFT-augment dataset for multilingual section-level OCR tasks.

Model Description

  • Base Model: Llama-3.1-Nemotron-Nano-VL-8B-V1
  • Architecture: Llama-3.1-8B-Instruct + C-RADIOv2-H vision encoder
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Task: Section-level Optical Character Recognition (OCR)
  • Dataset: Nayana-cognitivelab/SectionOCR-SFT-augment (en-en and kn-kn subsets)
  • Languages: English and Kannada

Capabilities

This model specializes in accurate text extraction from document images, with particular strengths in:

  • Multilingual OCR: High accuracy for both English and Kannada text
  • Section-level extraction: Extracting text from specific document sections
  • Complex layouts: Handling documents with mixed text, tables, and formatting
  • Document understanding: Preserving document structure during extraction

Usage

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import torch

# Load model and processors (all three come from the same repository)
model_id = "Nayana-cognitivelab/Llama_Nemotron_SectionOCR_SFT_En_Kn_15000"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,  # the model class is defined by custom code in the repo
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
image = Image.open("document.jpg").convert("RGB")  # normalize grayscale/RGBA scans to RGB
system_prompt = """You are Nayana, an advanced AI assistant developed by CognitiveLab.
You specialize in vision-based tasks, particularly Optical Character Recognition (OCR)
and Document Visual Question Answering (Document VQA). You are highly accurate, fast,
and reliable when working with complex visual documents. Most importantly, you are
multilingual, capable of understanding and processing documents in a wide range of
languages with precision."""

prompt = "Accurately extract all text and output it"

# Build the chat messages (system prompt, then image plus instruction)
messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt}
    ]}
]

# Render the chat template to a prompt string (generation itself follows below)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ... (continue with generation as in examples.py from the base model)

Training Details

Training Hyperparameters

  • LoRA Configuration:

    • r: 8
    • alpha: 16
    • dropout: 0.05
    • target_modules: all-linear
    • bias: none
  • Training Configuration:

    • Optimizer: AdamW (fused)
    • Learning rate: 3e-4
    • LR scheduler: cosine (with warmup)
    • Warmup ratio: 0.03
    • Batch size: 1 (per device)
    • Gradient accumulation: 1
    • Epochs: 1
    • Precision: bfloat16
    • Max sequence length: 4096
  • Memory Efficiency:

    • Vision encoder: Frozen
    • Vision projector: Trainable
    • Language model: LoRA fine-tuned
    • Trainable parameters: < 5% of total
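
As a rough sanity check on the adapter size implied by the LoRA settings above, the sketch below counts the LoRA weights added to the language model. It is an estimate under stated assumptions, not the authors' accounting: the layer shapes are taken from the public Llama-3.1-8B configuration, adapters are assumed to attach only to the decoder's attention and MLP projections, and the trainable vision projector is not counted. In PEFT terms the listed settings correspond to the `LoraConfig` fields `r`, `lora_alpha`, `lora_dropout`, `target_modules`, and `bias`.

```python
# Linear layers of one decoder block as (in_features, out_features).
# ASSUMPTION: shapes follow the public Llama-3.1-8B configuration
# (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128).
DECODER_LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_DECODER_LAYERS = 32  # ASSUMPTION: Llama-3.1-8B depth

LORA_RANK = 8    # "r" in the list above
LORA_ALPHA = 16  # "alpha" in the list above


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


def count_decoder_lora_params(rank: int = LORA_RANK) -> int:
    """Total LoRA weights across all decoder blocks under the assumptions above."""
    per_layer = sum(
        lora_params(n_in, n_out, rank)
        for n_in, n_out in DECODER_LINEAR_SHAPES.values()
    )
    return per_layer * NUM_DECODER_LAYERS


# LoRA scales the low-rank update by alpha / r before adding it to the frozen weight.
lora_scaling = LORA_ALPHA / LORA_RANK

if __name__ == "__main__":
    print(f"LoRA scaling factor: {lora_scaling}")
    print(f"Estimated decoder LoRA parameters: {count_decoder_lora_params():,}")
```

Under these assumptions the decoder adapters come to roughly 21 million weights, which is consistent with the stated bound of under 5% trainable parameters even after the projector is added.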

Dataset

Fine-tuned on the Nayana-cognitivelab/SectionOCR-SFT-augment dataset:

  • en-en subset: 30,000 samples (English documents)
  • kn-kn subset: 100,000 samples (Kannada documents)
  • Total: 130,000 training samples

Each sample pairs a document image with its reference OCR text, which serves as the training target for section-level extraction.

Evaluation

The model is evaluated using standard OCR metrics:

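  • Character Error Rate (CER): Character-level edit distance between prediction and reference, divided by the reference length (lower is better)
  • Word Error Rate (WER): The same measure computed over words instead of characters (lower is better)
  • Exact Match Accuracy: Percentage of extractions that match the reference exactly (higher is better)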
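
The three metrics above can be sketched in plain Python as follows. This is a minimal illustration, not the evaluation script used for this model: the function names are hypothetical, no text normalization is applied beyond stripping whitespace for exact match, and characters are counted as Unicode code points. For Kannada, where one visible character can span several code points, a grapheme-level CER may differ from the value computed here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def exact_match_accuracy(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference after stripping."""
    pairs = list(zip(references, hypotheses))
    if not pairs:
        return 0.0
    return sum(r.strip() == h.strip() for r, h in pairs) / len(pairs)
```

Both error rates can exceed 1.0 when the prediction is much longer than the reference, so they are better read as normalized edit counts than as percentages.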

Limitations

  • Inherits limitations from the base Nemotron VL model
  • Optimized for Section OCR tasks; may not perform as well on other vision-language tasks
  • Performance depends on document image quality and clarity
  • Currently trained on English and Kannada; performance on other languages may vary

Citation

Base Model:

@software{nemotron-vl,
  title = {Llama-3.1-Nemotron-Nano-VL-8B-V1},
  author = {NVIDIA},
  year = {2024},
  url = {https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1}
}

Dataset:

@dataset{sectionocr2024,
  title = {SectionOCR-SFT-augment},
  author = {Nayana CognitiveLab},
  year = {2024},
  url = {https://huggingface.co/datasets/Nayana-cognitivelab/SectionOCR-SFT-augment}
}

License

This model inherits the license of the base model: NVIDIA Open Model License.

Acknowledgments

  • Base model by NVIDIA
  • Fine-tuning framework: Hugging Face Transformers + PEFT
  • Training infrastructure: Modal Labs