Nemotron VL 8B - Fine-tuned for Section OCR (English & Kannada)
This model is a fine-tuned version of nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1 on the SectionOCR-SFT-augment dataset for multilingual section-level OCR tasks.
Model Description
- Base Model: Llama-3.1-Nemotron-Nano-VL-8B-V1
 - Architecture: Llama-3.1-8B-Instruct + C-RADIOv2-H vision encoder
 - Fine-tuning Method: LoRA (Low-Rank Adaptation)
 - Task: Section-level Optical Character Recognition (OCR)
 - Dataset: Nayana-cognitivelab/SectionOCR-SFT-augment (en-en and kn-kn subsets)
 - Languages: English and Kannada
 
Capabilities
This model specializes in accurate text extraction from document images, with particular strengths in:
- Multilingual OCR: High accuracy for both English and Kannada text
 - Section-level extraction: Extracting text from specific document sections
 - Complex layouts: Handling documents with mixed text, tables, and formatting
 - Document understanding: Preserving document structure during extraction
 
Usage
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import torch
# Load the model, tokenizer, and image processor from the same checkpoint
model_id = "Nayana-cognitivelab/Llama_Nemotron_SectionOCR_SFT_En_Kn_15000"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
# Prepare input (convert to RGB so grayscale or RGBA scans are handled uniformly)
image = Image.open("document.jpg").convert("RGB")
system_prompt = """You are Nayana, an advanced AI assistant developed by CognitiveLab.
You specialize in vision-based tasks, particularly Optical Character Recognition (OCR)
and Document Visual Question Answering (Document VQA). You are highly accurate, fast,
and reliable when working with complex visual documents. Most importantly, you are
multilingual, capable of understanding and processing documents in a wide range of
languages with precision."""
prompt = "Accurately extract all text and output it"
# Process
messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt}
    ]}
]
# Build the prompt string from the chat messages
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ... (continue with generation as in examples.py from the base model)
Training Details
Training Hyperparameters
LoRA Configuration:
- r: 8
 - alpha: 16
 - dropout: 0.05
 - target_modules: all-linear
 - bias: none
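
To make the r and alpha settings above concrete, here is a minimal pure-Python sketch of how a LoRA-adapted linear layer combines the frozen weight with the low-rank update, scaled by alpha / r (16 / 8 = 2 here). The function names, toy dimensions, and matrix values are illustrative assumptions, not the training code used for this model, and dropout is omitted.

```python
# Illustrative sketch only: toy matrices, not the actual training code.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w0, lora_a, lora_b, r=8, alpha=16):
    """Return W0 x + (alpha / r) * B (A x) for one input vector x."""
    scale = alpha / r  # 16 / 8 = 2.0 with the settings listed above
    base = matvec(w0, x)                       # frozen base projection
    delta = matvec(lora_b, matvec(lora_a, x))  # low-rank update of rank r
    return [b + scale * d for b, d in zip(base, delta)]


# Toy layer: 2 inputs, 2 outputs, rank 8.
w0 = [[1.0, 0.0], [0.0, 1.0]]                # frozen weight (identity here)
lora_a = [[1.0, 0.0] for _ in range(8)]      # A has shape (r, d_in)
lora_b_init = [[0.0] * 8 for _ in range(2)]  # B has shape (d_out, r), zero at start
lora_b_trained = [[0.5] * 8, [0.0] * 8]      # hypothetical values after training

x = [1.0, 2.0]
out_init = lora_forward(x, w0, lora_a, lora_b_init)
out_trained = lora_forward(x, w0, lora_a, lora_b_trained)
```

Because B starts at zero in standard LoRA, out_init equals the base layer output; only training moves the adapted layer away from the base model.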
 
Training Configuration:
- Optimizer: AdamW (fused)
 - Learning rate: 3e-4
 - LR scheduler: cosine (with warmup)
 - Warmup ratio: 0.03
 - Batch size: 1 (per device)
 - Gradient accumulation: 1
 - Epochs: 1
 - Precision: bfloat16
 - Max sequence length: 4096
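
The learning-rate settings above can be sketched as a function of the optimizer step. This assumes linear warmup from zero followed by cosine decay to zero, which is how the Hugging Face cosine schedule behaves; the exact scheduler implementation and the total step count are not stated in this card, so `total_steps` below is a hypothetical value.

```python
import math


def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_ratio=0.03):
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the real step count is not stated here
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
```

With these numbers the warmup covers the first 30 steps, the rate peaks at 3e-4, and it decays smoothly to zero by the final step.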
 
Memory Efficiency:
- Vision encoder: Frozen
 - Vision projector: Trainable
 - Language model: LoRA fine-tuned
 - Trainable parameters: < 5% of total
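
As a rough sanity check on the "< 5%" figure, the sketch below counts the parameters LoRA would add to the language model's linear layers at r = 8. The layer shapes are the published Llama-3.1-8B dimensions and are an assumption about this checkpoint; the trainable vision projector and any adapters outside the language model are not counted, so the result is a lower bound on the real trainable total.

```python
# Assumed Llama-3.1-8B decoder shapes (in_features, out_features) per layer.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features, out_features, r=8):
    """LoRA adds A with shape (r, in_features) and B with shape (out_features, r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in LINEAR_SHAPES.values())
total_lora = per_layer * NUM_LAYERS
share_of_8b = total_lora / 8e9  # approximate fraction of an 8B-parameter model
```

Under these assumptions the adapters come to about 21 million parameters, roughly 0.3% of an 8B-parameter model, which is consistent with the stated bound.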
 
Dataset
Fine-tuned on the Nayana-cognitivelab/SectionOCR-SFT-augment dataset:
- en-en subset: 30,000 samples (English documents)
 - kn-kn subset: 100,000 samples (Kannada documents)
 - Total: 130,000 training samples
 
The dataset contains document images paired with accurate OCR text extractions for training the model to perform section-level OCR.
Evaluation
The model is evaluated using standard OCR metrics:
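- Character Error Rate (CER): Character-level edit distance divided by the reference length (lower is better)
 - Word Error Rate (WER): Word-level edit distance divided by the number of reference words (lower is better)
 - Exact Match Accuracy: Percentage of extractions that match the reference exactly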
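
The metrics listed above can be computed with a plain Levenshtein edit distance. The following is a minimal sketch, not the evaluation script used for this model; the function names are illustrative, and it assumes no text normalization (casing, whitespace, punctuation) beyond whitespace splitting for WER. Note that Python strings are compared per Unicode code point, so a Kannada character with a vowel sign counts as more than one unit here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            # Minimum of deletion, insertion, and substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))


def wer(reference, hypothesis):
    """Word Error Rate: word edits / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(1, len(ref_words))


def exact_match(references, hypotheses):
    """Fraction of hypotheses identical to their reference."""
    pairs = list(zip(references, hypotheses))
    return sum(r == h for r, h in pairs) / max(1, len(pairs))
```

Both error rates can exceed 1.0 when the model output is much longer than the reference, since insertions are counted against the reference length.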
 
Limitations
- Inherits limitations from the base Nemotron VL model
 - Optimized for Section OCR tasks; may not perform as well on other vision-language tasks
 - Performance depends on document image quality and clarity
 - Currently trained on English and Kannada; performance on other languages may vary
 
Citation
Base Model:
@software{nemotron-vl,
  title = {Llama-3.1-Nemotron-Nano-VL-8B-V1},
  author = {NVIDIA},
  year = {2024},
  url = {https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1}
}
Dataset:
@dataset{sectionocr2024,
  title = {SectionOCR-SFT-augment},
  author = {Nayana CognitiveLab},
  year = {2024},
  url = {https://huggingface.co/datasets/Nayana-cognitivelab/SectionOCR-SFT-augment}
}
License
This model inherits the license from the base model: NVIDIA Open Model License
Acknowledgments
- Base model by NVIDIA
 - Fine-tuning framework: Hugging Face Transformers + PEFT
 - Training infrastructure: Modal Labs
 