SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline)
Fine-tuned for thesis comparison: Generic BIRD pre-training vs Domain-Specific Training
This model is a fine-tuned version of defog/sqlcoder-7b-2 on the taherdoust/ai4cimdb dataset for City Information Modeling (CIM) spatial SQL generation.
Model Description
Purpose: Academic baseline for thesis comparison - demonstrates the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions.
Training Strategy: Minimal adaptation (1 epoch only) to measure transfer learning effectiveness from BIRD to PostGIS spatial SQL domain.
Key Characteristics
- Base Model: SQLCoder 7B-2 (CodeLlama-based, pre-trained on Spider + BIRD + commercial SQL)
- Training: 1 epoch fine-tuning on CIM spatial SQL dataset
- Training Time: 71.7 hours on NVIDIA Quadro RTX 6000 24GB
- Method: QLoRA (4-bit quantization + LoRA rank 16)
- Trainable Parameters: 39,976,960 (0.59% of 6.78B total)
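The trainable-parameter count can be reproduced from the LoRA configuration reported below (rank 16, seven projection modules, 32 layers). A minimal sketch of that arithmetic, assuming the standard CodeLlama-7B dimensions (hidden size 4096, MLP intermediate size 11008):

```python
# Reproduce the reported trainable-parameter count from the LoRA setup.
# Assumes CodeLlama-7B dimensions: hidden 4096, MLP intermediate 11008, 32 layers.
r = 16
hidden, intermediate, layers = 4096, 11008, 32

# Each LoRA adapter adds r * (in_features + out_features) parameters per module.
attn = 4 * r * (hidden + hidden)        # q_proj, k_proj, v_proj, o_proj
mlp = 3 * r * (hidden + intermediate)   # gate_proj, up_proj, down_proj

total = layers * (attn + mlp)
print(total)  # 39,976,960 -> matches the reported trainable parameters
```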
Research Question
How does a generic text-to-SQL model perform on specialized spatial SQL tasks?
This model establishes a baseline for comparing:
- Generic BIRD pre-training (standard SQL) vs Domain-specific training (PostGIS spatial)
- Transfer learning effectiveness for specialized SQL dialects
- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.)
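To make the spatial-function gap concrete, the model must map questions onto queries of roughly the following shape. The table and column names come from the schema excerpt used in the usage example below; the record format and the exact gold query are illustrative assumptions, not dataset contents:

```python
# Illustrative question/query pair of the kind this baseline struggles with.
# Assumes geometries are stored in a metric projected CRS so ST_DWithin
# distances are in meters.
example = {
    "question": "Find all buildings within 100 meters of census zone SEZ123",
    "sql": """
        SELECT b.building_id
        FROM cim_vector.cim_wizard_building AS b
        JOIN cim_census.censusgeo AS c
          ON ST_DWithin(b.building_geometry, c.census_geometry, 100)
        WHERE c.sez2011 = 'SEZ123';
    """,
}
print(example["sql"])
```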
Intended Use
Direct Use
Generate PostGIS spatial SQL queries for City Information Modeling databases from natural language questions.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

question = "Find all buildings within 100 meters of census zone SEZ123"

prompt = f"""### Task
Generate a SQL query to answer the question.

### Database Schema
- cim_vector.cim_wizard_building (building_id, building_geometry, project_id)
- cim_census.censusgeo (id, sez2011, census_geometry, population)

### Question
{question}

### SQL Query
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql)
```
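Because the decoded text contains the prompt followed by the generated query, a small post-processing step is usually needed to isolate the SQL. This is one possible approach keyed to the "### SQL Query" marker in the prompt above, not a required part of the pipeline:

```python
# Keep only the text generated after the "### SQL Query" marker.
generated_sql = sql.split("### SQL Query")[-1].strip()

# Optionally cut at the first semicolon to drop any trailing free-form text.
if ";" in generated_sql:
    generated_sql = generated_sql[: generated_sql.index(";") + 1]

print(generated_sql)
```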
Thesis Comparison Use
Compare this model's performance against domain-specific models to quantify the PostGIS knowledge gap:
Expected Performance:
- Standard SQL: 85-90% (similar to domain models)
- PostGIS Spatial Functions: 30-50% (vs 85-92% domain models) ← Gap
- CIM Domain Terms: 40-60% (vs 85-90% domain models)
- Overall EX Accuracy: 60-75% baseline → 75-85% after 1 epoch fine-tuning
Training Details
Training Data
- Dataset: taherdoust/ai4cimdb
- Training Samples: 88,480 (70% of 126,400 curated)
- Validation Samples: 18,960 (15%)
- Test Samples: 18,960 (15%)
- Total Raw Samples: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini)
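A minimal sketch of loading the dataset with the datasets library; the split and column layout are assumptions about how the ai4cimdb repository is organized, so check the dataset card for the actual configuration:

```python
from datasets import load_dataset

# Load the CIM spatial text-to-SQL dataset from the Hub.
dataset = load_dataset("taherdoust/ai4cimdb")
print(dataset)

# Assumed layout: a "train" split of question/SQL pairs.
train = dataset["train"]
print(train[0])
```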
Training Procedure
QLoRA Configuration:
BitsAndBytesConfig:
- load_in_4bit: True
- bnb_4bit_quant_type: "nf4"
- bnb_4bit_compute_dtype: bfloat16
LoraConfig:
- r: 16
- lora_alpha: 32
- lora_dropout: 0.1
- target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
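The same configuration expressed as code, as a sketch with the bitsandbytes and PEFT APIs (task_type is added here as the usual setting for causal-LM fine-tuning):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```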
Training Hyperparameters:
- Epochs: 1 (minimal adaptation for baseline comparison)
- Batch Size: 2 per device
- Gradient Accumulation: 8 steps (effective batch size: 16)
- Learning Rate: 2.0e-4 (higher for 1-epoch training)
- LR Scheduler: Cosine with 10% warmup
- Optimizer: Paged AdamW 8-bit
- Precision: bfloat16
- Gradient Checkpointing: Enabled
- Max Sequence Length: 2048 tokens
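A sketch of the same hyperparameters with transformers.TrainingArguments; the output path and logging/saving cadence are assumptions, and the 2048-token maximum sequence length is typically passed to the trainer (e.g. SFTTrainer) rather than to TrainingArguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sqlcoder-7b-cim-qlora",   # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,        # effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
)
```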
Training Results:
- Training Time: 71 hours 43 minutes (258,221 seconds)
- Final Training Loss: 0.0980
- Training Speed: 0.343 samples/sec
- Total Steps: 5,530 steps
- Hardware: NVIDIA Quadro RTX 6000 (24GB VRAM)
Database Schema Context
CIM Database (PostgreSQL + PostGIS):
cim_vector Schema:
- cim_wizard_building: Building geometries (POLYGON)
- cim_wizard_building_properties: Building attributes (height, area, energy)
- cim_wizard_project_scenario: Project and scenario metadata
- network_buses, network_lines: Electrical grid infrastructure (POINT, LINESTRING)
cim_census Schema:
- censusgeo: Italian ISTAT 2011 census zones (POLYGON)
- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5)
cim_raster Schema:
- dtm: Digital Terrain Model (RASTER)
- dsm: Digital Surface Model (RASTER)
- Building height calculation via raster-vector operations
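As an illustration of the raster-vector operations mentioned above, building height can be estimated as the DSM minus DTM value sampled under each footprint. The raster column name ("rast") and the centroid-sampling strategy are assumptions for illustration, not the database's confirmed layout:

```python
# Illustrative raster-vector query: building height as DSM minus DTM
# sampled at the footprint centroid.
height_query = """
SELECT b.building_id,
       ST_Value(dsm.rast, ST_Centroid(b.building_geometry))
     - ST_Value(dtm.rast, ST_Centroid(b.building_geometry)) AS estimated_height_m
FROM cim_vector.cim_wizard_building AS b
JOIN cim_raster.dsm AS dsm ON ST_Intersects(dsm.rast, ST_Centroid(b.building_geometry))
JOIN cim_raster.dtm AS dtm ON ST_Intersects(dtm.rast, ST_Centroid(b.building_geometry));
"""
print(height_query)
```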
Evaluation Metrics
Expected Performance (Thesis Hypothesis)
| Metric | SQLCoder (BIRD) | Domain Models | Gap |
|---|---|---|---|
| Standard SQL | 85-90% | 85-92% | ±2-5% |
| PostGIS Functions | 30-50% | 85-92% | 35-50% ← Research Gap |
| CIM Domain Terms | 40-60% | 85-90% | 25-40% |
| Multi-Schema | 60-75% | 82-90% | 15-25% |
| Overall EX | 60-75% | 82-92% | 15-25% |
Hypothesized Finding: Generic BIRD-trained models are expected to struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance.
Evaluation Modes
- EM (Exact Match): String-level comparison (25-35% expected)
- EX (Execution Accuracy): Result-level comparison (60-75% expected)
- EA (Eventual Accuracy): Agent mode with self-correction (65-80% expected)
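A minimal sketch of how execution accuracy (EX) can be computed by running the predicted and gold queries against the CIM PostGIS database and comparing result sets; the connection string is a placeholder and the exact evaluation protocol used in the thesis may differ:

```python
import psycopg2

def run_query(dsn: str, sql: str):
    """Run one query in its own connection; return the result set, or None on error."""
    try:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(sql)
            return set(map(tuple, cur.fetchall()))
    except psycopg2.Error:
        return None  # invalid SQL counts as a miss

def execution_accuracy(pred_sql: str, gold_sql: str, dsn: str) -> bool:
    """EX: the predicted query executes and returns the same rows as the gold query."""
    pred_rows = run_query(dsn, pred_sql)
    gold_rows = run_query(dsn, gold_sql)
    return pred_rows is not None and pred_rows == gold_rows

# Placeholder DSN; point it at the CIM PostGIS database.
# print(execution_accuracy(generated_sql, gold_sql, "dbname=cim host=localhost"))
```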
Academic Contribution
Thesis Context
This model serves as a controlled baseline for demonstrating:
- PostGIS Knowledge Gap: BIRD pre-training lacks spatial function exposure
- Domain Terminology Gap: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training
- Transfer Learning Limits: 1 epoch fine-tuning improves but doesn't close the gap
- Multi-Schema Complexity: Cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models
Comparison Framework
Models for Comparison:
- This Model (SQLCoder 7B BIRD): Generic baseline
- Llama 3.1 8B (Domain-specific): 3 epochs on CIM data
- Qwen 2.5 14B (Domain-specific): 3 epochs on CIM data
- DeepSeek-Coder 6.7B (Domain-specific): 3 epochs on CIM data
Environmental Impact
- Hardware: NVIDIA Quadro RTX 6000 (24GB VRAM, 250W TDP)
- Training Time: 71.7 hours
- Estimated Energy: ~17.9 kWh (250W × 71.7h)
- Carbon Footprint: ~7.2 kg CO₂ (401 g CO₂/kWh, Italy grid 2024)
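The energy and carbon figures follow directly from the reported power draw, runtime, and grid intensity:

```python
# Back-of-the-envelope check of the reported figures.
power_kw = 0.250         # RTX 6000 TDP in kW
hours = 71.7             # training wall-clock time
grid_g_per_kwh = 401     # Italy grid carbon intensity, 2024 (g CO2 per kWh)

energy_kwh = power_kw * hours                 # ~17.9 kWh
co2_kg = energy_kwh * grid_g_per_kwh / 1000   # ~7.2 kg CO2
print(round(energy_kwh, 1), round(co2_kg, 1))
```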
Technical Specifications
Model Architecture
- Base: CodeLlama 7B (Llama architecture, ~6.7B parameters)
- Attention: Multi-head attention with 32 heads
- Layers: 32 transformer layers
- Vocabulary: 32,016 tokens
- Context Window: 16,384 tokens
- Activation: SwiGLU
LoRA Adapter
- Adapter Size: ~260 MB (safetensors)
- Trainable Params: 39,976,960 (0.59%)
- Target Modules: 7 modules (q/k/v/o/gate/up/down projections)
- Rank: 16
- Alpha: 32
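If the repository is consumed as a standalone LoRA adapter rather than merged weights, it can be attached to the quantized base model with PEFT. Whether this repo ships merged weights or only the adapter is an assumption to verify on the Files tab:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "defog/sqlcoder-7b-2"
adapter_id = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"

# Load the 4-bit quantized base model, then attach the ~260 MB LoRA adapter.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
```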
Citation
@misc{taherdoust2025sqlcoder_cim,
title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison},
author={Taherdoust, Ali},
year={2025},
institution={Politecnico di Torino},
note={Fine-tuned on ai4cimdb dataset for thesis comparison:
Generic BIRD pre-training vs Domain-specific PostGIS training},
url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison}
}
@software{defog2024sqlcoder,
title={SQLCoder: A State-of-the-Art LLM for SQL Generation},
author={Defog.ai},
year={2024},
url={https://github.com/defog-ai/sqlcoder}
}
Acknowledgments
- Base Model: defog/sqlcoder-7b-2 by Defog.ai
- Dataset: taherdoust/ai4cimdb (176K samples)
- Institution: Politecnico di Torino, DENERG Department
- Infrastructure: ECLab ipazia126 GPU server
- Frameworks: HuggingFace Transformers, PEFT, BitsAndBytes
License
Apache 2.0 (inherited from SQLCoder base model)
Model Card Authors
Ali Taherdoust (Politecnico di Torino)
Model Card Contact