# Llama-3-3B CodeSearchNet Fine-tuned

This repository hosts a **Llama 3 (3B) model** fine-tuned on the **CodeSearchNet dataset**, which contains code in six programming languages.

## πŸ“ Model Details

- **Base Model**: Llama 3 (3B)
- **Fine-tuning Dataset**: CodeSearchNet
- **Languages Covered**: Python, Java, JavaScript, PHP, Ruby, Go
- **Training Method**: Supervised fine-tuning (SFT) with a contrastive loss objective for code search tasks
- **Tokenization**: Llama 3 tokenizer with additional tokens for code-specific keywords
- **Frameworks Used**: Hugging Face `transformers`, PyTorch, and PEFT for LoRA-based tuning (a loading sketch follows below)
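
If the LoRA adapters are published separately from the base weights (this card does not specify the repository layout), loading might look like the sketch below. Both repo ids are placeholders, not confirmed paths:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder ids: substitute the actual base checkpoint and adapter repo.
BASE_ID = "meta-llama/Llama-3.2-3B"
ADAPTER_ID = "your-org/llama-3-3b-codesearchnet"

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER_ID)  # attach the LoRA weights
model = model.merge_and_unload()  # optionally fold the adapters into the base
```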

## πŸ“š Dataset

The model is fine-tuned on the **CodeSearchNet** dataset, which provides:
- Function-level code snippets
- Paired natural language descriptions
- Multiple programming languages for multi-language search support

### **Dataset Sources**
- [CodeSearchNet Dataset](https://github.com/github/CodeSearchNet): ~2M code snippets collected from open-source repositories
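
As a quick-start sketch, the corpus can be pulled from the Hugging Face Hub. The `code_search_net` dataset id and field names below follow the Hub's community loader; depending on your `datasets` version, `trust_remote_code=True` may be required:

```python
from datasets import load_dataset

# Load the Python portion of CodeSearchNet from the Hugging Face Hub.
ds = load_dataset("code_search_net", "python", split="train",
                  trust_remote_code=True)

sample = ds[0]
print(sample["func_documentation_string"])  # natural language description
print(sample["func_code_string"])           # paired function implementation
```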

## πŸš€ Training Setup

- **Hardware**: NVIDIA A100 GPUs
- **Batch Size**: 16
- **Learning Rate**: 2e-5 with cosine annealing
- **Max Sequence Length**: 512
- **Fine-tuning Duration**: 3 epochs (see the configuration sketch below)
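
One way to mirror these hyperparameters with `transformers.TrainingArguments` is sketched below; the contrastive objective itself would live in a custom `Trainer` subclass, which this card does not detail. The output path and `bf16` flag are assumptions:

```python
from transformers import TrainingArguments

# Hyperparameters taken from the list above. The 512-token limit is
# enforced at tokenization time rather than here.
training_args = TrainingArguments(
    output_dir="llama3-3b-codesearchnet-sft",  # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    bf16=True,  # assumed: mixed precision on the A100 hardware listed above
)
```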

## πŸ” Intended Use

- **Code Search**: Retrieve relevant code snippets given a natural language query (a retrieval sketch follows this list)
- **Code Completion**: Provide context-aware code suggestions
- **Code-to-Text Generation**: Explain code functionality in natural language
- **Multi-language Code Retrieval**: Search across different programming languages
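
A minimal retrieval sketch for the code search use case follows. It assumes the `model` and `tokenizer` objects from the loading example above and mean-pools the last hidden state; the pooling strategy actually used during fine-tuning is not documented in this card:

```python
import torch
import torch.nn.functional as F

# Llama tokenizers ship without a pad token; reuse EOS for batching.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def embed(texts):
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        hidden = model(**batch, output_hidden_states=True).hidden_states[-1]
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

# Toy corpus; in practice these would be indexed function bodies.
snippets = [
    "def read_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def add(a, b):\n    return a + b",
]
query = embed(["parse a JSON file into a dictionary"])
scores = query @ embed(snippets).T  # cosine similarity per snippet
print(snippets[scores.argmax().item()])
```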