# Llama-3-3B CodeSearchNet Fine-tuned

This repository hosts a **Llama 3 (3B) model** fine-tuned on the **CodeSearchNet dataset**, which contains code in six programming languages.
## Model Details

- **Base Model**: Llama 3 (3B)
- **Fine-tuning Dataset**: CodeSearchNet
- **Languages Covered**: Python, Java, JavaScript, PHP, Ruby, Go
- **Training Method**: Supervised fine-tuning (SFT) with a contrastive loss objective for code search tasks
- **Tokenization**: Llama 3 tokenizer with additional tokens for code-specific keywords
- **Frameworks Used**: Hugging Face `transformers`, PyTorch, and PEFT for LoRA-based tuning (see the loading sketch below)
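
A minimal loading sketch using `transformers` and `peft`. The adapter repo id is a placeholder, and `meta-llama/Llama-3.2-3B` is assumed as the 3B base checkpoint since the card only says "Llama 3 (3B)"; adjust both to your actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-3B"               # assumed 3B base checkpoint
adapter_id = "your-org/llama-3-3b-codesearchnet"  # placeholder repo id for this adapter

tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Attach the LoRA adapter produced by fine-tuning on top of the base weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()
```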
## Dataset

The model is trained on the **CodeSearchNet** dataset, which contains:

- Function-level code snippets
- Paired natural language descriptions
- Multiple programming languages for multi-language search support

### Dataset Sources

- [CodeSearchNet Dataset](https://github.com/github/CodeSearchNet)
- Contains ~2M code snippets from open-source repositories (a loading example follows below)
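
A sketch of pulling the code/description pairs with the `datasets` library. It assumes the Hugging Face mirror of CodeSearchNet (`code_search_net`) and its field names; recent `datasets` releases may require an older version or a parquet mirror, since the original uses a loading script.

```python
from datasets import load_dataset

# Load the Python subset; other configs cover Java, JavaScript, PHP, Ruby, Go.
ds = load_dataset("code_search_net", "python", split="train")

def to_pair(example):
    # Each record pairs a function body with its natural language docstring.
    return {
        "query": example["func_documentation_string"],
        "code": example["func_code_string"],
    }

pairs = ds.map(to_pair, remove_columns=ds.column_names)
print(pairs[0]["query"][:80])
```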
## Training Setup

- **Hardware**: NVIDIA A100 GPUs
- **Batch Size**: 16
- **Learning Rate**: 2e-5 with cosine annealing
- **Max Sequence Length**: 512
- **Fine-tuning Duration**: 3 epochs (see the configuration sketch below)
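
The hyperparameters above, wired into a LoRA/SFT configuration sketch. The LoRA rank, alpha, and target modules are illustrative assumptions; the card does not specify them.

```python
from transformers import TrainingArguments
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                  # assumed rank (not stated in the card)
    lora_alpha=32,                         # assumed scaling (not stated in the card)
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="llama3-3b-codesearchnet",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",            # cosine annealing schedule
    num_train_epochs=3,
    bf16=True,                             # A100s support bfloat16
)

# The 512-token max sequence length is enforced when tokenizing the code/query
# pairs (e.g. truncation=True, max_length=512) before they reach the trainer.
```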
## Intended Use

- **Code Search**: Retrieve relevant code snippets given a natural language query
- **Code Completion**: Provide context-aware code suggestions
- **Code-to-Text Generation**: Explain code functionality in natural language (a usage sketch follows below)
- **Multi-language Code Retrieval**: Search across different programming languages
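
A usage sketch for the code-to-text case, reusing the `model` and `tokenizer` from the loading example above. The prompt format is an assumption, since the card does not document a template.

```python
prompt = (
    "### Code:\n"
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n\n"
    "### Explanation:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```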