{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HRHUB v2.1 - Enhanced with LLM (FREE VERSION)\n", "\n", "## Project Overview\n", "\n", "**Bilateral HR Matching System with LLM-Powered Intelligence**\n", "\n", "### What's New in v2.1:\n", "- ✅ **FREE LLM**: Using Hugging Face Inference API (no cost)\n", "- ✅ **Job Level Classification**: Zero-shot & few-shot learning\n", "- ✅ **Structured Skills Extraction**: Pydantic schemas\n", "- ✅ **Match Explainability**: LLM-generated reasoning\n", "- ✅ **Flexible Data Loading**: Upload OR Google Drive\n", "\n", "### Tech Stack:\n", "```\n", "Embeddings: sentence-transformers (local, free)\n", "LLM: Hugging Face Inference API (free tier)\n", "Schemas: Pydantic\n", "Platform: Google Colab → VS Code\n", "```\n", "\n", "---\n", "\n", "**Master's Thesis - Aalborg University**  \n", "*Business Data Science Program*  \n", "*December 2025*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 📦 Step 1: Install Dependencies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ All packages installed!\n" ] } ], "source": [ "# Install required packages (uncomment on first run)\n", "#!pip install -q sentence-transformers huggingface-hub pydantic python-dotenv plotly pyvis nbformat scikit-learn pandas numpy\n", "\n", "print(\"✅ All packages installed!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 2: Import Libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ Environment variables loaded from .env\n", "✅ All libraries imported!\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import json\n", "import os\n", "from typing import List, Dict, Optional, Literal\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# ML & NLP\n", "from sentence_transformers import 
SentenceTransformer\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "\n", "# LLM Integration (FREE)\n", "from huggingface_hub import InferenceClient\n", "from pydantic import BaseModel, Field\n", "\n", "# Visualization\n", "import plotly.graph_objects as go\n", "from IPython.display import HTML, display\n", "\n", "# Configuration Settings\n", "from dotenv import load_dotenv\n", "\n", "# Load variables from .env\n", "load_dotenv()\n", "print(\"✅ Environment variables loaded from .env\")\n", "# ============== UP TO HERE ⬆️ ==============\n", "\n", "print(\"✅ All libraries imported!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 3: Configuration" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ Configuration loaded!\n", "Embedding model: all-MiniLM-L6-v2\n", "🤖 LLM model: meta-llama/Llama-3.2-3B-Instruct\n", "HF Token configured: Yes ✅\n", "Data path: ../csv_files/\n" ] } ], "source": [ "class Config:\n", "    \"\"\"Centralized configuration for VS Code\"\"\"\n", "    \n", "    # Paths - VS Code structure\n", "    CSV_PATH = '../csv_files/'\n", "    PROCESSED_PATH = '../processed/'\n", "    RESULTS_PATH = '../results/'\n", "    \n", "    # Embedding Model\n", "    EMBEDDING_MODEL = 'all-MiniLM-L6-v2'\n", "    \n", "    # LLM Settings (FREE - Hugging Face)\n", "    HF_TOKEN = os.getenv('HF_TOKEN', '')  # ← read from .env\n", "    LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'\n", "    \n", "    LLM_MAX_TOKENS = 1000\n", "    \n", "    # Matching Parameters\n", "    TOP_K_MATCHES = 10\n", "    SIMILARITY_THRESHOLD = 0.5\n", "    RANDOM_SEED = 42\n", "\n", "np.random.seed(Config.RANDOM_SEED)\n", "\n", "print(\"✅ Configuration loaded!\")\n", "print(f\"Embedding model: {Config.EMBEDDING_MODEL}\")\n", "print(f\"🤖 LLM model: {Config.LLM_MODEL}\")\n", "print(f\"HF Token configured: {'Yes ✅' if Config.HF_TOKEN else 'No ⚠️'}\")\n", "print(f\"Data path: {Config.CSV_PATH}\")" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 5: Load All Datasets" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading all datasets...\n", "\n", "======================================================================\n", "✅ Candidates: 9,544 rows × 35 columns\n", "✅ Companies (base): 24,473 rows\n", "✅ Company industries: 24,375 rows\n", "✅ Company specialties: 169,387 rows\n", "✅ Employee counts: 35,787 rows\n", "✅ Postings: 123,849 rows × 31 columns\n", "✅ Job skills: 213,768 rows\n", "✅ Job industries: 164,808 rows\n", "\n", "======================================================================\n", "✅ All datasets loaded successfully!\n", "\n" ] } ], "source": [ "print(\"Loading all datasets...\\n\")\n", "print(\"=\" * 70)\n", "\n", "# Load main datasets\n", "candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')\n", "print(f\"✅ Candidates: {len(candidates):,} rows × {len(candidates.columns)} columns\")\n", "\n", "companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')\n", "print(f\"✅ Companies (base): {len(companies_base):,} rows\")\n", "\n", "company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')\n", "print(f\"✅ Company industries: {len(company_industries):,} rows\")\n", "\n", "company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')\n", "print(f\"✅ Company specialties: {len(company_specialties):,} rows\")\n", "\n", "employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')\n", "print(f\"✅ Employee counts: {len(employee_counts):,} rows\")\n", "\n", "postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')\n", "print(f\"✅ Postings: {len(postings):,} rows × {len(postings.columns)} columns\")\n", "\n", "# Optional datasets\n", "try:\n", "    job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')\n", "    print(f\"✅ Job skills: 
{len(job_skills):,} rows\")\n", "except FileNotFoundError:\n", "    job_skills = None\n", "    print(\"⚠️ Job skills not found (optional)\")\n", "\n", "try:\n", "    job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')\n", "    print(f\"✅ Job industries: {len(job_industries):,} rows\")\n", "except FileNotFoundError:\n", "    job_industries = None\n", "    print(\"⚠️ Job industries not found (optional)\")\n", "\n", "print(\"\\n\" + \"=\" * 70)\n", "print(\"✅ All datasets loaded successfully!\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 6: Merge & Enrich Company Data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Merging company data...\n", "\n", "✅ Aggregated industries for 24,365 companies\n", "✅ Aggregated specialties for 17,780 companies\n", "\n", "✅ Base company merge complete: 35,787 companies\n", "\n" ] } ], "source": [ "print(\"Merging company data...\\n\")\n", "\n", "# Aggregate industries\n", "company_industries_agg = company_industries.groupby('company_id')['industry'].apply(\n", "    lambda x: ', '.join(map(str, x.tolist()))\n", ").reset_index()\n", "company_industries_agg.columns = ['company_id', 'industries_list']\n", "print(f\"✅ Aggregated industries for {len(company_industries_agg):,} companies\")\n", "\n", "# Aggregate specialties\n", "company_specialties_agg = company_specialties.groupby('company_id')['speciality'].apply(\n", "    lambda x: ' | '.join(x.astype(str).tolist())\n", ").reset_index()\n", "company_specialties_agg.columns = ['company_id', 'specialties_list']\n", "print(f\"✅ Aggregated specialties for {len(company_specialties_agg):,} companies\")\n", "\n", "# Merge all company data\n", "companies_merged = companies_base.copy()\n", "companies_merged = companies_merged.merge(company_industries_agg, on='company_id', how='left')\n", "companies_merged = companies_merged.merge(company_specialties_agg, on='company_id', how='left')\n", "companies_merged = 
companies_merged.merge(employee_counts, on='company_id', how='left')\n", "# Note: employee_counts can hold several time_recorded snapshots per company,\n", "# so this merge can repeat company rows and the count below is in rows\n", "\n", "print(f\"\\n✅ Base company merge complete: {len(companies_merged):,} companies\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 7: Enrich with Job Postings" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enriching companies with job posting data...\n", "\n", "======================================================================\n", "KEY INSIGHT: Postings = 'Requirements Language Bridge'\n", "======================================================================\n", "\n", "✅ Enriched 35,787 companies with posting data\n", "\n" ] } ], "source": [ "print(\"Enriching companies with job posting data...\\n\")\n", "print(\"=\" * 70)\n", "print(\"KEY INSIGHT: Postings = 'Requirements Language Bridge'\")\n", "print(\"=\" * 70 + \"\\n\")\n", "\n", "postings = postings.fillna('')\n", "postings['company_id'] = postings['company_id'].astype(str)\n", "# Caveat: if company_id was parsed as float, astype(str) yields keys such as '1009.0'\n", "# that will not match the keys in companies_merged; check the key format before merging\n", "\n", "# Aggregate postings per company\n", "postings_agg = postings.groupby('company_id').agg({\n", "    'title': lambda x: ' | '.join(x.astype(str).tolist()[:10]),\n", "    'description': lambda x: ' '.join(x.astype(str).tolist()[:5]),\n", "    'skills_desc': lambda x: ' | '.join(x.dropna().astype(str).tolist()),\n", "    'formatted_experience_level': lambda x: ' | '.join(x.dropna().unique().astype(str)),\n", "}).reset_index()\n", "\n", "postings_agg.columns = ['company_id', 'posted_job_titles', 'posted_descriptions', 'required_skills', 'experience_levels']\n", "\n", "companies_merged['company_id'] = companies_merged['company_id'].astype(str)\n", "companies_full = companies_merged.merge(postings_agg, on='company_id', how='left').fillna('')\n", "\n", "print(f\"✅ Enriched {len(companies_full):,} companies with posting data\\n\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | company_id | name | description | company_size | state | country | city | zip_code | address | url | industries_list | specialties_list | employee_count | follower_count | time_recorded | posted_job_titles | posted_descriptions | required_skills | experience_levels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1009 | IBM | At IBM, we do more than work. We create. We cr... | 7.0 | NY | US | Armonk, New York | 10504 | International Business Machines Corp. | https://www.linkedin.com/company/ibm | IT Services and IT Consulting | Cloud \| Mobile \| Cognitive \| Security \| Resear... | 314102 | 16253625 | 1712378162 |  |  |  |  |
| 1 | 1009 | IBM | At IBM, we do more than work. We create. We cr... | 7.0 | NY | US | Armonk, New York | 10504 | International Business Machines Corp. | https://www.linkedin.com/company/ibm | IT Services and IT Consulting | Cloud \| Mobile \| Cognitive \| Security \| Resear... | 313142 | 16309464 | 1713392385 |  |  |  |  |
| 2 | 1009 | IBM | At IBM, we do more than work. We create. We cr... | 7.0 | NY | US | Armonk, New York | 10504 | International Business Machines Corp. | https://www.linkedin.com/company/ibm | IT Services and IT Consulting | Cloud \| Mobile \| Cognitive \| Security \| Resear... | 313147 | 16309985 | 1713402495 |  |  |  |  |
| 3 | 1009 | IBM | At IBM, we do more than work. We create. We cr... | 7.0 | NY | US | Armonk, New York | 10504 | International Business Machines Corp. | https://www.linkedin.com/company/ibm | IT Services and IT Consulting | Cloud \| Mobile \| Cognitive \| Security \| Resear... | 311223 | 16314846 | 1713501255 |  |  |  |  |
| 4 | 1016 | GE HealthCare | Every day millions of people feel the impact o... | 7.0 | 0 | US | Chicago | 0 | - | https://www.linkedin.com/company/gehealthcare | Hospitals and Health Care | Healthcare \| Biotechnology | 56873 | 2185368 | 1712382540 |  |  |  |  |