{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ๐ง HRHUB v2.1 - Enhanced with LLM (FREE VERSION)\n", "\n", "## ๐ Project Overview\n", "\n", "**Bilateral HR Matching System with LLM-Powered Intelligence**\n", "\n", "### What's New in v2.1:\n", "- โ **FREE LLM**: Using Hugging Face Inference API (no cost)\n", "- โ **Job Level Classification**: Zero-shot & few-shot learning\n", "- โ **Structured Skills Extraction**: Pydantic schemas\n", "- โ **Match Explainability**: LLM-generated reasoning\n", "- โ **Flexible Data Loading**: Upload OR Google Drive\n", "\n", "### Tech Stack:\n", "```\n", "Embeddings: sentence-transformers (local, free)\n", "LLM: Hugging Face Inference API (free tier)\n", "Schemas: Pydantic\n", "Platform: Google Colab โ VS Code\n", "```\n", "\n", "---\n", "\n", "**Master's Thesis - Aalborg University** \n", "*Business Data Science Program* \n", "*December 2025*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## ๐ Step 1: Install Dependencies" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "โ All packages installed!\n" ] } ], "source": [ "# Install required packages\n", "#!pip install -q sentence-transformers huggingface-hub pydantic plotly pyvis nbformat scikit-learn pandas numpy\n", "\n", "print(\"โ All packages installed!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## ๐ Step 2: Import Libraries" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "โ Environment variables loaded from .env\n", "โ All libraries imported!\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import json\n", "import os\n", "from typing import List, Dict, Optional, Literal\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# ML & NLP\n", "from sentence_transformers import SentenceTransformer\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "\n", "# LLM Integration (FREE)\n", "from huggingface_hub import InferenceClient\n", "from pydantic import BaseModel, Field\n", "\n", "# Visualization\n", "import plotly.graph_objects as go\n", "from IPython.display import HTML, display\n", "\n", "# Configuration Settings\n", "from dotenv import load_dotenv\n", "\n", "# Carrega variรกveis do .env\n", "load_dotenv()\n", "print(\"โ Environment variables loaded from .env\")\n", "\n", "print(\"โ All libraries imported!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## ๐ Step 3: Configuration" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "โ Configuration loaded!\n", "๐ง Embedding model: all-MiniLM-L6-v2\n", "๐ค LLM model: meta-llama/Llama-3.2-3B-Instruct\n", "๐ HF Token configured: Yes โ \n", "๐ Data path: ../csv_files/\n" ] } ], "source": [ "class Config:\n", " \"\"\"Centralized configuration for VS Code\"\"\"\n", " \n", " # Paths - VS Code structure\n", " CSV_PATH = '../csv_files/'\n", " PROCESSED_PATH = '../processed/'\n", " RESULTS_PATH = '../results/'\n", " \n", " # Embedding Model\n", " EMBEDDING_MODEL = 'all-MiniLM-L6-v2'\n", " \n", " # LLM Settings (FREE - Hugging Face)\n", " HF_TOKEN = os.getenv('HF_TOKEN', '') # โ Pega do .env\n", " LLM_MODEL = 'meta-llama/Llama-3.2-3B-Instruct'\n", " \n", " LLM_MAX_TOKENS = 1000\n", " \n", " # Matching Parameters\n", " TOP_K_MATCHES = 10\n", " SIMILARITY_THRESHOLD = 
0.5\n", " RANDOM_SEED = 42\n", "\n", "np.random.seed(Config.RANDOM_SEED)\n", "\n", "print(\"โ Configuration loaded!\")\n", "print(f\"๐ง Embedding model: {Config.EMBEDDING_MODEL}\")\n", "print(f\"๐ค LLM model: {Config.LLM_MODEL}\")\n", "print(f\"๐ HF Token configured: {'Yes โ ' if Config.HF_TOKEN else 'No โ ๏ธ'}\")\n", "print(f\"๐ Data path: {Config.CSV_PATH}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## ๐๏ธ Step 4: Architecture - Text Builders\n", "\n", "**HIGH COHESION:** Each class has ONE responsibility\n", "**LOW COUPLING:** Classes don't depend on each other" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "โ Text Builder classes loaded\n", " โข CandidateTextBuilder\n", " โข CompanyTextBuilder\n" ] } ], "source": [ "# ============================================================================\n", "# TEXT BUILDER CLASSES - Single Responsibility Principle\n", "# ============================================================================\n", "\n", "from abc import ABC, abstractmethod\n", "from typing import List\n", "\n", "class TextBuilder(ABC):\n", " \"\"\"Abstract base class for text builders\"\"\"\n", " \n", " @abstractmethod\n", " def build(self, row: pd.Series) -> str:\n", " \"\"\"Build text representation from DataFrame row\"\"\"\n", " pass\n", " \n", " def build_batch(self, df: pd.DataFrame) -> List[str]:\n", " \"\"\"Build text representations for entire DataFrame\"\"\"\n", " return df.apply(self.build, axis=1).tolist()\n", "\n", "\n", "class CandidateTextBuilder(TextBuilder):\n", " \"\"\"Builds text representation for candidates\"\"\"\n", " \n", " def __init__(self, fields: List[str] = None):\n", " self.fields = fields or [\n", " 'Category',\n", " 'skills',\n", " 'career_objective',\n", " 'degree_names',\n", " 'positions'\n", " ]\n", " \n", " def build(self, row: pd.Series) -> str:\n", " parts = []\n", " \n", " if row.get('Category'):\n", " parts.append(f\"Job Category: {row['Category']}\")\n", " \n", " if row.get('skills'):\n", " parts.append(f\"Skills: {row['skills']}\")\n", " \n", " if row.get('career_objective'):\n", " parts.append(f\"Objective: {row['career_objective']}\")\n", " \n", " if row.get('degree_names'):\n", " parts.append(f\"Education: {row['degree_names']}\")\n", " \n", " if row.get('positions'):\n", " parts.append(f\"Experience: {row['positions']}\")\n", " \n", " return ' '.join(parts)\n", "\n", "\n", "class CompanyTextBuilder(TextBuilder):\n", " \"\"\"Builds text representation for companies\"\"\"\n", " \n", " def __init__(self, include_postings: bool = True):\n", " self.include_postings = include_postings\n", " \n", " def build(self, row: pd.Series) -> str:\n", " parts = []\n", " \n", " if row.get('name'):\n", " parts.append(f\"Company: {row['name']}\")\n", " \n", " if row.get('description'):\n", " parts.append(f\"Description: {row['description']}\")\n", " \n", " if row.get('industries_list'):\n", " parts.append(f\"Industries: {row['industries_list']}\")\n", " \n", " if row.get('specialties_list'):\n", " parts.append(f\"Specialties: {row['specialties_list']}\")\n", " \n", " # Include job postings data (THE BRIDGE!)\n", " if self.include_postings:\n", " if row.get('required_skills'):\n", " parts.append(f\"Required Skills: {row['required_skills']}\")\n", " \n", " if row.get('posted_job_titles'):\n", " parts.append(f\"Job Titles: {row['posted_job_titles']}\")\n", " \n", " if row.get('experience_levels'):\n", " parts.append(f\"Experience: 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 🏗️ Step 5: Architecture - Embedding Manager\n", "\n", "**Responsibility:** generate, save, and load embeddings" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ EmbeddingManager class loaded\n" ] } ], "source": [
"# ============================================================================\n",
"# EMBEDDING MANAGER - Handles all embedding operations\n",
"# ============================================================================\n",
"\n",
"from pathlib import Path\n",
"from typing import Tuple, Optional\n",
"\n",
"class EmbeddingManager:\n",
"    \"\"\"Manages embedding generation, saving, and loading\"\"\"\n",
"\n",
"    def __init__(self, model: SentenceTransformer, save_dir: str):\n",
"        self.model = model\n",
"        self.save_dir = Path(save_dir)\n",
"        self.save_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"    def _get_file_paths(self, entity_type: str) -> Tuple[Path, Path]:\n",
"        \"\"\"Get file paths for embeddings and metadata\"\"\"\n",
"        emb_file = self.save_dir / f\"{entity_type}_embeddings.npy\"\n",
"        meta_file = self.save_dir / f\"{entity_type}_metadata.pkl\"\n",
"        return emb_file, meta_file\n",
"\n",
"    def exists(self, entity_type: str) -> bool:\n",
"        \"\"\"Check if embeddings exist for entity type\"\"\"\n",
"        emb_file, _ = self._get_file_paths(entity_type)\n",
"        return emb_file.exists()\n",
"\n",
"    def load(self, entity_type: str) -> Tuple[np.ndarray, pd.DataFrame]:\n",
"        \"\"\"Load embeddings and metadata\"\"\"\n",
"        emb_file, meta_file = self._get_file_paths(entity_type)\n",
"\n",
"        if not emb_file.exists():\n",
"            raise FileNotFoundError(f\"Embeddings not found: {emb_file}\")\n",
"\n",
"        embeddings = np.load(emb_file)\n",
"        metadata = pd.read_pickle(meta_file) if meta_file.exists() else None\n",
"\n",
"        return embeddings, metadata\n",
"\n",
"    def generate(self,\n",
"                 texts: List[str],\n",
"                 batch_size: int = 32,\n",
"                 show_progress: bool = True) -> np.ndarray:\n",
"        \"\"\"Generate embeddings from texts\"\"\"\n",
"        return self.model.encode(\n",
"            texts,\n",
"            batch_size=batch_size,\n",
"            show_progress_bar=show_progress,\n",
"            normalize_embeddings=True,\n",
"            convert_to_numpy=True\n",
"        )\n",
"\n",
"    def save(self,\n",
"             entity_type: str,\n",
"             embeddings: np.ndarray,\n",
"             metadata: pd.DataFrame) -> None:\n",
"        \"\"\"Save embeddings and metadata\"\"\"\n",
"        emb_file, meta_file = self._get_file_paths(entity_type)\n",
"\n",
"        np.save(emb_file, embeddings)\n",
"        metadata.to_pickle(meta_file)\n",
"\n",
"        print(f\"💾 Saved:\")\n",
"        print(f\"   {emb_file}\")\n",
"        print(f\"   {meta_file}\")\n",
"\n",
"    def generate_and_save(self,\n",
"                          entity_type: str,\n",
"                          texts: List[str],\n",
"                          metadata: pd.DataFrame,\n",
"                          batch_size: int = 32) -> np.ndarray:\n",
"        \"\"\"Generate embeddings and save everything\"\"\"\n",
"        print(f\"🔄 Generating {entity_type} embeddings...\")\n",
"        print(f\"   Processing {len(texts):,} items...\")\n",
"\n",
"        embeddings = self.generate(texts, batch_size=batch_size)\n",
"        self.save(entity_type, embeddings, metadata)\n",
"\n",
"        return embeddings\n",
"\n",
"    def load_or_generate(self,\n",
"                         entity_type: str,\n",
"                         texts: List[str],\n",
"                         metadata: pd.DataFrame,\n",
"                         force_regenerate: bool = False) -> Tuple[np.ndarray, pd.DataFrame]:\n",
"        \"\"\"Load if exists, generate otherwise\"\"\"\n",
"\n",
"        if not force_regenerate and self.exists(entity_type):\n",
"            print(f\"📥 Loading {entity_type} embeddings...\")\n",
"            embeddings, _ = self.load(entity_type)  # caller-supplied metadata wins\n",
"\n",
"            # Verify alignment\n",
"            if len(embeddings) != len(metadata):\n",
"                print(f\"⚠️ Size mismatch! Regenerating...\")\n",
"                embeddings = self.generate_and_save(\n",
"                    entity_type, texts, metadata\n",
"                )\n",
"            else:\n",
"                print(f\"✅ Loaded: {embeddings.shape}\")\n",
"        else:\n",
"            embeddings = self.generate_and_save(\n",
"                entity_type, texts, metadata\n",
"            )\n",
"\n",
"        return embeddings, metadata\n",
"\n",
"\n",
"print(\"✅ EmbeddingManager class loaded\")" ] },
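{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of how the manager is meant to be wired up, assuming the model name and cache directory from `Config`. The two sample texts and the `demo` entity type are hypothetical; the first call generates and caches `.npy`/`.pkl` files, and later calls load them from disk. Because `generate` normalizes embeddings, cosine similarity downstream reduces to a dot product." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Illustrative wiring of EmbeddingManager (hypothetical 'demo' entity type)\n",
"model = SentenceTransformer(Config.EMBEDDING_MODEL)\n",
"manager = EmbeddingManager(model, save_dir=Config.PROCESSED_PATH)\n",
"\n",
"texts = ['Python developer with NLP experience', 'Retail sales associate']\n",
"meta = pd.DataFrame({'id': [0, 1]})\n",
"\n",
"emb, meta = manager.load_or_generate('demo', texts, meta)\n",
"print(emb.shape)  # (2, 384) - all-MiniLM-L6-v2 produces 384-dim vectors" ] },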
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 🏗️ Step 6: Architecture - Matching Engine\n", "\n", "**Responsibility:** calculate similarities and find matches" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ MatchingEngine class loaded\n" ] } ], "source": [
"# ============================================================================\n",
"# MATCHING ENGINE - Handles similarity calculations\n",
"# ============================================================================\n",
"\n",
"class MatchingEngine:\n",
"    \"\"\"Calculates similarities and finds top matches\"\"\"\n",
"\n",
"    def __init__(self,\n",
"                 candidate_vectors: np.ndarray,\n",
"                 company_vectors: np.ndarray,\n",
"                 candidate_metadata: pd.DataFrame,\n",
"                 company_metadata: pd.DataFrame):\n",
"\n",
"        self.cand_vectors = candidate_vectors\n",
"        self.comp_vectors = company_vectors\n",
"        self.cand_metadata = candidate_metadata\n",
"        self.comp_metadata = company_metadata\n",
"\n",
"        # Verify alignment\n",
"        assert len(candidate_vectors) == len(candidate_metadata), \\\n",
"            \"Candidate embeddings and metadata size mismatch\"\n",
"        assert len(company_vectors) == len(company_metadata), \\\n",
"            \"Company embeddings and metadata size mismatch\"\n",
"\n",
"    def find_matches(self,\n",
"                     candidate_idx: int,\n",
"                     top_k: int = 10) -> List[Tuple[int, float]]:\n",
"        \"\"\"Find top K company matches for a candidate\"\"\"\n",
"\n",
"        if candidate_idx >= len(self.cand_vectors):\n",
"            raise IndexError(f\"Candidate index {candidate_idx} out of range\")\n",
"\n",
"        # Get candidate vector\n",
"        cand_vec = self.cand_vectors[candidate_idx].reshape(1, -1)\n",
"\n",
"        # Calculate similarities\n",
"        similarities = cosine_similarity(cand_vec, self.comp_vectors)[0]\n",
"\n",
"        # Get top K\n",
"        top_indices = np.argsort(similarities)[::-1][:top_k]\n",
"\n",
"        # Return (index, score) tuples\n",
"        return [(int(idx), float(similarities[idx])) for idx in top_indices]\n",
"\n",
"    def get_match_details(self,\n",
"                          candidate_idx: int,\n",
"                          company_idx: int) -> dict:\n",
"        \"\"\"Get detailed match information\"\"\"\n",
"\n",
"        candidate = self.cand_metadata.iloc[candidate_idx]\n",
"        company = self.comp_metadata.iloc[company_idx]\n",
"\n",
"        # Calculate similarity\n",
"        cand_vec = self.cand_vectors[candidate_idx].reshape(1, -1)\n",
"        comp_vec = self.comp_vectors[company_idx].reshape(1, -1)\n",
"        similarity = float(cosine_similarity(cand_vec, comp_vec)[0][0])\n",
"\n",
"        return {\n",
"            'candidate': candidate.to_dict(),\n",
"            'company': company.to_dict(),\n",
"            'similarity_score': similarity\n",
"        }\n",
"\n",
"    def batch_match(self,\n",
"                    candidate_indices: List[int],\n",
"                    top_k: int = 10) -> dict:\n",
"        \"\"\"Find matches for multiple candidates\"\"\"\n",
"\n",
"        results = {}\n",
"        for idx in candidate_indices:\n",
"            results[idx] = self.find_matches(idx, top_k=top_k)\n",
"\n",
"        return results\n",
"\n",
"\n",
"print(\"✅ MatchingEngine class loaded\")" ] },
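{ "cell_type": "markdown", "metadata": {}, "source": [ "A toy sanity check of the engine on 2-dimensional stand-in vectors (all names and numbers below are made up). With the unit-length candidate vectors here, cosine similarity behaves just like it does on the real normalized embeddings." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Illustrative only: tiny vectors standing in for real embeddings\n",
"cand_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])\n",
"comp_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])\n",
"cand_meta = pd.DataFrame({'name': ['Candidate A', 'Candidate B']})\n",
"comp_meta = pd.DataFrame({'name': ['AdTech Co', 'BioLab', 'Generalist Inc']})\n",
"\n",
"engine = MatchingEngine(cand_vecs, comp_vecs, cand_meta, comp_meta)\n",
"print(engine.find_matches(0, top_k=2))   # [(0, ~0.99), (2, ~0.71)]\n",
"print(engine.get_match_details(0, 0)['similarity_score'])" ] },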
\"\"\"Find matches for multiple candidates\"\"\"\n", " \n", " results = {}\n", " for idx in candidate_indices:\n", " results[idx] = self.find_matches(idx, top_k=top_k)\n", " \n", " return results\n", "\n", "\n", "print(\"โ MatchingEngine class loaded\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## ๐ Step 7: Load All Datasets" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐ Loading all datasets...\n", "\n", "======================================================================\n", "โ Candidates: 9,544 rows ร 35 columns\n", "โ Companies (base): 24,473 rows\n", "โ Company industries: 24,375 rows\n", "โ Company specialties: 169,387 rows\n", "โ Employee counts: 35,787 rows\n", "โ Postings: 123,849 rows ร 31 columns\n", "โ Job skills: 213,768 rows\n", "โ Job industries: 164,808 rows\n", "\n", "======================================================================\n", "โ All datasets loaded successfully!\n", "\n" ] } ], "source": [ "print(\"๐ Loading all datasets...\\n\")\n", "print(\"=\" * 70)\n", "\n", "# Load main datasets\n", "candidates = pd.read_csv(f'{Config.CSV_PATH}resume_data.csv')\n", "print(f\"โ Candidates: {len(candidates):,} rows ร {len(candidates.columns)} columns\")\n", "\n", "companies_base = pd.read_csv(f'{Config.CSV_PATH}companies.csv')\n", "print(f\"โ Companies (base): {len(companies_base):,} rows\")\n", "\n", "company_industries = pd.read_csv(f'{Config.CSV_PATH}company_industries.csv')\n", "print(f\"โ Company industries: {len(company_industries):,} rows\")\n", "\n", "company_specialties = pd.read_csv(f'{Config.CSV_PATH}company_specialities.csv')\n", "print(f\"โ Company specialties: {len(company_specialties):,} rows\")\n", "\n", "employee_counts = pd.read_csv(f'{Config.CSV_PATH}employee_counts.csv')\n", "print(f\"โ Employee counts: {len(employee_counts):,} rows\")\n", "\n", "postings = pd.read_csv(f'{Config.CSV_PATH}postings.csv', on_bad_lines='skip', engine='python')\n", "print(f\"โ Postings: {len(postings):,} rows ร {len(postings.columns)} columns\")\n", "\n", "# Optional datasets\n", "try:\n", " job_skills = pd.read_csv(f'{Config.CSV_PATH}job_skills.csv')\n", " print(f\"โ Job skills: {len(job_skills):,} rows\")\n", "except:\n", " job_skills = None\n", " print(\"โ ๏ธ Job skills not found (optional)\")\n", "\n", "try:\n", " job_industries = pd.read_csv(f'{Config.CSV_PATH}job_industries.csv')\n", " print(f\"โ Job industries: {len(job_industries):,} rows\")\n", "except:\n", " job_industries = None\n", " print(\"โ ๏ธ Job industries not found (optional)\")\n", "\n", "print(\"\\n\" + \"=\" * 70)\n", "print(\"โ All datasets loaded successfully!\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## ๐ Step 8: Merge & Enrich Company Data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐ ENRICHING COMPANY DATA...\n", "================================================================================\n", "\n", "1๏ธโฃ Aggregating industries...\n", "โ Industries aggregated: 24,365 companies\n", "\n", "2๏ธโฃ Aggregating specialties...\n", "โ Specialties aggregated: 17,780 companies\n", "\n", "3๏ธโฃ Aggregating job posting skills...\n", "โ Skills aggregated: 126,807 job postings\n", "\n", "4๏ธโฃ Aggregating job postings...\n", "โ Job data aggregated: 24,474 companies\n", "\n", "5๏ธโฃ Merging all data...\n", "โ Shape: (24473, 17)\n", "\n", "6๏ธโฃ Filling 
nulls...\n", "   ✅ name                           1 → 0\n", "   ✅ description                   297 → 0\n", "   ✅ industries_list               108 → 0\n", "   ✅ specialties_list            6,693 → 0\n", "   ✅ avg_med_salary             22,312 → 0\n", "   ✅ avg_max_salary             15,261 → 0\n", "\n", "7️⃣ Validation...\n", "================================================================================\n", "✅ name                      0 issues\n", "✅ description               0 issues\n", "✅ industries_list           0 issues\n", "✅ specialties_list          0 issues\n", "❌ required_skills           945 issues\n", "✅ posted_job_titles         0 issues\n", "================================================================================\n", "⚠️ ISSUES!\n", "\n", "Total: 24,473\n", "With postings: 24,473\n" ] } ], "source": [
"# ───────────────────────────────────────────────────────────────────\n",
"# CELL 8: Merge & Enrich Company Data + Empty Columns Validation\n",
"# ───────────────────────────────────────────────────────────────────\n",
"\n",
"print(\"🔗 ENRICHING COMPANY DATA...\")\n",
"print(\"=\" * 80)\n",
"\n",
"# ============================================================================\n",
"# STEP 1: Aggregate Industries per Company\n",
"# ============================================================================\n",
"print(\"\\n1️⃣ Aggregating industries...\")\n",
"\n",
"industries_grouped = company_industries.groupby('company_id')['industry'].apply(\n",
"    lambda x: ', '.join(x.dropna().astype(str).unique())\n",
").reset_index()\n",
"industries_grouped.columns = ['company_id', 'industries_list']\n",
"\n",
"print(f\"✅ Industries aggregated: {len(industries_grouped):,} companies\")\n",
"\n",
"# ============================================================================\n",
"# STEP 2: Aggregate Specialties per Company\n",
"# ============================================================================\n",
"print(\"\\n2️⃣ Aggregating specialties...\")\n",
"\n",
"specialties_grouped = company_specialties.groupby('company_id')['speciality'].apply(\n",
"    lambda x: ', '.join(x.dropna().astype(str).unique())\n",
").reset_index()\n",
"specialties_grouped.columns = ['company_id', 'specialties_list']\n",
"\n",
"print(f\"✅ Specialties aggregated: {len(specialties_grouped):,} companies\")\n",
"\n",
"# ============================================================================\n",
"# STEP 3: Aggregate Skills from Job Postings\n",
"# ============================================================================\n",
"print(\"\\n3️⃣ Aggregating job posting skills...\")\n",
"\n",
"if job_skills is not None:\n",
"    skills_df = pd.read_csv(f'{Config.CSV_PATH}skills.csv')\n",
"\n",
"    job_skills_enriched = job_skills.merge(\n",
"        skills_df,\n",
"        on='skill_abr',\n",
"        how='left'\n",
"    )\n",
"\n",
"    skills_per_posting = job_skills_enriched.groupby('job_id')['skill_name'].apply(\n",
"        lambda x: ', '.join(x.dropna().astype(str).unique())\n",
"    ).reset_index()\n",
"    skills_per_posting.columns = ['job_id', 'required_skills']\n",
"\n",
"    print(f\"✅ Skills aggregated: {len(skills_per_posting):,} job postings\")\n",
"else:\n",
"    skills_per_posting = pd.DataFrame(columns=['job_id', 'required_skills'])\n",
"    print(\"⚠️ Job skills not available\")\n",
"\n",
"# ============================================================================\n",
"# STEP 4: Aggregate Job Posting Data per Company\n",
"# ============================================================================\n",
"print(\"\\n4️⃣ Aggregating job postings...\")\n",
"\n",
"postings_enriched = postings.merge(skills_per_posting, on='job_id', how='left')\n",
"\n",
"job_data_grouped = 
postings_enriched.groupby('company_id').agg({\n",
"    'title': lambda x: ', '.join(x.dropna().astype(str).unique()[:10]),  # cap at 10 titles\n",
"    'required_skills': lambda x: ', '.join(x.dropna().astype(str).unique()),\n",
"    'med_salary': 'mean',\n",
"    'max_salary': 'mean',\n",
"    'job_id': 'count'\n",
"}).reset_index()\n",
"\n",
"job_data_grouped.columns = [\n",
"    'company_id', 'posted_job_titles', 'required_skills',\n",
"    'avg_med_salary', 'avg_max_salary', 'total_postings'\n",
"]\n",
"\n",
"print(f\"✅ Job data aggregated: {len(job_data_grouped):,} companies\")\n",
"\n",
"# ============================================================================\n",
"# STEP 5: Merge Everything\n",
"# ============================================================================\n",
"print(\"\\n5️⃣ Merging all data...\")\n",
"\n",
"companies_full = companies_base.copy()\n",
"companies_full = companies_full.merge(industries_grouped, on='company_id', how='left')\n",
"companies_full = companies_full.merge(specialties_grouped, on='company_id', how='left')\n",
"companies_full = companies_full.merge(job_data_grouped, on='company_id', how='left')\n",
"\n",
"print(f\"✅ Shape: {companies_full.shape}\")\n",
"\n",
"# ============================================================================\n",
"# STEP 6: Fill Empty Columns\n",
"# ============================================================================\n",
"print(\"\\n6️⃣ Filling nulls...\")\n",
"\n",
"fill_values = {\n",
"    'name': 'Unknown Company',\n",
"    'description': 'No description',\n",
"    'industries_list': 'General',\n",
"    'specialties_list': 'Not specified',\n",
"    'required_skills': 'Not specified',\n",
"    'posted_job_titles': 'Various',\n",
"    'avg_med_salary': 0,\n",
"    'avg_max_salary': 0,\n",
"    'total_postings': 0\n",
"}\n",
"\n",
"for col, val in fill_values.items():\n",
"    if col in companies_full.columns:\n",
"        before = companies_full[col].isna().sum()\n",
"        companies_full[col] = companies_full[col].fillna(val)\n",
"        if before > 0:\n",
"            print(f\"   ✅ {col:25s} {before:>6,} → 0\")\n",
"\n",
"# ============================================================================\n",
"# STEP 7: Validation\n",
"# ============================================================================\n",
"print(\"\\n7️⃣ Validation...\")\n",
"print(\"=\" * 80)\n",
"\n",
"critical = ['name', 'description', 'industries_list', 'specialties_list',\n",
"            'required_skills', 'posted_job_titles']\n",
"\n",
"ok = True\n",
"for col in critical:\n",
"    if col in companies_full.columns:\n",
"        issues = companies_full[col].isna().sum() + (companies_full[col] == '').sum()\n",
"        print(f\"{'✅' if issues == 0 else '❌'} {col:25s} {issues} issues\")\n",
"        if issues > 0:\n",
"            ok = False\n",
"\n",
"print(\"=\" * 80)\n",
"print(f\"{'🎯 PERFECT!' if ok else '⚠️ ISSUES!'}\")\n",
"print(f\"\\nTotal: {len(companies_full):,}\")\n",
"print(f\"With postings: {(companies_full['total_postings'] > 0).sum():,}\")" ] },
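{ "cell_type": "markdown", "metadata": {}, "source": [ "Because `companies_full` is built from three consecutive left-joins on `company_id`, a duplicated key in any right-hand table would silently fan out rows. A quick hedged sanity check (illustrative, not part of the original pipeline):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Illustrative sanity check: the left-joins must not have fanned out rows\n",
"assert companies_full['company_id'].is_unique, 'duplicate company_id after merges'\n",
"assert len(companies_full) == len(companies_base), 'row count changed by merging'\n",
"print('✅ companies_full is still one row per company')" ] },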
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🔍 FILLING MISSING REQUIRED SKILLS...\n", "================================================================================\n", "✅ Loaded 35 unique skills\n", "🔍 Found 0 companies with missing skills\n", "✅ No missing skills to fill!\n", "\n", "================================================================================\n" ] } ], "source": [
"# ───────────────────────────────────────────────────────────────────\n",
"# CELL 9: Fill Missing Required Skills via Keyword Matching\n",
"# ───────────────────────────────────────────────────────────────────\n",
"\n",
"print(\"🔍 FILLING MISSING REQUIRED SKILLS...\")\n",
"print(\"=\" * 80)\n",
"\n",
"# Load skills reference\n",
"skills_ref = pd.read_csv(f'{Config.CSV_PATH}skills.csv')\n",
"skill_names = set(skills_ref['skill_name'].str.lower().unique())\n",
"\n",
"print(f\"✅ Loaded {len(skill_names):,} unique skills\")\n",
"\n",
"# Find companies with empty required_skills\n",
"empty_mask = (companies_full['required_skills'] == 'Not specified') | \\\n",
"             (companies_full['required_skills'].isna())\n",
"empty_count = empty_mask.sum()\n",
"\n",
"print(f\"🔍 Found {empty_count:,} companies with missing skills\")\n",
"\n",
"if empty_count > 0:\n",
"    print(f\"\\n🔍 Extracting skills from job postings text...\")\n",
"\n",
"    # Get postings for companies with empty skills\n",
"    empty_companies = companies_full[empty_mask]['company_id'].tolist()\n",
"    relevant_postings = postings[postings['company_id'].isin(empty_companies)].copy()\n",
"\n",
"    print(f\"   Processing {len(relevant_postings):,} job postings...\")\n",
"\n",
"    # Extract skills from description (simple substring matching)\n",
"    def extract_skills_from_text(text):\n",
"        if pd.isna(text):\n",
"            return []\n",
"\n",
"        text_lower = str(text).lower()\n",
"        found_skills = []\n",
"\n",
"        for skill in skill_names:\n",
"            if skill in text_lower:\n",
"                found_skills.append(skill)\n",
"\n",
"        return found_skills\n",
"\n",
"    # Extract from description column\n",
"    relevant_postings['extracted_skills'] = relevant_postings['description'].apply(extract_skills_from_text)\n",
"\n",
"    # Aggregate by company\n",
"    skills_extracted = relevant_postings.groupby('company_id')['extracted_skills'].apply(\n",
"        lambda x: ', '.join(set([skill for sublist in x for skill in sublist]))\n",
"    ).reset_index()\n",
"    skills_extracted.columns = ['company_id', 'extracted_skills']\n",
"\n",
"    # Update companies_full\n",
"    for _, row in skills_extracted.iterrows():\n",
"        comp_id = row['company_id']\n",
"        extracted = row['extracted_skills']\n",
"\n",
"        if extracted:  # Only update if we found skills\n",
"            mask = companies_full['company_id'] == comp_id\n",
"            companies_full.loc[mask, 'required_skills'] = extracted\n",
"\n",
"    # Final check\n",
"    still_empty = ((companies_full['required_skills'] == 'Not specified') |\n",
"                   (companies_full['required_skills'].isna())).sum()\n",
"\n",
"    filled = empty_count - still_empty\n",
"\n",
"    print(f\"\\n✅ RESULTS:\")\n",
"    print(f\"   Filled: {filled:,} companies\")\n",
"    print(f\"   Still empty: {still_empty:,} companies\")\n",
"    print(f\"   Success rate: {(filled/empty_count*100):.1f}%\")\n",
"\n",
"else:\n",
"    print(\"✅ No missing skills to fill!\")\n",
"\n",
"print(\"\\n\" + \"=\" * 80)" ] },
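{ "cell_type": "markdown", "metadata": {}, "source": [ "Plain substring matching as used above can over-match short skill names (for example, the skill 'art' would be found inside 'department'). A hedged alternative sketch using word-boundary regexes - illustrative, not what the cell above actually runs:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import re\n",
"\n",
"# Illustrative variant: match whole words/phrases only\n",
"skill_patterns = {\n",
"    skill: re.compile(r'\\b' + re.escape(skill) + r'\\b')\n",
"    for skill in skill_names\n",
"}\n",
"\n",
"def extract_skills_strict(text):\n",
"    if pd.isna(text):\n",
"        return []\n",
"    text_lower = str(text).lower()\n",
"    return [s for s, pat in skill_patterns.items() if pat.search(text_lower)]\n",
"\n",
"print(extract_skills_strict('We need sales and business development people.'))" ] },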
"execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐ VALIDATING JOB POSTING ENRICHMENT...\n", "================================================================================\n", "\n", "๐ COVERAGE:\n", " Total companies: 24,473\n", " With postings: 24,473\n", " Without postings: 0\n", " Coverage: 100.0%\n", "\n", "๐ SAMPLE COMPANIES (random 5):\n", "--------------------------------------------------------------------------------\n", "\n", "๐ข PulsePoint\n", " Total Postings: 5\n", " Industries: Advertising Services...\n", " Required Skills: Product Management, Customer Service, Advertising, Other, Sales, Analyst...\n", " Job Titles: Senior Product Manager, DSP, Account Manager, SaaS/Health Data/Analytics (Signal...\n", "\n", "๐ข UFCW\n", " Total Postings: 2\n", " Industries: Non-profit Organizations...\n", " Required Skills: Information Technology, Design, Art/Creative, Information Technology...\n", " Job Titles: Records Manager, Digital Ads Specialist...\n", "\n", "๐ข Solo Printing, LLC \n", " Total Postings: 1\n", " Industries: Printing Services...\n", " Required Skills: Sales, Business Development...\n", " Job Titles: Junior Account Executive (Sales)...\n", "\n", "๐ข Franklin Street\n", " Total Postings: 2\n", " Industries: Real Estate...\n", " Required Skills: Accounting/Auditing, Finance, Business Development, Sales...\n", " Job Titles: Accounting Intern, Client Services Coordinator...\n", "\n", "๐ข Sonic Automotive\n", " Total Postings: 7\n", " Industries: Motor Vehicle Manufacturing...\n", " Required Skills: Other, Customer Service...\n", " Job Titles: Parts Counterperson - Baytown Ford, Service BDC Associate - Carson Honda, Parts ...\n", "\n", "\n", "๐ ENRICHMENT QUALITY CHECK:\n", "--------------------------------------------------------------------------------\n", "industries_list Filled: 24,365 ( 99.6%) Empty: 108\n", "specialties_list Filled: 17,780 ( 72.7%) Empty: 6,693\n", "required_skills Filled: 24,473 (100.0%) Empty: 0\n", "posted_job_titles Filled: 24,472 (100.0%) Empty: 1\n", "\n", "================================================================================\n", "\n", "๐ฏ CONCLUSION:\n", " โ If 'Filled' percentages are high โ Enrichment working!\n", " โ If 'Empty' counts are high โ Need to fix enrichment\n" ] } ], "source": [ "# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ\n", "# VALIDATION: Check Job Posting Enrichment\n", "# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ\n", "\n", "print(\"๐ VALIDATING JOB POSTING ENRICHMENT...\")\n", "print(\"=\" * 80)\n", "\n", "# Stats\n", "print(f\"\\n๐ COVERAGE:\")\n", "print(f\" Total companies: {len(companies_full):,}\")\n", "print(f\" With postings: {(companies_full['total_postings'] > 0).sum():,}\")\n", "print(f\" Without postings: {(companies_full['total_postings'] == 0).sum():,}\")\n", "print(f\" Coverage: {(companies_full['total_postings'] > 0).sum() / len(companies_full) * 100:.1f}%\")\n", "\n", "# Sample companies\n", "sample = companies_full.sample(5, random_state=42)\n", "\n", "print(\"\\n๐ SAMPLE COMPANIES (random 5):\")\n", "print(\"-\" * 80)\n", "\n", "for idx, row in sample.iterrows():\n", " print(f\"\\n๐ข {row['name']}\")\n", " print(f\" Total Postings: {row['total_postings']}\")\n", " print(f\" Industries: {str(row['industries_list'])[:80]}...\")\n", " print(f\" Required Skills: {str(row['required_skills'])[:80]}...\")\n", " print(f\" Job Titles: {str(row['posted_job_titles'])[:80]}...\")\n", "\n", "# Check if 
enrichment columns exist and are populated\n",
"print(\"\\n\\n📊 ENRICHMENT QUALITY CHECK:\")\n",
"print(\"-\" * 80)\n",
"\n",
"enrichment_cols = ['industries_list', 'specialties_list', 'required_skills', 'posted_job_titles']\n",
"\n",
"for col in enrichment_cols:\n",
"    empty = (companies_full[col] == 'Not specified') | (companies_full[col] == 'Various') | (companies_full[col] == 'General')\n",
"    empty_count = empty.sum()\n",
"    filled_count = len(companies_full) - empty_count\n",
"\n",
"    print(f\"{col:25s} Filled: {filled_count:>6,} ({filled_count/len(companies_full)*100:>5.1f}%)  Empty: {empty_count:>6,}\")\n",
"\n",
"print(\"\\n\" + \"=\" * 80)\n",
"print(\"\\n🎯 CONCLUSION:\")\n",
"print(\"   ✅ If 'Filled' percentages are high → Enrichment working!\")\n",
"print(\"   ❌ If 'Empty' counts are high → Need to fix enrichment\")" ] },
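{ "cell_type": "markdown", "metadata": {}, "source": [ "The same quality check can be packaged as a small reusable summary. A sketch (the `fill_rate` helper and the placeholder set are assumptions, not part of the original pipeline):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# Illustrative: enrichment fill-rate as a tidy summary DataFrame\n",
"placeholders = {'Not specified', 'Various', 'General'}\n",
"\n",
"def fill_rate(df, cols):\n",
"    rows = []\n",
"    for col in cols:\n",
"        empty = df[col].isin(placeholders).sum()\n",
"        rows.append({'column': col, 'filled_pct': 100 * (len(df) - empty) / len(df)})\n",
"    return pd.DataFrame(rows)\n",
"\n",
"print(fill_rate(companies_full, enrichment_cols))" ] },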
| \n", " | company_id | \n", "name | \n", "description | \n", "company_size | \n", "state | \n", "country | \n", "city | \n", "zip_code | \n", "address | \n", "url | \n", "industries_list | \n", "specialties_list | \n", "posted_job_titles | \n", "required_skills | \n", "avg_med_salary | \n", "avg_max_salary | \n", "total_postings | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "1009 | \n", "IBM | \n", "At IBM, we do more than work. We create. We cr... | \n", "7.0 | \n", "NY | \n", "US | \n", "Armonk, New York | \n", "10504 | \n", "International Business Machines Corp. | \n", "https://www.linkedin.com/company/ibm | \n", "IT Services and IT Consulting | \n", "Cloud, Mobile, Cognitive, Security, Research, ... | \n", "Business Sales & Delivery Executive - SAP, Pro... | \n", "Information Technology, Product Management, Ot... | \n", "0.0 | \n", "182095.906250 | \n", "33 | \n", "
| 1 | \n", "1016 | \n", "GE HealthCare | \n", "Every day millions of people feel the impact o... | \n", "7.0 | \n", "0 | \n", "US | \n", "Chicago | \n", "0 | \n", "- | \n", "https://www.linkedin.com/company/gehealthcare | \n", "Hospitals and Health Care | \n", "Healthcare, Biotechnology | \n", "VP of Engineering, Demand Planning Leader - MR... | \n", "Engineering, Information Technology, Other, Pr... | \n", "0.0 | \n", "232626.222222 | \n", "53 | \n", "
| 2 | \n", "1025 | \n", "Hewlett Packard Enterprise | \n", "Official LinkedIn of Hewlett Packard Enterpris... | \n", "7.0 | \n", "Texas | \n", "US | \n", "Houston | \n", "77389 | \n", "1701 E Mossy Oaks Rd Spring | \n", "https://www.linkedin.com/company/hewlett-packa... | \n", "IT Services and IT Consulting | \n", "Not specified | \n", "Federal IT Call Center Technician (TS/SCI, Ful... | \n", "Information Technology, Project Management, In... | \n", "0.0 | \n", "208231.454545 | \n", "14 | \n", "
| 3 | \n", "1028 | \n", "Oracle | \n", "Weโre a cloud technology company that provides... | \n", "7.0 | \n", "Texas | \n", "US | \n", "Austin | \n", "78741 | \n", "2300 Oracle Way | \n", "https://www.linkedin.com/company/oracle | \n", "IT Services and IT Consulting | \n", "enterprise, software, applications, database, ... | \n", "Associate, Corporate Development, Customer Suc... | \n", "Business Development, Sales, Other, Research, ... | \n", "0.0 | \n", "122895.069298 | \n", "93 | \n", "
| 4 | \n", "1033 | \n", "Accenture | \n", "Accenture is a leading global professional ser... | \n", "7.0 | \n", "0 | \n", "IE | \n", "Dublin 2 | \n", "0 | \n", "Grand Canal Harbour | \n", "https://www.linkedin.com/company/accenture | \n", "Business Consulting and Services | \n", "Management Consulting, Systems Integration and... | \n", "Workday Certified Project Manager โ Midwest MU... | \n", "Strategy/Planning, Information Technology, Str... | \n", "0.0 | \n", "216110.266667 | \n", "20 | \n", "
"Category: {category}\n",
"Top Skills:\n",
"{skills}...\n",
"Experience:\n",
"{experience}...\n",
"Industry: {industry}\n",
"Specialties: {specialties}\n",
"Required Skills:\n",
"{required_skills}...\n",
"Total Job Postings: {total_postings}\n",
"