Dataset Viewer
You can access a random sample of 15,000 rows from my dataset through the data link at the right.
HOMO-LUMO Energy Gap Prediction for Small Organic Molecules
Research Question
The research question of this project is:
Can we predict the HOMO-LUMO energy gap of small organic molecules from molecular composition, structural descriptors, and other numerical molecular features?
The goal of this project is to build machine learning models that can predict the HOMO-LUMO energy gap of small organic molecules.
The HOMO-LUMO gap is the energy difference between the Highest Occupied Molecular Orbital and the Lowest Unoccupied Molecular Orbital. This gap is important because it is related to molecular stability, reactivity, and electronic behavior.
The first task in the project is regression, because the HOMO-LUMO gap is a continuous numerical value. Later, the same target is also converted into a classification task with three classes: Low Gap, Medium Gap, and High Gap.
Dataset Summary
The dataset contains information about small organic molecules. Each row represents one molecule, and each column contains a molecular feature or descriptor.
The target variable is the HOMO-LUMO energy gap.
The features include different types of molecular and chemical information, such as:
- Atom counts, for example `num_H`, `num_C`, `num_N`, `num_O`, and `num_F`
- Molecular composition features
- Structural descriptors
- Spatial descriptors such as `x_range`, `y_range`, `z_range`, and `spatial_range_sum`
- Ratio features such as `hydrogen_ratio`, `carbon_ratio`, and `hetero_atom_ratio`
- Energy-related features such as `u0`, `u298`, `h298`, `g298`, and `cv`
- Other numerical descriptors such as `molecular_weight`, `atom_count`, and `heavy_atom_count`
The main goal was to check whether these molecular features contain enough useful information to predict the HOMO-LUMO gap.
Part 2: Exploratory Data Analysis
In the EDA stage, I explored the data to understand relationships between features and the target. I focused mainly on numeric variables because the target is numeric and most molecular descriptors were numerical.
The EDA included:
- Summary statistics
- Correlation analysis
- Heatmap of numeric feature correlations
- Feature correlation with the HOMO-LUMO gap
- Outlier checks
- General understanding of which features may help prediction
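The correlation analysis above can be sketched in a few lines of pandas. This is a minimal illustration on a synthetic stand-in frame, not the project's actual code; the column names (`num_H`, `atom_count`, `u0`, `homo_lumo_gap`) and the linear relationship are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the real dataset;
# column names mirror descriptors mentioned in this report.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "num_H": rng.integers(0, 20, 200),
    "atom_count": rng.integers(3, 29, 200),
    "u0": rng.normal(-400.0, 40.0, 200),
})
# Fabricated target so the example has a correlation structure to find
df["homo_lumo_gap"] = 0.1 * df["num_H"] - 0.02 * df["atom_count"] + rng.normal(0, 0.5, 200)

# Correlation of every numeric feature with the target, sorted by strength
corr_with_target = (
    df.corr()["homo_lumo_gap"]
      .drop("homo_lumo_gap")
      .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations near the top, which matters when ranking candidate features.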
Correlation Heatmap of Numeric Features
This heatmap shows the correlation between numeric features in the dataset. Some features are strongly correlated with each other, especially molecular size, atom count, and energy-related features. This helped identify relationships between variables and possible data leakage risks.
Correlation of Features With HOMO-LUMO Gap
This graph shows which numeric features are most correlated with the HOMO-LUMO gap. Some important features were `r2`, `hydrogen_ratio`, `num_H`, `u0`, `atom_count`, `num_C`, and `spatial_range_sum`. This showed that the target is connected to both molecular composition and molecular structure.
Part 3: Baseline Regression Model
After the EDA, I trained a baseline model.
The baseline model was Linear Regression.
The purpose of the baseline was not to get the best possible result, but to create a simple first model that later models could be compared against.
The data was split into training and testing sets using a fixed random seed. The model was evaluated using standard regression metrics:
- MAE
- MSE
- RMSE
- R²
These metrics helped measure the size of the prediction errors and how well the model explained the variation in the HOMO-LUMO gap.
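A baseline of this shape can be sketched with scikit-learn. The features and target below are synthetic placeholders (the real descriptors live in the dataset), but the split with a fixed seed and the four metrics match the procedure described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the molecular descriptors and the gap target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(0, 0.2, 1000)

# Fixed random seed so the split is reproducible, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)          # RMSE is just the square root of MSE
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```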
At first, the Linear Regression model gave results that were almost perfect. This was suspicious and later led to a deeper check for data leakage.
Part 4: Feature Engineering
Feature engineering was used to improve the dataset before training stronger models.
The main feature engineering steps were:
Scaling
Numeric features were scaled so they would be on a similar scale. This is useful for PCA, clustering, and linear models.
Polynomial Features
Polynomial features were created to help capture non-linear relationships between molecular descriptors.
PCA Features
PCA was used to reduce dimensionality and extract major patterns from the numeric features.
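The three steps above (scaling, polynomial expansion, PCA) chain naturally into one scikit-learn pipeline. This is a generic sketch under assumed sizes (6 input features, 10 PCA components), not the project's exact configuration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))  # stand-in for the numeric descriptors

# Scale -> expand with degree-2 terms -> compress with PCA
feature_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("pca", PCA(n_components=10, random_state=0)),
])
X_new = feature_pipe.fit_transform(X)
print(X_new.shape)  # (500, 10)
```

Fitting the whole pipeline on the training split only (and calling `transform` on the test split) is what keeps the scaler and PCA from peeking at test data.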
K-Means Clustering
K-Means clustering was used as an unsupervised learning method. The goal was to group molecules with similar molecular feature patterns.
The cluster results were then used to create new features, such as cluster ID and distance to the cluster centroid.
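The two cluster-derived features can be computed like this. The data is a synthetic stand-in; the number of clusters (3) is an assumption for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))  # stand-in for the molecular descriptors

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Two engineered features: the cluster ID and the distance
# from each molecule to its assigned cluster centroid
cluster_id = km.labels_
dist_to_centroid = np.linalg.norm(
    X_scaled - km.cluster_centers_[cluster_id], axis=1
)
print(cluster_id[:5], dist_to_centroid[:5].round(3))
```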
K-Means Clusters Visualized With PCA
This graph shows the K-Means clusters after reducing the data to two PCA components. The clusters show that molecules can be grouped based on similar molecular descriptor patterns. This was useful because the cluster information could be added as new features for the models.
Data Leakage Detection and Fix
One of the most important parts of the project was identifying and fixing data leakage.
At first, the Linear Regression model showed almost perfect performance. This did not look realistic, so I checked the features more carefully.
The problem was that some molecular features can be directly or indirectly related to the target. Since the target is the HOMO-LUMO gap, features such as HOMO, LUMO, and gap-related variables can create leakage.
For example:
HOMO-LUMO gap = LUMO - HOMO
If the model receives both HOMO and LUMO as input features, it can almost calculate the gap directly. In that case, the model is not really learning a useful prediction rule. It is just using leaked information.
To fix this issue:
- Suspicious target-related features were removed
- Highly correlated features were checked
- PCA and clustering features were rebuilt only from safe features
- The old leaky baseline was removed from the final comparison
After this correction, the model results became more realistic and reliable.
This was an important lesson from the project: strong model results are not always good if they come from data leakage.
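The fix can be illustrated with a toy frame. The column names (`homo`, `lumo`, `num_C`, `gap`) and the leaky-column list are hypothetical here; the project's actual list of removed features may differ.

```python
import numpy as np
import pandas as pd

# Toy frame: if homo and lumo are both inputs, the model can
# reconstruct the target exactly (gap = lumo - homo) — leakage.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "homo": rng.normal(-6.0, 1.0, 100),
    "lumo": rng.normal(-1.0, 1.0, 100),
    "num_C": rng.integers(1, 9, 100),
})
df["gap"] = df["lumo"] - df["homo"]

# Hypothetical list of target-derived columns to drop before training
leaky_cols = ["homo", "lumo"]
X_safe = df.drop(columns=leaky_cols + ["gap"])
print(X_safe.columns.tolist())  # ['num_C']
```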
Part 5: Improved Regression Models
After cleaning the data and rebuilding the engineered features, I trained three improved regression models:
- Clean Linear Regression with engineered features
- Clean Random Forest Regressor
- Clean Gradient Boosting Regressor
The models were evaluated using MAE, MSE, RMSE, and R².
Final Clean Regression Results
| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Clean Random Forest Regressor | 0.508773 | 0.414282 | 0.643647 | 0.800866 |
| Clean Linear Regression - Engineered Features | 0.522288 | 0.434499 | 0.659166 | 0.791148 |
| Clean Gradient Boosting Regressor | 0.531217 | 0.453610 | 0.673506 | 0.781962 |
The best regression model was the Clean Random Forest Regressor.
It achieved the lowest RMSE and the highest R². This means it had the smallest prediction error and explained about 80% of the variation in the HOMO-LUMO gap.
Actual vs Predicted — Best Clean Model
This plot compares the true HOMO-LUMO gap values with the predicted values from the best clean model. Most points are close to the diagonal line, which means the model predicts the gap reasonably well.
Winning Regression Model: Actual vs Predicted
This graph also shows actual values compared to predicted values for the winning regression model. The predictions follow the real values closely in the main range of the data, although there are still some errors and outliers.
Part 7: Regression-to-Classification
After finishing the regression task, I reframed the problem as a classification task.
Instead of predicting the exact HOMO-LUMO gap value, I divided the target into three classes:
| Class | Meaning |
|---|---|
| 0 | Low Gap |
| 1 | Medium Gap |
| 2 | High Gap |
I used quantile binning with three classes:
- Bottom 33% = Low Gap
- Middle 33% = Medium Gap
- Top 33% = High Gap
I chose this method because it gives almost the same amount of data in each class. This helps prevent the classifier from focusing only on one large class.
The class balance check showed that the classes were almost evenly distributed, which is good for classification.
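Equal-frequency binning like this is a one-liner with `pandas.qcut`. The gap values below are synthetic; with a continuous target, the three bins come out essentially equal in size.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
gap = pd.Series(rng.normal(5.0, 1.5, 900))  # stand-in for the gap column

# Three equal-frequency bins: 0 = Low, 1 = Medium, 2 = High
gap_class = pd.qcut(gap, q=3, labels=[0, 1, 2])
print(gap_class.value_counts().sort_index())
```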
Part 8: Classification Models
For the classification task, I trained three models:
- Logistic Regression Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
The models were evaluated using:
- Accuracy
- Precision
- Recall
- F1-score
- Classification report
- Confusion matrix
For this task, I focused mainly on F1 Macro, because it treats all three classes equally and balances precision and recall.
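The evaluation loop can be sketched as below. The three-class data is synthetic (a stand-in for Low/Medium/High gap), and the hyperparameters are illustrative, not the project's tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Synthetic 3-class problem standing in for Low/Medium/High gap classes
rng = np.random.default_rng(5)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 0] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# average="macro" weights all three classes equally,
# which is why F1 Macro is the headline metric here
print("accuracy:", round(accuracy_score(y_test, pred), 3))
print("f1 macro:", round(f1_score(y_test, pred, average="macro"), 3))
print(confusion_matrix(y_test, pred))
```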
The best classification model was the Logistic Regression Classifier.
| Metric | Value |
|---|---|
| Accuracy | 0.712394 |
| Precision Macro | 0.717738 |
| Recall Macro | 0.713523 |
| F1 Macro | 0.715368 |
Classification Confusion Matrix
The confusion matrix shows how the winning classification model performed on Low, Medium, and High Gap classes.
The model performs best on Low Gap and High Gap molecules. Most errors happen between neighboring classes, such as Low vs Medium or Medium vs High. This makes sense because molecules near the class boundaries can be harder to classify.
Main Conclusions
This project showed that machine learning can predict the HOMO-LUMO energy gap of small organic molecules with meaningful accuracy.
The main conclusions are:
- Molecular composition and structural descriptors are useful for predicting the HOMO-LUMO gap.
- EDA showed that several numeric features are strongly related to the target.
- Feature engineering improved the representation of the data.
- Data leakage was a serious issue and had to be fixed before trusting the results.
- The best regression model was the Clean Random Forest Regressor.
- The best classification model was the Logistic Regression Classifier.
- The regression model reached an R² of about 0.80, which shows strong predictive performance after leakage cleaning.
License
This project is released under the Apache License 2.0.