Dataset Viewer
You can access a random sample of 15,000 rows from my dataset through the data link at the right.
HOMO-LUMO Energy Gap Prediction for Small Organic Molecules
Research Question
The research question of this project is:
Can we predict the HOMO-LUMO energy gap of small organic molecules from molecular composition, structural descriptors, and other numerical molecular features?
The goal of this project is to build machine learning models that can predict the HOMO-LUMO energy gap of small organic molecules.
The HOMO-LUMO gap is the energy difference between the Highest Occupied Molecular Orbital and the Lowest Unoccupied Molecular Orbital. This gap is important because it is related to molecular stability, reactivity, and electronic behavior.
The first task in the project is regression, because the HOMO-LUMO gap is a continuous numerical value. Later, the same target is also converted into a classification task with three classes: Low Gap, Medium Gap, and High Gap.
Dataset Summary
The dataset contains information about small organic molecules. Each row represents one molecule, and each column contains a molecular feature or descriptor.
The target variable is the HOMO-LUMO energy gap.
The features include different types of molecular and chemical information, such as:
- Atom counts, for example `num_H`, `num_C`, `num_N`, `num_O`, and `num_F`
- Molecular composition features
- Structural descriptors
- Spatial descriptors such as `x_range`, `y_range`, `z_range`, and `spatial_range_sum`
- Ratio features such as `hydrogen_ratio`, `carbon_ratio`, and `hetero_atom_ratio`
- Energy-related features such as `u0`, `u298`, `h298`, `g298`, and `cv`
- Other numerical descriptors such as `molecular_weight`, `atom_count`, and `heavy_atom_count`
The main goal was to check whether these molecular features contain enough useful information to predict the HOMO-LUMO gap.
Part 2: Exploratory Data Analysis
In the EDA stage, I explored the data to understand relationships between features and the target. I focused mainly on numeric variables because the target is numeric and most molecular descriptors were numerical.
The EDA included:
- Summary statistics
- Correlation analysis
- Heatmap of numeric feature correlations
- Feature correlation with the HOMO-LUMO gap
- Outlier checks
- General understanding of which features may help prediction
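The correlation analysis above can be sketched in a few lines of pandas. This is a minimal illustration on a synthetic stand-in frame, not the project's actual code; the column names (`num_H`, `atom_count`, `u0`, `homo_lumo_gap`) and the linear relationship are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the real dataset;
# column names mirror descriptors mentioned in this report.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "num_H": rng.integers(0, 20, 200),
    "atom_count": rng.integers(3, 29, 200),
    "u0": rng.normal(-400.0, 40.0, 200),
})
# Fabricated target so the example has a correlation structure to find
df["homo_lumo_gap"] = 0.1 * df["num_H"] - 0.02 * df["atom_count"] + rng.normal(0, 0.5, 200)

# Correlation of every numeric feature with the target, sorted by strength
corr_with_target = (
    df.corr()["homo_lumo_gap"]
      .drop("homo_lumo_gap")
      .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations near the top, which matters when ranking candidate features.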
Correlation Heatmap of Numeric Features
This heatmap shows the correlation between numeric features in the dataset. Some features are strongly correlated with each other, especially molecular size, atom count, and energy-related features. This helped identify relationships between variables and possible data leakage risks.
Correlation of Features With HOMO-LUMO Gap
This graph shows which numeric features are most correlated with the HOMO-LUMO gap. Some important features were `r2`, `hydrogen_ratio`, `num_H`, `u0`, `atom_count`, `num_C`, and `spatial_range_sum`. This showed that the target is connected to both molecular composition and molecular structure.
Part 3: Baseline Regression Model
After the EDA, I trained a baseline model.
The baseline model was Linear Regression.
The purpose of the baseline was not to get the best possible result, but to create a simple first model that later models could be compared against.
The data was split into training and testing sets using a fixed random seed. The model was evaluated using standard regression metrics:
- MAE
- MSE
- RMSE
- R²
These metrics helped measure the size of the prediction errors and how well the model explained the variation in the HOMO-LUMO gap.
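A baseline of this shape can be sketched with scikit-learn. The features and target below are synthetic placeholders (the real descriptors live in the dataset), but the split with a fixed seed and the four metrics match the procedure described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the molecular descriptors and the gap target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(0, 0.2, 1000)

# Fixed random seed so the split is reproducible, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)          # RMSE is just the square root of MSE
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```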
At first, the Linear Regression model gave results that were almost perfect. This was suspicious and later led to a deeper check for data leakage.
Part 4: Feature Engineering
Feature engineering was used to improve the dataset before training stronger models.
The main feature engineering steps were:
Scaling
Numeric features were scaled so they would be on a similar scale. This is useful for PCA, clustering, and linear models.
Polynomial Features
Polynomial features were created to help capture non-linear relationships between molecular descriptors.
PCA Features
PCA was used to reduce dimensionality and extract major patterns from the numeric features.
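The three steps above (scaling, polynomial expansion, PCA) chain naturally into one scikit-learn pipeline. This is a generic sketch under assumed sizes (6 input features, 10 PCA components), not the project's exact configuration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))  # stand-in for the numeric descriptors

# Scale -> expand with degree-2 terms -> compress with PCA
feature_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("pca", PCA(n_components=10, random_state=0)),
])
X_new = feature_pipe.fit_transform(X)
print(X_new.shape)  # (500, 10)
```

Fitting the whole pipeline on the training split only (and calling `transform` on the test split) is what keeps the scaler and PCA from peeking at test data.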
K-Means Clustering
K-Means clustering was used as an unsupervised learning method. The goal was to group molecules with similar molecular feature patterns.
The cluster results were then used to create new features, such as cluster ID and distance to the cluster centroid.
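The two cluster-derived features can be computed like this. The data is a synthetic stand-in; the number of clusters (3) is an assumption for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))  # stand-in for the molecular descriptors

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Two engineered features: the cluster ID and the distance
# from each molecule to its assigned cluster centroid
cluster_id = km.labels_
dist_to_centroid = np.linalg.norm(
    X_scaled - km.cluster_centers_[cluster_id], axis=1
)
print(cluster_id[:5], dist_to_centroid[:5].round(3))
```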
K-Means Clusters Visualized With PCA
This graph shows the K-Means clusters after reducing the data to two PCA components. The clusters show that molecules can be grouped based on similar molecular descriptor patterns. This was useful because the cluster information could be added as new features for the models.
Data Leakage Detection and Fix
One of the most important parts of the project was identifying and fixing data leakage.
At first, the Linear Regression model showed almost perfect performance. This did not look realistic, so I checked the features more carefully.
The problem was that some molecular features can be directly or indirectly related to the target. Since the target is the HOMO-LUMO gap, features such as HOMO, LUMO, and gap-related variables can create leakage.
For example:
HOMO-LUMO gap = LUMO - HOMO
If the model receives both HOMO and LUMO as input features, it can almost calculate the gap directly. In that case, the model is not really learning a useful prediction rule. It is just using leaked information.
To fix this issue:
- Suspicious target-related features were removed
- Highly correlated features were checked
- PCA and clustering features were rebuilt only from safe features
- The old leaky baseline was removed from the final comparison
After this correction, the model results became more realistic and reliable.
This was an important lesson from the project: strong model results are not always good if they come from data leakage.
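The fix can be illustrated with a toy frame. The column names (`homo`, `lumo`, `num_C`, `gap`) and the leaky-column list are hypothetical here; the project's actual list of removed features may differ.

```python
import numpy as np
import pandas as pd

# Toy frame: if homo and lumo are both inputs, the model can
# reconstruct the target exactly (gap = lumo - homo) — leakage.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "homo": rng.normal(-6.0, 1.0, 100),
    "lumo": rng.normal(-1.0, 1.0, 100),
    "num_C": rng.integers(1, 9, 100),
})
df["gap"] = df["lumo"] - df["homo"]

# Hypothetical list of target-derived columns to drop before training
leaky_cols = ["homo", "lumo"]
X_safe = df.drop(columns=leaky_cols + ["gap"])
print(X_safe.columns.tolist())  # ['num_C']
```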
Part 5: Improved Regression Models
After cleaning the data and rebuilding the engineered features, I trained three improved regression models:
- Clean Linear Regression with engineered features
- Clean Random Forest Regressor
- Clean Gradient Boosting Regressor
The models were evaluated using MAE, MSE, RMSE, and R².
Final Clean Regression Results
| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Clean Random Forest Regressor | 0.508773 | 0.414282 | 0.643647 | 0.800866 |
| Clean Linear Regression - Engineered Features | 0.522288 | 0.434499 | 0.659166 | 0.791148 |
| Clean Gradient Boosting Regressor | 0.531217 | 0.453610 | 0.673506 | 0.781962 |
The best regression model was the Clean Random Forest Regressor.
It achieved the lowest RMSE and the highest R². This means it had the smallest prediction error and explained about 80% of the variation in the HOMO-LUMO gap.
Actual vs Predicted — Best Clean Model
This plot compares the true HOMO-LUMO gap values with the predicted values from the best clean model. Most points are close to the diagonal line, which means the model predicts the gap reasonably well.
Winning Regression Model: Actual vs Predicted
This graph also shows actual values compared to predicted values for the winning regression model. The predictions follow the real values closely in the main range of the data, although there are still some errors and outliers.
Part 7: Regression-to-Classification
After finishing the regression task, I reframed the problem as a classification task.
Instead of predicting the exact HOMO-LUMO gap value, I divided the target into three classes:
| Class | Meaning |
|---|---|
| 0 | Low Gap |
| 1 | Medium Gap |
| 2 | High Gap |
I used quantile binning with three classes:
- Bottom 33% = Low Gap
- Middle 33% = Medium Gap
- Top 33% = High Gap
I chose this method because it gives almost the same amount of data in each class. This helps prevent the classifier from focusing only on one large class.
The class balance check showed that the classes were almost evenly distributed, which is good for classification.
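Equal-frequency binning like this is a one-liner with `pandas.qcut`. The gap values below are synthetic; with a continuous target, the three bins come out essentially equal in size.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
gap = pd.Series(rng.normal(5.0, 1.5, 900))  # stand-in for the gap column

# Three equal-frequency bins: 0 = Low, 1 = Medium, 2 = High
gap_class = pd.qcut(gap, q=3, labels=[0, 1, 2])
print(gap_class.value_counts().sort_index())
```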
Part 8: Classification Models
For the classification task, I trained three models:
- Logistic Regression Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
The models were evaluated using:
- Accuracy
- Precision
- Recall
- F1-score
- Classification report
- Confusion matrix
For this task, I focused mainly on F1 Macro, because it treats all three classes equally and balances precision and recall.
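The evaluation loop can be sketched as below. The three-class data is synthetic (a stand-in for Low/Medium/High gap), and the hyperparameters are illustrative, not the project's tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Synthetic 3-class problem standing in for Low/Medium/High gap classes
rng = np.random.default_rng(5)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 0] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# average="macro" weights all three classes equally,
# which is why F1 Macro is the headline metric here
print("accuracy:", round(accuracy_score(y_test, pred), 3))
print("f1 macro:", round(f1_score(y_test, pred, average="macro"), 3))
print(confusion_matrix(y_test, pred))
```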
The best classification model was the Logistic Regression Classifier.
| Metric | Value |
|---|---|
| Accuracy | 0.712394 |
| Precision Macro | 0.717738 |
| Recall Macro | 0.713523 |
| F1 Macro | 0.715368 |
Classification Confusion Matrix
The confusion matrix shows how the winning classification model performed on Low, Medium, and High Gap classes.
The model performs best on Low Gap and High Gap molecules. Most errors happen between neighboring classes, such as Low vs Medium or Medium vs High. This makes sense because molecules near the class boundaries can be harder to classify.
Main Conclusions
This project showed that machine learning can predict the HOMO-LUMO energy gap of small organic molecules with meaningful accuracy.
The main conclusions are:
- Molecular composition and structural descriptors are useful for predicting the HOMO-LUMO gap.
- EDA showed that several numeric features are strongly related to the target.
- Feature engineering improved the representation of the data.
- Data leakage was a serious issue and had to be fixed before trusting the results.
- The best regression model was the Clean Random Forest Regressor.
- The best classification model was the Logistic Regression Classifier.
- The regression model reached an R² of about 0.80, which shows strong predictive performance after leakage cleaning.
License
This project is released under the Apache License 2.0.