sbic-method2

An updated version of the Standard-Based Impact Classification (SBIC) method for analyzing CSR reports in accordance with the GRI framework.


Multilabel Classification Step

This code classifies reports based on their embeddings. It runs a cosine-similarity search, applies the K-Nearest Neighbor (KNN) algorithm to the most similar items, and uses a sigmoid activation function to produce the final labels.

Prerequisites

Ensure you have the following installed before running the script:

Python 3.8+
Required Python libraries (install using the command below)

pip install numpy pandas torch sentence-transformers scikit-learn

Input Files

Before running the script, make sure you have the following input files in the working directory:

Data Files:
- df_360k_41lables_05012023.csv
- german_plc_all_paragraphs_unnested_only.csv
Precomputed Embeddings:
- dataset for prediction: embeddings_paragraphs_07012023.pkl
- labeled dataset: embeddings_sentences_360k_09012023.pkl
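
Before launching a long run, it can help to confirm that all four input files are present. The following is a minimal sketch, not part of the script itself: the helper name find_missing_inputs is hypothetical, and it assumes the files sit directly in the working directory under the names listed above.

```python
from pathlib import Path

# File names as listed in the "Input Files" section above.
REQUIRED_INPUTS = [
    "df_360k_41lables_05012023.csv",
    "german_plc_all_paragraphs_unnested_only.csv",
    "embeddings_paragraphs_07012023.pkl",
    "embeddings_sentences_360k_09012023.pkl",
]


def find_missing_inputs(directory="."):
    """Return the required input files that are not present in `directory`.

    Hypothetical helper for illustration; the script may do no such check.
    """
    base = Path(directory)
    return [name for name in REQUIRED_INPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = find_missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```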

Running the Script

Run the script using the following command:

python script.py

Processing Steps

The script follows these main steps:

Load Data & Pretrained Embeddings
Perform Cosine Similarity Search: Finds the most relevant reports (sentences) using semantic_search from sentence-transformers.
Apply K-Nearest Neighbor (KNN) Algorithm: Selects the top-k most similar reports (sentences) and aggregates predictions.
Use Sigmoid Activation for Classification: Applies a sigmoid to the aggregated scores, then a threshold, to generate the final classification outputs.
Save Results: Exports df_results_0_50k.csv containing the processed data.
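
The middle three steps can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: it computes cosine similarity with plain NumPy instead of semantic_search from sentence-transformers, and the function name knn_sigmoid_classify, the values of k, scale, and threshold, and the similarity-weighted voting rule are all assumptions.

```python
import numpy as np


def knn_sigmoid_classify(query_emb, labeled_emb, labels,
                         k=5, scale=10.0, threshold=0.5):
    """Illustrative KNN multilabel classifier over precomputed embeddings.

    query_emb:   (n_queries, dim) embeddings to classify
    labeled_emb: (n_labeled, dim) embeddings with known labels
    labels:      (n_labeled, n_labels) binary label matrix
    k, scale and threshold are assumed values, not the script's settings.
    """
    query_emb = np.asarray(query_emb, dtype=float)
    labeled_emb = np.asarray(labeled_emb, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Step 1: cosine similarity = dot product of L2-normalised vectors.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = labeled_emb / np.linalg.norm(labeled_emb, axis=1, keepdims=True)
    sims = q @ c.T  # shape (n_queries, n_labeled)

    # Step 2: keep the k most similar labeled items for each query.
    k = min(k, labeled_emb.shape[0])
    top_idx = np.argsort(-sims, axis=1)[:, :k]
    top_sims = np.take_along_axis(sims, top_idx, axis=1)

    # Aggregate neighbour labels with a similarity-weighted vote (assumed rule).
    weights = np.clip(top_sims, 0.0, None)[..., None]  # (n_queries, k, 1)
    scores = (weights * labels[top_idx]).sum(axis=1) / (weights.sum(axis=1) + 1e-12)

    # Step 3: sigmoid centred on a 50% vote share, then a threshold.
    probs = 1.0 / (1.0 + np.exp(-scale * (scores - 0.5)))
    preds = (probs >= threshold).astype(int)
    return probs, preds
```

With the real data, query_emb would correspond to the paragraph embeddings and labeled_emb to the labeled sentence embeddings. The full 360k-row similarity matrix may not fit in memory at once, so processing the queries in chunks would likely be needed.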

Output File

The processed results will be saved in:

df_results_0_50k.csv

Execution Time

Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.
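
One way to report total processing time is to wrap the main routine in a small timer. This is a sketch only: run_timed is a hypothetical helper, and the script may measure time differently.

```python
import time


def run_timed(func, *args, **kwargs):
    """Call `func`, print the elapsed wall-clock time, and return both."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Total processing time: {elapsed:.2f} s")
    return result, elapsed
```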
