
sbic-method2

An updated version of the Standard-Based Impact Classification (SBIC) method for analyzing CSR reports in accordance with the GRI framework.



Multilabel Classification Step

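This code classifies report text based on precomputed embeddings. It runs a cosine-similarity search against a labeled dataset, aggregates the labels of the nearest neighbors with a K-Nearest Neighbor (KNN) algorithm, and applies a sigmoid activation function to produce multilabel predictions.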

Prerequisites

Ensure you have the following installed before running the script:

  • Python 3.8+
  • Required Python libraries (install using the command below)
pip install numpy pandas torch sentence-transformers scikit-learn
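
After installing, a quick check can confirm that every library is importable. The snippet below is a minimal sketch and not part of the repository; the helper name `missing_packages` is hypothetical. Note that two of the pip package names differ from their import names (`sentence-transformers` imports as `sentence_transformers`, `scikit-learn` as `sklearn`).

```python
import importlib.util

# Map pip package names to the module names used in import statements.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "torch": "torch",
    "sentence-transformers": "sentence_transformers",
    "scikit-learn": "sklearn",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```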

Input Files

Before running the script, make sure you have the following input files in the working directory:

  1. Data Files:

    • df_360k_41lables_05012023.csv
    • german_plc_all_paragraphs_unnested_only.csv
  2. Precomputed Embeddings:

    • Dataset for prediction: embeddings_paragraphs_07012023.pkl
    • Labeled dataset: embeddings_sentences_360k_09012023.pkl
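
Because the script expects all four files in the working directory, it can help to verify they are present before starting a long run. This is a minimal sketch, assuming the file names listed above; the helper `missing_inputs` is hypothetical and not part of the script.

```python
from pathlib import Path

# File names taken from the list above.
INPUT_FILES = [
    "df_360k_41lables_05012023.csv",
    "german_plc_all_paragraphs_unnested_only.csv",
    "embeddings_paragraphs_07012023.pkl",
    "embeddings_sentences_360k_09012023.pkl",
]


def missing_inputs(directory=".", names=INPUT_FILES):
    """Return the expected input files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:")
        for name in missing:
            print("  -", name)
    else:
        print("All input files found.")
```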

Running the Script

Run the script using the following command:

python script.py

Processing Steps

The script follows these main steps:

  1. Load Data & Pretrained Embeddings
  2. Perform Cosine Similarity Search: Finds the most similar labeled sentences for each input using semantic_search from sentence-transformers.
  3. Apply K-Nearest Neighbor (KNN) Algorithm: Selects the top-ranked similar sentences and aggregates their labels into a score per class.
  4. Use Sigmoid Activation for Classification: Passes the aggregated scores through a sigmoid function and applies a threshold to generate the final classification outputs.
  5. Save Results: Exports df_results_0_50k.csv containing the processed data.
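
Steps 2 to 4 can be illustrated with a small, self-contained sketch. It is not the repository's implementation: plain NumPy stands in for `semantic_search`, and the function name `knn_multilabel`, the similarity-weighted vote, the `steepness` parameter, and the default values of `k` and `threshold` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def knn_multilabel(query_emb, corpus_emb, corpus_labels,
                   k=2, threshold=0.5, steepness=10.0):
    """Predict multilabel outputs from the k most similar labeled items.

    query_emb:     (n_queries, dim) embeddings to classify
    corpus_emb:    (n_corpus, dim) embeddings of the labeled dataset
    corpus_labels: (n_corpus, n_labels) binary label matrix
    """
    # Normalize rows so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T  # (n_queries, n_corpus)

    # Indices and similarities of the k nearest labeled items per query.
    top_idx = np.argsort(-sims, axis=1)[:, :k]
    top_sims = np.take_along_axis(sims, top_idx, axis=1)

    # Assumed aggregation: similarity-weighted share of neighbors per label.
    weights = np.clip(top_sims, 0.0, None)
    neighbor_labels = corpus_labels[top_idx]  # (n_queries, k, n_labels)
    votes = (weights[:, :, None] * neighbor_labels).sum(axis=1)
    fraction = votes / (weights.sum(axis=1, keepdims=True) + 1e-12)

    # Assumed activation: sigmoid centered on a 50% vote share, then threshold.
    probs = sigmoid(steepness * (fraction - 0.5))
    preds = (probs >= threshold).astype(int)
    return probs, preds


if __name__ == "__main__":
    corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
    queries = np.array([[1.0, 0.05], [0.05, 1.0]])
    probs, preds = knn_multilabel(queries, corpus, labels)
    print(preds)
```

For a dataset of the size named above, computing the full similarity matrix in one pass may not fit in memory; `semantic_search` processes queries and corpus in chunks, which is one reason to prefer it over the dense matrix product used in this sketch.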

Output File

The processed results will be saved in:

  • df_results_0_50k.csv

Execution Time

Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.
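
A common way to report total processing time is to wrap the main routine with `time.perf_counter`. The snippet below is a minimal sketch of that pattern; `run_timed` is a hypothetical helper, and the script's actual timing code and output format may differ.

```python
import time


def run_timed(fn, *args, **kwargs):
    """Call fn, print the elapsed wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Total processing time: {elapsed:.2f} s")
    return result, elapsed


if __name__ == "__main__":
    # Stand-in workload; replace with the real processing function.
    total, seconds = run_timed(sum, range(1_000_000))
```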


license: gpl-3.0