Labeled datasets for the Standard Based Impact Classification method, extracted from approx. 230 CSR GRI reports (150 international companies, 2017-2021).

labels_ipnit_paragraphs 22k records
labels_ipnit_sentences 75k records
labels_iporit_paragraphs 57k records
labels_iporit_sentences 193k records

Automatic labeling

ipnit stands for the "index-page AND in-text" criterion of label identification; iporit stands for the "index-page OR in-text" criterion. index-page means the algorithm searches for an index page within the PDF file and extracts page numbers from it. in-text means the algorithm searches for the label using a regex on each page of the report. ipnit is the stricter condition: text counts as labeled only when both methods return true. iporit is the more relaxed version: text counts as labeled when either of the two returns true. No significant increase in the accuracy of the prediction model was observed when using ipnit compared to iporit.
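The difference between the two criteria can be illustrated with a minimal sketch. This is not the code used to build the datasets: the function names, the `index_pages` mapping, the regex, and the example label `305-1` are all assumptions made for illustration.

```python
import re


def index_page_match(index_pages: dict, label: str, page_number: int) -> bool:
    """Hypothetical index-page check: the report's index lists this page for the label."""
    return page_number in index_pages.get(label, set())


def in_text_match(page_text: str, label: str) -> bool:
    """Hypothetical in-text check: the label appears on the page as a standalone token."""
    # The lookarounds keep "305-1" from matching inside "305-10" or "1305-1".
    pattern = r"(?<![\w-])" + re.escape(label) + r"(?![\w-])"
    return re.search(pattern, page_text) is not None


def is_labeled(index_page_hit: bool, in_text_hit: bool, criterion: str) -> bool:
    """Combine the two signals under the ipnit (AND) or iporit (OR) criterion."""
    if criterion == "ipnit":
        return index_page_hit and in_text_hit
    if criterion == "iporit":
        return index_page_hit or in_text_hit
    raise ValueError(f"unknown criterion: {criterion!r}")


# Assumed example: the index points to pages 41 and 42 for label "305-1",
# but the label string itself is not printed on page 42.
index_pages = {"305-1": {41, 42}}
page_text = "Direct emissions decreased compared to the previous year."

ip_hit = index_page_match(index_pages, "305-1", 42)
it_hit = in_text_match(page_text, "305-1")

print(is_labeled(ip_hit, it_hit, "ipnit"))
print(is_labeled(ip_hit, it_hit, "iporit"))
```

In this assumed case the page is labeled under iporit but not under ipnit, which is consistent with the iporit datasets containing more records than the ipnit ones.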