hynky HF Staff committed
Commit 16b5e40 · verified · 1 Parent(s): 6944a58

Update README.md

Files changed (1):
  1. README.md +37 -40

README.md CHANGED
@@ -4,17 +4,16 @@ language:
  - en
  license: apache-2.0
  datasets:
- - HuggingFaceFW/finepdfs_fw_edu_labeled
  ---

- # FinePDFs-Edu classifier (English)

  ## Model summary
- This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 0 [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_fw_edu_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for web samples from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.

- We used this classifier to build [FinePDFs-Edu](https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu) dataset.
  ### How to use in transformers
- To load the FinePDFs-Edu classifier, use the following code:

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
@@ -22,8 +21,8 @@ import re
  CHUNK_SIZE = 2048 - 2
  MAX_CHARS = 10_000

- tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_English")
- model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_English")
  regex_whitespace = re.compile(r'\s')

  def create_text_chunks(text: str, tokenizer):
@@ -83,39 +82,37 @@ print(max(scores))
  ```

  ## Training
- The classifier was trained on 7740960 pairs of web samples and their scores from 0 to 5, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated based on their educational quality with 0 being not educational and 5 being highly educational.

  Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:
  ```
- Below is an extract from a PDF file. Evaluate whether the extract has a high educational
- value and could be useful in an educational setting for teaching from primary school to
- grade school levels using the additive 5-point scoring system described below. Points are
- accumulated based on the satisfaction of each criterion:
- - Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and
- promotional material.
- - Add another point if the extract addresses certain elements pertinent to education but
- does not align closely with educational standards. It might mix educational content with
- non-educational material, offering a superficial overview of potentially useful topics, or
- presenting information in a disorganized manner and incoherent writing style.
- - Award a third point if the extract is appropriate for educational use and introduces key
- concepts relevant to school curricula. It is coherent though it may not be comprehensive
- or could include some extraneous information. It may resemble an introductory section of
- a textbook or a basic tutorial that is suitable for learning but has notable limitations like
- treating concepts that are too complex for grade school students.
- - Grant a fourth point if the extract highly relevant and beneficial for educational purposes
- for a level not higher than grade school, exhibiting a clear and consistent writing style. It
- could be similar to a chapter from a textbook or a tutorial, offering substantial educational
- content, including exercises and solutions, with minimal irrelevant information, and the
- concepts aren’t too advanced for grade school students. The content is coherent, focused,
- and valuable for structured learning.
- - Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for
- teaching either at primary school or grade school. It follows detailed reasoning, the writing
- style is easy to follow and offers profound and thorough insights into the subject matter,
- devoid of any non-educational or complex content.
- The extract: {example}.
  After examining the extract:
- - Briefly justify your total score, up to 100 words.
- - Conclude with the score using the format: "Educational score: <total points>"\
  ```

  We added a classification head with a single regression output to answerdotai/ModernBERT-large, unfroze the last 4 layers, and trained the model for 5000 steps with a learning rate of 3e-4.
@@ -146,7 +143,7 @@ Validation Report:

  **Confusion matrix**

- We verify that the predicted educational scores are indeed close to their ground truth, and are mostry impacted by the noisy annotation.
  ```
  Confusion Matrix:
  | class | 0 | 1 | 2 | 3 | 4 | 5 |
@@ -161,10 +158,10 @@ Confusion Matrix:


  ## Limitations
- While the FinePDFs-Edu classifier performs well in distinguishing high-quality educational content for FinePDFs dataset, there are some limitations:

- - Scope: The model's performance might change for other datasets, in particular for out of distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
- - Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using int_score >= 1.35 (top 10% for english) as a threshold for data curation.
  - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.

  The training and inference code is available on GitHub
 
  - en
  license: apache-2.0
  datasets:
+ - HuggingFaceFW/finepdfs_eng_Latn_labeled
  ---

+ # FinePDFs-DCLM classifier (English)

  ## Model summary
+ This is a classifier for judging the instructional/Q&A value of web pages. It was developed to filter and curate instructional and Q&A content from web datasets and was trained on 1,304,547 [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_eng_Latn_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for web samples from the [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.

  ### How to use in transformers
+ To load the FinePDFs-DCLM classifier, use the following code:

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  CHUNK_SIZE = 2048 - 2
  MAX_CHARS = 10_000

+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_dclm_classifier_English")
+ model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_dclm_classifier_English")
  regex_whitespace = re.compile(r'\s')

  def create_text_chunks(text: str, tokenizer):

  ```
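The diff elides the body of this usage snippet (the `create_text_chunks` helper and the `max(scores)` aggregation over chunks). As a rough sketch of the core scoring call it builds up to, reusing the `tokenizer`, `model`, `CHUNK_SIZE`, and `MAX_CHARS` defined above rather than reproducing the repository's exact code, scoring a single chunk could look like this:

```python
import torch

# Hypothetical single-chunk scoring; the full snippet splits long documents into
# chunks and keeps the maximum score across them.
text = "Q: How do I install the package? A: First create a virtual environment, then run pip install ..."
inputs = tokenizer(text[:MAX_CHARS], return_tensors="pt", truncation=True, max_length=CHUNK_SIZE)
with torch.no_grad():
    # Single regression output: the raw logit is the predicted 0-5 instruction/Q&A score.
    score = model(**inputs).logits.squeeze(-1).item()
print(f"Instruction/Q&A score: {score:.2f}")
```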
 
  ## Training
+ The classifier was trained on 7,740,960 pairs of web samples and their scores from 0 to 5, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated based on their instruction/Q&A quality, with 0 being not instructional and 5 being highly instructional.

  Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:
  ```
+ Below is an extract from a PDF file. Evaluate whether the extract exhibits properties suitable for instruction-following or question-answering training data using the 6-point scoring system described below. Select the single score that best represents the extract's quality level:
+
+ **Score 0: Spam, Garbled, or Completely Unusable Content**
+ - Award 0 points for SEO spam content, promotional material with no educational value, completely garbled/corrupted text that is unreadable, random character sequences, or severely corrupted formatting that makes the content incomprehensible.
+
+ **Score 1: Simple Lists, Forms, or Minimal-Value Content**
+ - Award 1 point for content that has basic readable formatting but consists primarily of simple lists without context, forms, contact information, schedules, basic data tables without explanation, or other minimal-value structured content that lacks meaningful narrative or educational substance.
+
+ **Score 2: Cohesive Text Without Educational Value**
+ - Award 2 points if the extract contains cohesive, well-structured text that flows logically but lacks educational or instructional value. This includes meeting reports, business correspondence, letters, basic manual descriptions, administrative documents, or narrative content that doesn't teach or explain concepts.
+
+ **Score 3: Educational Content Without Q&A Structure**
+ - Award 3 points if the extract contains educational or informational content that could be useful for learning but doesn't follow a clear instructional format. This includes Wikipedia-style articles, research papers, academic content, encyclopedic entries, or explanatory text that presents information without explicit teaching structure.
+
+ **Score 4: Instructional Manuals and Structured Q&A**
+ - Award 4 points if the extract demonstrates clear instructional format with identifiable structure such as how-to guides, instruction manuals, structured question-answer pairs, problem-solution formats, or other organized pedagogical patterns. The content should be well-organized and follow recognizable educational conventions.
+
+ **Score 5: High-Quality Instructional Content with Explanations**
+ - Award 5 points if the extract exhibits exemplary instruction-response or question-answer properties with clear reasoning and detailed explanations. It should demonstrate thoughtful, step-by-step reasoning found in high-quality educational content like comprehensive tutorials, detailed explanations with context and reasoning, or expert-level instructional material that provides not just answers but explanatory reasoning and educational depth.
+
+ ## Evaluation Process
+
+ The extract: {example}
+
  After examining the extract:
+ - Briefly justify your total score, focusing on the content type and instructional/explanatory qualities, up to 100 words.
+ - Conclude with the score using the format: "Instruction/Q&A score: <total points>\
  ```
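The annotation pipeline itself is not part of this model card. As a small illustrative sketch (an assumption, not code from this repository), the final line the prompt asks for can be parsed back into an integer label like this:

```python
import re

# The prompt ends with 'Instruction/Q&A score: <total points>', so the label can be
# recovered from an annotator completion with a simple pattern match.
SCORE_PATTERN = re.compile(r"Instruction/Q&A score:\s*([0-5])")

def parse_score(completion: str) -> int | None:
    match = SCORE_PATTERN.search(completion)
    return int(match.group(1)) if match else None

print(parse_score("A step-by-step tutorial with worked solutions. Instruction/Q&A score: 4"))  # -> 4
```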

  We added a classification head with a single regression output to answerdotai/ModernBERT-large, unfroze the last 4 layers, and trained the model for 5000 steps with a learning rate of 3e-4.
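A minimal sketch of what that setup could look like in transformers, assuming the released training code differs in its details (data loading, scheduler, and the training loop are omitted, and the ModernBERT attribute names follow the current transformers implementation):

```python
import torch
from transformers import AutoModelForSequenceClassification

# num_labels=1 gives a single regression output, so predictions are continuous 0-5 scores.
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-large",
    num_labels=1,
    problem_type="regression",
)

# Freeze everything except the classification head and the last 4 encoder layers.
n_layers = model.config.num_hidden_layers
trainable_prefixes = ("classifier", "head") + tuple(
    f"model.layers.{i}." for i in range(n_layers - 4, n_layers)
)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(trainable_prefixes)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)
# ... 5000 optimizer steps over (chunked text, score) pairs would follow here ...
```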
 
  **Confusion matrix**

+ We verify that the predicted DCLM scores are indeed close to their ground truth, with most of the disagreement attributable to noisy annotations.
  ```
  Confusion Matrix:
  | class | 0 | 1 | 2 | 3 | 4 | 5 |


  ## Limitations
+ While the FinePDFs-DCLM classifier performs well in distinguishing high-quality instructional content within the FinePDFs dataset, there are some limitations:
 
+ - Scope: The model's performance might change for other datasets, in particular for out-of-distribution samples. It is also focused on instruction/Q&A content and may not perform as well on specialized domains.
+ - Bias: The model's performance depends on the quality and representativeness of the training data and of the LLM used for annotation. Biases in both can affect the classifier's judgments. It might overfit to instructional/Q&A-looking content at the higher scores, and we recommend using int_score >= 3.5 as a threshold for data curation (see the sketch after this list).
  - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
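
A small sketch of that curation step over already-scored records; the `int_score` field follows the recommendation above, while the record layout is an assumption for illustration:

```python
# Keep only documents whose classifier score clears the recommended threshold.
THRESHOLD = 3.5

def curate(records):
    return [record for record in records if record["int_score"] >= THRESHOLD]

sample = [
    {"text": "How do I configure the scheduler? Step 1: open the settings ...", "int_score": 4},
    {"text": "Minutes of the March board meeting ...", "int_score": 2},
]
print(curate(sample))  # keeps only the first record
```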
 
  The training and inference code is available on GitHub