---
library_name: transformers
language: en
license: apache-2.0
base_model: google/flan-t5-base
tags:
- generated_from_trainer
- text-preprocessing
- text-reformatting
datasets:
- other
model-index:
- name: flan-t5-base-paragrapher
  results: []
---

# flan-t5-base-paragrapher

This model preprocesses, cleans, and reformats text chunks containing stray line breaks, broken words, and inline references into coherent plain-text paragraphs. The resulting paragraphs can then be passed to other models such as [agentlans/flan-t5-small-title](https://huggingface.co/agentlans/flan-t5-small-title) and [agentlans/text-summarization](https://huggingface.co/agentlans/text-summarization).
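
For example, a cleaned paragraph can be piped straight into the title model. A minimal sketch, assuming both models load through the standard `text2text-generation` pipeline (the sample chunk below is illustrative):

```python
from transformers import pipeline

# Step 1: clean a raw chunk into a paragraph; step 2: generate a title for it.
paragrapher = pipeline("text2text-generation", model="agentlans/flan-t5-base-paragrapher")
titler = pipeline("text2text-generation", model="agentlans/flan-t5-small-title")

chunk = "ries also play a crucial role in shaping per-\nsonal and collective identities."
paragraph = paragrapher(chunk, max_length=512)[0]["generated_text"]
title = titler(paragraph, max_length=32)[0]["generated_text"]
print(title, paragraph, sep="\n")
```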

## Model description

flan-t5-base-paragrapher is a fine-tuned version of [google/flan-t5-base](https://huggingface.co/google/flan-t5-base), trained on a dataset of open-source introductory social science textbooks. Although training focused on academic prose, the model should also handle other educational and expository content reasonably well.

The model achieves the following results on the evaluation set:
- Loss: 1.5175
- Number of Input Tokens Seen: 49,815,380

## Intended uses & limitations

This model is intended for preprocessing and reformatting text chunks into coherent paragraphs. It can be particularly useful for:

1. Cleaning up text extracted from PDFs or OCR systems
2. Reformatting text with irregular line breaks or word breaks
3. Preparing text for further processing or analysis

Limitations:
- The model may not perform optimally on highly specialized or technical texts outside its training domain.
- Very long inputs may be truncated because of the model's 512-token maximum sequence length; see the chunking sketch below.
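
Longer documents need to be split before cleaning. A minimal chunking sketch; the fixed-width token split is illustrative, and splitting on sentence or line boundaries would preserve more context:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("agentlans/flan-t5-base-paragrapher")

def split_into_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into pieces that fit the 512-token limit (with headroom for special tokens)."""
    ids = tokenizer(text, add_special_tokens=False).input_ids
    return [
        tokenizer.decode(ids[i : i + max_tokens], skip_special_tokens=True)
        for i in range(0, len(ids), max_tokens)
    ]
```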

## Training and evaluation data

The model was trained on a dataset compiled from open-source textbooks. Due to licensing constraints, the specific training data is not published.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- Learning rate: 5e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- LR scheduler type: linear
- Number of epochs: 10.0
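
Since the card was generated from the Hugging Face `Trainer`, these settings correspond roughly to the following `Seq2SeqTrainingArguments`. This is a reconstruction rather than the exact training script, and `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the reported hyperparameters; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-base-paragrapher",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=10.0,
)
```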

### Training results

<details>
<summary>Click to expand training results</summary>

| Training Loss | Epoch  | Step  | Validation Loss | Input Tokens Seen |
|:-------------:|:------:|:-----:|:---------------:|:-----------------:|
| 2.0748        | 0.1126 | 500   | 1.7587          | 562752            |
| 1.9699        | 0.2251 | 1000  | 1.7031          | 1119424           |
| 1.9177        | 0.3377 | 1500  | 1.6701          | 1676620           |
| 1.9179        | 0.4502 | 2000  | 1.6647          | 2244928           |
| 1.8908        | 0.5628 | 2500  | 1.6502          | 2806840           |
| 1.8666        | 0.6754 | 3000  | 1.6427          | 3364792           |
| 1.8456        | 0.7879 | 3500  | 1.6245          | 3925172           |
| 1.8542        | 0.9005 | 4000  | 1.6218          | 4490100           |
| 1.8305        | 1.0131 | 4500  | 1.6211          | 5052066           |
| 1.7588        | 1.1256 | 5000  | 1.6040          | 5607258           |
| 1.7606        | 1.2382 | 5500  | 1.6020          | 6165278           |
| 1.7426        | 1.3507 | 6000  | 1.5993          | 6727290           |
| 1.7477        | 1.4633 | 6500  | 1.5869          | 7292338           |
| 1.7413        | 1.5759 | 7000  | 1.5791          | 7849466           |
| 1.7342        | 1.6884 | 7500  | 1.5792          | 8415302           |
| 1.7247        | 1.8010 | 8000  | 1.5759          | 8970490           |
| 1.7423        | 1.9136 | 8500  | 1.5744          | 9529290           |
| 1.7138        | 2.0261 | 9000  | 1.5655          | 10091652          |
| 1.6719        | 2.1387 | 9500  | 1.5630          | 10650544          |
| 1.6637        | 2.2512 | 10000 | 1.5584          | 11208648          |
| 1.6415        | 2.3638 | 10500 | 1.5609          | 11776396          |
| 1.6565        | 2.4764 | 11000 | 1.5558          | 12338500          |
| 1.6597        | 2.5889 | 11500 | 1.5530          | 12897552          |
| 1.6709        | 2.7015 | 12000 | 1.5477          | 13460052          |
| 1.648         | 2.8140 | 12500 | 1.5424          | 14021984          |
| 1.642         | 2.9266 | 13000 | 1.5433          | 14586256          |
| 1.6258        | 3.0392 | 13500 | 1.5419          | 15140609          |
| 1.6067        | 3.1517 | 14000 | 1.5415          | 15700397          |
| 1.5946        | 3.2643 | 14500 | 1.5450          | 16265849          |
| 1.5835        | 3.3769 | 15000 | 1.5415          | 16827557          |
| 1.5996        | 3.4894 | 15500 | 1.5411          | 17384857          |
| 1.5834        | 3.6020 | 16000 | 1.5382          | 17945909          |
| 1.5956        | 3.7145 | 16500 | 1.5351          | 18507721          |
| 1.5825        | 3.8271 | 17000 | 1.5356          | 19069425          |
| 1.6001        | 3.9397 | 17500 | 1.5294          | 19631905          |
| 1.5677        | 4.0522 | 18000 | 1.5369          | 20185192          |
| 1.5415        | 4.1648 | 18500 | 1.5318          | 20739888          |
| 1.5362        | 4.2774 | 19000 | 1.5311          | 21304584          |
| 1.5251        | 4.3899 | 19500 | 1.5323          | 21862856          |
| 1.5388        | 4.5025 | 20000 | 1.5307          | 22427236          |
| 1.5508        | 4.6150 | 20500 | 1.5282          | 22985184          |
| 1.5692        | 4.7276 | 21000 | 1.5265          | 23548396          |
| 1.5391        | 4.8402 | 21500 | 1.5276          | 24111452          |
| 1.5431        | 4.9527 | 22000 | 1.5270          | 24673344          |
| 1.5147        | 5.0653 | 22500 | 1.5292          | 25236559          |
| 1.4908        | 5.1778 | 23000 | 1.5288          | 25799675          |
| 1.5153        | 5.2904 | 23500 | 1.5288          | 26352767          |
| 1.5099        | 5.4030 | 24000 | 1.5250          | 26916707          |
| 1.5064        | 5.5155 | 24500 | 1.5259          | 27483639          |
| 1.5146        | 5.6281 | 25000 | 1.5249          | 28040307          |
| 1.4938        | 5.7407 | 25500 | 1.5233          | 28600639          |
| 1.5034        | 5.8532 | 26000 | 1.5237          | 29164539          |
| 1.5091        | 5.9658 | 26500 | 1.5219          | 29730199          |
| 1.4853        | 6.0783 | 27000 | 1.5241          | 30286010          |
| 1.4797        | 6.1909 | 27500 | 1.5201          | 30840802          |
| 1.466         | 6.3035 | 28000 | 1.5238          | 31403710          |
| 1.4666        | 6.4160 | 28500 | 1.5226          | 31962730          |
| 1.4732        | 6.5286 | 29000 | 1.5199          | 32518854          |
| 1.4756        | 6.6412 | 29500 | 1.5219          | 33083634          |
| 1.4778        | 6.7537 | 30000 | 1.5195          | 33644482          |
| 1.4674        | 6.8663 | 30500 | 1.5182          | 34207738          |
| 1.4813        | 6.9788 | 31000 | 1.5202          | 34772050          |
| 1.4543        | 7.0914 | 31500 | 1.5211          | 35331657          |
| 1.4389        | 7.2040 | 32000 | 1.5221          | 35888749          |
| 1.4534        | 7.3165 | 32500 | 1.5215          | 36455101          |
| 1.4401        | 7.4291 | 33000 | 1.5208          | 37016889          |
| 1.4435        | 7.5416 | 33500 | 1.5212          | 37570517          |
| 1.4443        | 7.6542 | 34000 | 1.5205          | 38134577          |
| 1.4533        | 7.7668 | 34500 | 1.5209          | 38700917          |
| 1.4589        | 7.8793 | 35000 | 1.5218          | 39259257          |
| 1.4548        | 7.9919 | 35500 | 1.5185          | 39819093          |
| 1.4322        | 8.1045 | 36000 | 1.5207          | 40382907          |
| 1.4271        | 8.2170 | 36500 | 1.5220          | 40938983          |
| 1.4165        | 8.3296 | 37000 | 1.5203          | 41498811          |
| 1.4273        | 8.4421 | 37500 | 1.5197          | 42053427          |
| 1.4281        | 8.5547 | 38000 | 1.5195          | 42615135          |
| 1.4372        | 8.6673 | 38500 | 1.5197          | 43173055          |
| 1.4374        | 8.7798 | 39000 | 1.5175          | 43737723          |
| 1.4278        | 8.8924 | 39500 | 1.5211          | 44300547          |
| 1.442         | 9.0050 | 40000 | 1.5189          | 44864787          |
| 1.4235        | 9.1175 | 40500 | 1.5226          | 45418155          |
| 1.413         | 9.2301 | 41000 | 1.5220          | 45985195          |
| 1.4193        | 9.3426 | 41500 | 1.5201          | 46538675          |
| 1.414         | 9.4552 | 42000 | 1.5202          | 47101815          |
| 1.4084        | 9.5678 | 42500 | 1.5191          | 47655583          |
| 1.408         | 9.6803 | 43000 | 1.5207          | 48217371          |
| 1.4207        | 9.7929 | 43500 | 1.5200          | 48781351          |
| 1.4293        | 9.9054 | 44000 | 1.5198          | 49345155          |

</details>

### Framework versions

- Transformers 4.44.2
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.19.1

## Usage

Here's an example of how to use the model:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("agentlans/flan-t5-base-paragrapher")
model = T5ForConditionalGeneration.from_pretrained(
    "agentlans/flan-t5-base-paragrapher", device_map="auto"
)

# Define input texts
# Note: These aren't real citations; they're for demonstration purposes only.
input_texts = [
    """ge with a narrative—whether through books, films, or oral traditions—we are invited into another person's experience (Brown & Thompson, 2023). This immersion allows us to see the world through different perspectives, breaking down barriers of misunderstanding and prejudice. For example, novels like Harper Lee's "To Kill a Mockingbird" challenge readers to confront issues of racism and injustice through the eyes of a child (Williams, 2018). Similarly, contemporary works such as Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world (Nguyen & Roberts, 2020). By sharing these experiences through storytelling, authors can cultivate empathy in their audiences, encouraging them to reflect on their own beliefs and biases.
    Shaping Identity Through Narratives
    Stories also play a crucial role in shaping personal and collective identities. From childhood tales told by parents to the myths and legends that define cultural heritage, narratives help individuals understand their place in the world (Anderson & White, 2021). They provide frameworks thro""",
    """cia, M., & Patel, R. (2022). Cultural insights through literature: A comparative analysis. International Journal of Cultural Studies, 15(3), 201-215. Johnson, L., & Lee, H. (2019). Oral traditions: Preserving culture through storytelling. Anthropology Today Journal, 34(4), 56-60. Kumar, P. (2021). Epic tales: Literature as a reflection of society. Literary Critique Review, 29(1), 34-50. Lee, J., & Martinez, F. (2021). Voices unheard: Marginalized narratives in digital spaces. Journal of Digital Culture Studies, 7(2), 45-67. Martinez, C., & Chen, Y. (2022). Cultural navigation: Identity in a globalized world. Global Studies Review Jou""",
]

# Tokenize input texts; keep the attention mask so padded positions are ignored
inputs = tokenizer(
    input_texts, return_tensors="pt", padding=True, truncation=True
).to(model.device)

# Generate outputs
outputs = model.generate(**inputs, max_length=512)

# Print generated outputs
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True) + "\n")
```

Example output:

```
Through storytelling, we are invited into another person's experience, breaking down barriers of misunderstanding and prejudice. This immersion allows us to see the world through different perspectives, fostering empathy and re-evaluating our own beliefs and biases. For instance, Harper Lee's "To Kill a Mockingbird" challenges readers to confront issues of racism and injustice through the eyes of a child, while contemporary works like Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world. By sharing these experiences through storytelling, authors
```

```
The study of cultural insights through literature has yielded valuable insights into the world. Ci and Patel (2022) conducted a comparative analysis of cultural insights through literature, highlighting the importance of cultural storytelling in preserving culture. Kumar (2021) argued that oral traditions can preserve culture through storytelling, highlighting the importance of storytelling in preserving culture. Lee and Martinez (2021) explored marginalized narratives in digital spaces, highlighting the need for cultural navigation in a globalized world. These studies collectively demonstrate the importance of cultural navigation in fostering identity and identity in a globalized world.
```