Update README.md
README.md
CHANGED
@@ -423,51 +423,55 @@ model-index:
verified: true
verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNDQxZmEwYmU5MGI1ZWE5NTIyMmM1MTVlMjVjNTg4MDQyMjJhNGE5NDJhNmZiN2Y4ZDc4ZmExNjBkMjQzMjQxMyIsInZlcnNpb24iOjF9.o3WblPY-iL1vT66xPwyyi1VMPhI53qs9GJ5HsHGbglOALwZT4n2-6IRxRNcL2lLj9qUehWUKkhruUyDM5-4RBg
---

# LED-Based Summarization Model: Condensing Long and Technical Information

<a href="https://colab.research.google.com/gist/pszemraj/36950064ca76161d9d258e5cdbfa6833/led-base-demo-token-batching.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

The Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization is a model I developed to condense extensive technical, academic, and narrative content.

## Key Features and Use Cases

- Ideal for summarizing long narratives, articles, papers, textbooks, and other technical documents.
- Trained to not only summarize but also explain the summarized content, offering more insightful output.
- High capacity: handles up to 16,384 tokens per batch.
- Live demos available: [Colab demo](https://colab.research.google.com/gist/pszemraj/36950064ca76161d9d258e5cdbfa6833/led-base-demo-token-batching.ipynb) and [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text).

> **Note:** The hosted inference API is configured to generate a maximum of 64 tokens due to runtime constraints, so its output will appear truncated. For best results, use the Python approach detailed below.

## Training Details

The model was trained on the BookSum dataset released by Salesforce, which is why it carries the `bsd-3-clause` license. Training ran for 16 epochs, with hyperparameters tweaked for a gentle, fine-tuning-style run (a very low learning rate).
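As a rough illustration of the setup described above (many epochs at a very low learning rate), here is a minimal sketch using `transformers`' `Seq2SeqTrainingArguments`. Every value below is a placeholder for illustration only, not the actual configuration used to train this checkpoint.

```python
# Illustrative sketch only: placeholder hyperparameters, not this checkpoint's real config.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="led-base-book-summary-ft",  # hypothetical output directory
    num_train_epochs=16,                    # epoch count mentioned above
    learning_rate=5e-6,                     # "super low" learning rate (placeholder value)
    per_device_train_batch_size=1,          # 16k-token inputs force tiny per-device batches
    gradient_accumulation_steps=16,         # placeholder effective batch size
    predict_with_generate=True,             # generate summaries during evaluation
)
```

In practice these arguments would be paired with a `Seq2SeqTrainer` and a tokenized copy of `kmfoda/booksum`.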

Model checkpoint: [`pszemraj/led-base-16384-finetuned-booksum`](https://huggingface.co/pszemraj/led-base-16384-finetuned-booksum).

For comparison, all generation parameters for the API have been kept consistent across versions.

## Other Related Checkpoints

Apart from this LED-based model, I have also fine-tuned other models on `kmfoda/booksum`:

- [Long-T5-Global-Base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary)
- [BigBird-Pegasus-Large-K](https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum)
- [Pegasus-X-Large](https://huggingface.co/pszemraj/pegasus-x-large-book-summary)
- [Long-T5-Global-XL](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary)

There are also variants trained on other datasets on my Hugging Face profile; feel free to try them out :)

---

## Basic Usage

I recommend using `encoder_no_repeat_ngram_size=3` when calling the pipeline object, as it improves summary quality by encouraging the use of new vocabulary and producing a genuinely abstractive summary; otherwise the model may simply compile the best _extractive_ summary from the input.

Create the pipeline object:

```python
import torch
from transformers import pipeline

hf_name = "pszemraj/led-base-book-summary"

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,  # use GPU 0 if available
)
```

Feed the text into the pipeline object:

```python
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)
print(result[0]["summary_text"])
```
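The linked Colab demo batches long inputs by tokens so they fit within the model's context window. As a minimal sketch of that idea, reusing `hf_name` and `summarizer` from the snippets above (the chunk size, helper name, and joining strategy are arbitrary illustrative choices, not the demo's exact logic):

```python
# Token-batching sketch: split a long document into token chunks, summarize each
# chunk, then join the partial summaries. Reuses `hf_name` and `summarizer` from above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(hf_name)

def summarize_long(text: str, max_tokens: int = 4096) -> str:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
    partial_summaries = [
        summarizer(chunk, max_length=256, no_repeat_ngram_size=3)[0]["summary_text"]
        for chunk in chunks
    ]
    return "\n".join(partial_summaries)
```

The `textsum` package described below handles this batching (and parameter management) for you.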

## Simplified Usage with TextSum

To streamline the process of using this and other models, I've developed [a Python package utility](https://github.com/pszemraj/textsum) named `textsum`. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.

Install TextSum:

```bash
pip install textsum
```

Then use it in Python with this model:

```python
from textsum.summarize import Summarizer

model_name = "pszemraj/led-base-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # how many tokens to batch summarize at a time
)
long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
```

Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a shareable demo application. For detailed explanations and documentation, check the README or the wiki.
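For example, per the `textsum` project documentation, the CLI provides a `textsum-dir` entry point for batch-summarizing a folder of text files. Treat the exact command and options as assumptions and confirm them against `textsum-dir --help` for your installed version:

```bash
# Batch-summarize a directory of .txt files from the command line.
# The textsum-dir entry point is taken from the textsum docs; available flags may vary by version.
textsum-dir --help               # list the options supported by your installed version
textsum-dir /path/to/documents   # hypothetical directory of text files to summarize
```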

---