Update README.md
README.md
CHANGED
@@ -423,51 +423,55 @@ model-index:
verified: true
verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNDQxZmEwYmU5MGI1ZWE5NTIyMmM1MTVlMjVjNTg4MDQyMjJhNGE5NDJhNmZiN2Y4ZDc4ZmExNjBkMjQzMjQxMyIsInZlcnNpb24iOjF9.o3WblPY-iL1vT66xPwyyi1VMPhI53qs9GJ5HsHGbglOALwZT4n2-6IRxRNcL2lLj9qUehWUKkhruUyDM5-4RBg
---

# LED-Based Summarization Model: Condensing Long and Technical Information

<a href="https://colab.research.google.com/gist/pszemraj/36950064ca76161d9d258e5cdbfa6833/led-base-demo-token-batching.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

The Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization is a model I developed to condense extensive technical, academic, and narrative content.

## Key Features and Use Cases

- Ideal for summarizing long narratives, articles, papers, textbooks, and other technical documents.
- Trained to not only summarize but also explain the summarized content, offering more insightful output.
- High capacity: handles up to 16,384 tokens per batch.
- Live demos available: [Colab demo](https://colab.research.google.com/gist/pszemraj/36950064ca76161d9d258e5cdbfa6833/led-base-demo-token-batching.ipynb) and [demo on Spaces](https://huggingface.co/spaces/pszemraj/summarize-long-text).

> **Note:** The hosted inference API is configured to generate a maximum of 64 tokens due to runtime constraints, so its output will appear truncated. For best results, use the Python approach detailed below.

## Training Details

The model was trained on the BookSum dataset released by Salesforce, which is why it carries the `bsd-3-clause` license. Training ran for 16 epochs, with hyperparameters tweaked for a gentle, fine-tuning-style run (a very low learning rate).
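As a rough illustration of the setup described above (many epochs at a very low learning rate), here is a minimal sketch using `transformers`' `Seq2SeqTrainingArguments`. Every value below is a placeholder for illustration only, not the actual configuration used to train this checkpoint.

```python
# Illustrative sketch only: placeholder hyperparameters, not this checkpoint's real config.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="led-base-book-summary-ft",  # hypothetical output directory
    num_train_epochs=16,                    # epoch count mentioned above
    learning_rate=5e-6,                     # "super low" learning rate (placeholder value)
    per_device_train_batch_size=1,          # 16k-token inputs force tiny per-device batches
    gradient_accumulation_steps=16,         # placeholder effective batch size
    predict_with_generate=True,             # generate summaries during evaluation
)
```

In practice these arguments would be paired with a `Seq2SeqTrainer` and a tokenized copy of `kmfoda/booksum`.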

Model checkpoint: [`pszemraj/led-base-16384-finetuned-booksum`](https://huggingface.co/pszemraj/led-base-16384-finetuned-booksum).

For comparison, all generation parameters for the API have been kept consistent across versions.

## Other Related Checkpoints

Apart from this LED-based model, I have also fine-tuned other models on `kmfoda/booksum`:

- [Long-T5-Global-Base](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary)
- [BigBird-Pegasus-Large-K](https://huggingface.co/pszemraj/bigbird-pegasus-large-K-booksum)
- [Pegasus-X-Large](https://huggingface.co/pszemraj/pegasus-x-large-book-summary)
- [Long-T5-Global-XL](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary)

There are also variants trained on other datasets on my Hugging Face profile; feel free to try them out :)

---

## Basic Usage

I recommend using `encoder_no_repeat_ngram_size=3` when calling the pipeline object, as it improves summary quality by encouraging the use of new vocabulary and producing a genuinely abstractive summary; otherwise the model may simply compile the best _extractive_ summary from the input.

Create the pipeline object:

```python
import torch
from transformers import pipeline

hf_name = "pszemraj/led-base-book-summary"

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,  # use GPU 0 if available
)
```

Feed the text into the pipeline object:

```python
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)
print(result[0]["summary_text"])
```
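The linked Colab demo batches long inputs by tokens so they fit within the model's context window. As a minimal sketch of that idea, reusing `hf_name` and `summarizer` from the snippets above (the chunk size, helper name, and joining strategy are arbitrary illustrative choices, not the demo's exact logic):

```python
# Token-batching sketch: split a long document into token chunks, summarize each
# chunk, then join the partial summaries. Reuses `hf_name` and `summarizer` from above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(hf_name)

def summarize_long(text: str, max_tokens: int = 4096) -> str:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
    partial_summaries = [
        summarizer(chunk, max_length=256, no_repeat_ngram_size=3)[0]["summary_text"]
        for chunk in chunks
    ]
    return "\n".join(partial_summaries)
```

The `textsum` package described below handles this batching (and parameter management) for you.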

## Simplified Usage with TextSum

To streamline the process of using this and other models, I've developed [a Python package utility](https://github.com/pszemraj/textsum) named `textsum`. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.

Install TextSum:

```bash
pip install textsum
```

Then use it in Python with this model:

```python
from textsum.summarize import Summarizer

model_name = "pszemraj/led-base-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # how many tokens to batch summarize at a time
)
long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
```

Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a shareable demo application. For detailed explanations and documentation, check the README or the wiki.
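For example, per the `textsum` project documentation, the CLI provides a `textsum-dir` entry point for batch-summarizing a folder of text files. Treat the exact command and options as assumptions and confirm them against `textsum-dir --help` for your installed version:

```bash
# Batch-summarize a directory of .txt files from the command line.
# The textsum-dir entry point is taken from the textsum docs; available flags may vary by version.
textsum-dir --help               # list the options supported by your installed version
textsum-dir /path/to/documents   # hypothetical directory of text files to summarize
```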

---