# Comic Story Generator: Code Handover Document
**Date:** 2025-07-22
**Document Purpose:** This document provides a comprehensive technical handover for the Comic Story Generator project. It is intended for developers and future maintainers responsible for the deployment, maintenance, and extension of the application.
---
## 1. Project Overview
The Comic Story Generator is a web application that automatically creates multi-page, textless comic stories from a user-provided description. The application leverages generative AI to produce visually coherent narratives, focusing on character consistency, expressive emotion, and logical panel sequencing.
### 1.1. Core Functionality
The application is designed to translate a textual story concept into a purely visual comic strip. Key characteristics include:
* **AI-Powered Narrative:** Utilizes Google's Gemini to interpret the user's concept and break it down into a structured, panel-by-panel narrative.
* **Visual Generation:** Employs a GPT-based image model to render complete comic pages based on the AI-generated narrative structure.
* **Intelligent Panel Detection:** Uses Gemini Vision to analyze the generated full-page image and accurately detect the boundaries of each panel, ensuring precise splitting.
* **Customization:** Offers users control over the output, including:
    * **Layout:** Choice of panel count (from 4 to 24).
    * **Length:** Generation of 1 to 10 pages.
    * **Art Style:** A selection of visual styles, including "Classic Comic," "Manga," "Cartoon," "Digital Paint," and a high-contrast "Accessible" style designed for users with special needs.
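
The limits listed above can be checked before any API call is made. The following is a minimal sketch, not the project's actual code: `ComicRequest` and `validate_request` are hypothetical names, and the style list only contains the styles named in this document, so the application may accept others.

```python
from dataclasses import dataclass

# Limits taken from the Customization list in this document.
MIN_PANELS, MAX_PANELS = 4, 24
MIN_PAGES, MAX_PAGES = 1, 10
# Assumption: only the styles named in this document; the app may define more.
ART_STYLES = ("Classic Comic", "Manga", "Cartoon", "Digital Paint", "Accessible")


@dataclass(frozen=True)
class ComicRequest:
    """Hypothetical container for the user's generation settings."""
    description: str
    panels: int = 4
    pages: int = 1
    style: str = "Classic Comic"


def validate_request(req: ComicRequest) -> list[str]:
    """Return a list of problems; an empty list means the request is valid."""
    problems = []
    if not req.description.strip():
        problems.append("description must not be empty")
    if not MIN_PANELS <= req.panels <= MAX_PANELS:
        problems.append(f"panels must be between {MIN_PANELS} and {MAX_PANELS}")
    if not MIN_PAGES <= req.pages <= MAX_PAGES:
        problems.append(f"pages must be between {MIN_PAGES} and {MAX_PAGES}")
    if req.style not in ART_STYLES:
        problems.append(f"unknown art style: {req.style!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a web form report every invalid field in a single round trip.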
### 1.2. High-Level Workflow
The generation process follows a clear, multi-step pipeline:
1. **User Input:** The user submits a short description of the desired story.
2. **Story Generation:** The `StoryGenerator` component uses Gemini to create a detailed, scene-by-scene description for each comic panel.
3. **Page Generation:** The `ComicGenerator` takes the panel descriptions and instructs the GPT-Image model to generate a single, composite image representing a full comic page with panels arranged in a grid.
4. **Layout Analysis:** The generated page is passed to the `GeminiVision` component, which analyzes the image to identify the precise coordinates and boundaries of each panel.
5. **Panel Splitting:** The application uses the coordinates from the vision analysis to accurately split the composite image into individual panel images.
6. **Final Output:** The processed panels are presented to the user as a complete, multi-page visual story.
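
Steps 2 through 5 can be sketched as a single orchestration function. This is an illustration under assumptions rather than the code in `app.py`: the four stages are passed in as plain callables so the sketch does not depend on how the real classes are constructed, and the `panels_per_page` chunking is a guess at how multi-page stories are divided.

```python
from typing import Any, Callable


def run_pipeline(
    description: str,
    generate_story: Callable[[str], list],
    generate_page: Callable[[list], Any],
    analyze_layout: Callable[[Any], dict],
    split_panels: Callable[[Any, dict], list],
    panels_per_page: int = 4,
) -> list:
    """Run steps 2-5 and return one list of panel images per page."""
    panel_descriptions = generate_story(description)        # step 2: story generation
    pages = []
    # Assumption: descriptions are consumed in fixed-size chunks, one chunk per page.
    for start in range(0, len(panel_descriptions), panels_per_page):
        chunk = panel_descriptions[start:start + panels_per_page]
        page_image = generate_page(chunk)                    # step 3: page generation
        layout = analyze_layout(page_image)                  # step 4: layout analysis
        pages.append(split_panels(page_image, layout))       # step 5: panel splitting
    return pages
```

Because each stage is injected, the same function can be exercised with stub callables in tests and with the real `StoryGenerator`, `ComicGenerator`, and `GeminiVision` methods in production.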
---
## 2. System Architecture
The application is built on a modular architecture composed of three primary classes, each responsible for a distinct part of the generation pipeline.
### 2.1. System Diagram
```mermaid
classDiagram
class StoryGenerator{
+generate_story(description: string) : list[string]
+enhance_visuals(panel_descriptions: list) : list[string]
}
class ComicGenerator{
+generate_page(panel_descriptions: list) : Image
+split_panels(page_image: Image, grid_layout: dict) : list[Image]
}
class GeminiVision{
+analyze_layout(page_image: Image) : dict
}
StoryGenerator "1" -- "1" ComicGenerator : Provides panel descriptions
ComicGenerator "1" -- "1" GeminiVision : Uses for layout analysis
```
### 2.2. Data Flow
The end-to-end data flow illustrates the interaction between the user, the application, and the underlying AI models.
```mermaid
sequenceDiagram
participant User
participant App
participant Gemini as Gemini (Text/Story)
participant GPTImage as GPT-Image (Visuals)
participant GeminiVision as Gemini Vision (Analysis)
User->>+App: Submits story description
App->>+Gemini: Requests story structure from description
Gemini-->>-App: Returns panel-by-panel text descriptions
App->>+GPTImage: Requests comic page generation from descriptions
GPTImage-->>-App: Returns single full-page image
App->>+GeminiVision: Requests layout analysis of the image
GeminiVision-->>-App: Returns coordinates of each panel
App->>User: Displays final, split-panel comic
```
---
## 3. Setup & Installation
### 3.1. Prerequisites
* **Python:** Version 3.9 or higher.
* **API Keys:**
    * An active OpenAI API key.
    * An active Google API key with access to the Gemini family of models.
### 3.2. Installation Steps
1. **Clone the Repository:**
```bash
git clone https://github.com/yourusername/Comic-Story-Generator.git
cd Comic-Story-Generator
```
2. **Create and Activate a Virtual Environment:**
```bash
# Create the environment
python -m venv venv
# Activate the environment (macOS/Linux)
source venv/bin/activate
# Or, activate on Windows
# venv\Scripts\activate
```
3. **Install Dependencies:**
```bash
pip install -r requirements.txt
```
4. **Configure Environment Variables:**
Create a `.env` file in the project root and add your API keys.
```bash
echo "OPENAI_API_KEY=your_openai_key" > .env
echo "GOOGLE_API_KEY=your_google_key" >> .env
```
*Note: Ensure the `.env` file is added to your `.gitignore` file to prevent committing secrets.*
---
## 4. Environment Variables / Secrets
The application requires the following environment variables to be set in a `.env` file at the project's root.
| Variable | Description | Required | Example |
| :--- | :--- | :--- | :--- |
| `OPENAI_API_KEY` | API key for the OpenAI service, used for GPT-Image generation. | Yes | `sk-xxxxxxxxxxxxxxxxxxxxxxxx` |
| `GOOGLE_API_KEY` | API key for Google AI services, used for Gemini (story structure) and Gemini Vision (layout analysis). | Yes | `AIzaSyxxxxxxxxxxxxxxxxxxxxx` |
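
A startup check along the following lines can fail fast when either variable is missing. It is a sketch under assumptions: `missing_keys` and `require_keys` are hypothetical helpers, and the code assumes the `.env` file has already been loaded into the process environment (for example with `load_dotenv()` from `python-dotenv`).

```python
import os
from typing import Mapping, Optional

# The two variables listed as required in the table above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


def require_keys(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise a descriptive error if any required variable is missing."""
    missing = missing_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_keys()` once at startup surfaces a configuration problem immediately instead of partway through a generation run.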
---
## 5. How to Run
After completing the setup and installation steps, launch the application with the following command from the project's root directory:
```bash
python app.py
```
The application will start a local web server, and the interface will be accessible at the URL provided in the console (typically `http://127.0.0.1:7860`).
---
## 6. Deployment Instructions
[TODO] This section requires documentation for deploying the application to a production environment. Steps should include:
* Recommended hosting provider (e.g., AWS, Heroku, DigitalOcean).
* Instructions for setting up a production-grade web server (e.g., Gunicorn).
* Configuration of a reverse proxy (e.g., Nginx).
* Management of production environment variables/secrets.
* Process management (e.g., using `systemd`).
---
## 7. Core Components & Logic
The application logic is encapsulated in three main classes.
### 7.1. `StoryGenerator`
* **Responsibility:** Handles the narrative creation phase.
* **`generate_story()`:** Takes the raw user description as input. It constructs a prompt for the Gemini model to elicit a structured response containing a list of detailed text descriptions, one for each comic panel.
* **`enhance_visuals()`:** Processes the panel descriptions to add specific visual cues and optimizations, particularly for the "Accessible" style, ensuring high contrast and simplified object representation.
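
The prompt-and-parse pattern described for `generate_story()` might look roughly like this. The prompt wording, the JSON-array response format, and both function names are assumptions made for illustration; the actual prompt and parsing in the codebase may differ.

```python
import json


def build_story_prompt(description: str, panel_count: int) -> str:
    """Build a prompt asking for exactly `panel_count` panel descriptions as JSON."""
    return (
        f"Break the following story into exactly {panel_count} comic panels.\n"
        "The comic is textless: describe only what is visible, no dialogue or captions.\n"
        "Reply with a JSON array of strings, one description per panel.\n\n"
        f"Story: {description}"
    )


def parse_panel_descriptions(response_text: str, panel_count: int) -> list[str]:
    """Extract the JSON array from a model reply and check its length."""
    # Replies may wrap the JSON in extra prose, so take the outermost [...] span.
    start, end = response_text.find("["), response_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in model response")
    panels = json.loads(response_text[start:end + 1])
    if not isinstance(panels, list) or not all(isinstance(p, str) for p in panels):
        raise ValueError("expected a JSON array of strings")
    if len(panels) != panel_count:
        raise ValueError(f"expected {panel_count} panels, got {len(panels)}")
    return [p.strip() for p in panels]
```

Validating the panel count at this point keeps a malformed reply from propagating into the more expensive image-generation step.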
### 7.2. `ComicGenerator`
* **Responsibility:** Manages the visual generation and processing of the comic page.
* **`generate_page()`:** Aggregates the panel descriptions from `StoryGenerator` into a single, complex prompt for the GPT-Image model. This prompt instructs the AI to create one composite image with all panels laid out in a grid.
* **`split_panels()`:** Receives the generated page image and the layout data from `GeminiVision`. It uses this data to crop the page into individual panel images with high precision.
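
The coordinate handling behind `split_panels()` can be sketched without any image library. The layout shape used here, a `"panels"` list of `x`/`y`/`width`/`height` rectangles in pixels, is an assumption, since this document does not specify the dictionary returned by `GeminiVision`; `crop_boxes` is a hypothetical helper.

```python
def crop_boxes(layout: dict, page_width: int, page_height: int) -> list[tuple[int, int, int, int]]:
    """Convert panel rectangles into (left, upper, right, lower) crop boxes."""
    boxes = []
    for panel in layout["panels"]:
        # Clamp to the page so a slightly-off detection never reaches outside the image.
        left = max(0, int(panel["x"]))
        upper = max(0, int(panel["y"]))
        right = min(page_width, int(panel["x"] + panel["width"]))
        lower = min(page_height, int(panel["y"] + panel["height"]))
        if right > left and lower > upper:  # skip empty or inverted rectangles
            boxes.append((left, upper, right, lower))
    # Sort top-to-bottom, then left-to-right. This simple ordering assumes
    # panels in the same row share the same top edge.
    return sorted(boxes, key=lambda box: (box[1], box[0]))
```

Each resulting tuple has the `(left, upper, right, lower)` form that Pillow's `Image.crop()` accepts, so the boxes can be applied to the page image directly.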
### 7.3. `GeminiVision`
* **Responsibility:** Performs visual analysis on the generated comic page.
* **`analyze_layout()`:** This is the core of the intelligent panel-splitting feature. It takes the full-page image as input and uses the Gemini Vision model to visually identify the boundaries of each panel. It returns a dictionary containing the coordinates and dimensions of the detected grid. Detecting the boundaries from the image itself is more robust than assuming a fixed grid layout, because the image model does not always place panels on an exact grid.
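
When detection fails or returns the wrong number of panels, one option is to fall back to an evenly divided grid. The sketch below is a suggestion, not existing behaviour: the `"panels"` layout shape and both function names are hypothetical, and the near-square grid is only one reasonable choice.

```python
import math


def uniform_grid_layout(page_width: int, page_height: int, panel_count: int) -> dict:
    """Fallback layout: divide the page into an evenly spaced, near-square grid."""
    cols = math.ceil(math.sqrt(panel_count))
    rows = math.ceil(panel_count / cols)
    cell_w, cell_h = page_width // cols, page_height // rows
    panels = []
    for index in range(panel_count):
        row, col = divmod(index, cols)
        panels.append({"x": col * cell_w, "y": row * cell_h,
                       "width": cell_w, "height": cell_h})
    return {"panels": panels}


def layout_or_fallback(detected, page_width: int, page_height: int, panel_count: int) -> dict:
    """Use the detected layout only if it contains the expected number of panels."""
    if isinstance(detected, dict) and len(detected.get("panels", [])) == panel_count:
        return detected
    return uniform_grid_layout(page_width, page_height, panel_count)
```

The fallback trades precision for availability: the user still receives split panels, though the cuts may clip artwork if the generated page is not actually grid-aligned.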
---
## 8. Third-party Dependencies
The complete list of Python packages is specified in `requirements.txt`. Key dependencies include:
* **`openai`**: Python client for the OpenAI API.
* **`google-generativeai`**: Python client for the Google AI (Gemini) API.
* **`python-dotenv`**: For loading environment variables from the `.env` file.
* **`Pillow`**: For image manipulation (cropping and saving).
* **[Info Needed]**: The web framework used to build `app.py` (e.g., `gradio`, `flask`, `fastapi`).
---
## 9. Testing Instructions
[TODO] A testing framework has not been established for this project. Future work should include:
* **Test Suite Setup:** Choose and configure a testing framework (e.g., `pytest`).
* **Unit Tests:** Create unit tests for individual methods in `StoryGenerator`, `ComicGenerator`, and `GeminiVision`. This should involve mocking the API calls to AI services to test the data processing logic in isolation.
* **Integration Tests:** Develop tests for the entire generation pipeline, from user input to final split panels.
* **Continuous Integration:** Set up a CI pipeline (e.g., using GitHub Actions) to run tests automatically on pull requests.
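
As a starting point for the unit tests, the mocking approach could look like the following. The `StoryGenerator` here is a stand-in written for this example, with a hypothetical injected `client` object and `complete()` method; the real class will need its own seam for substituting the Gemini call.

```python
import unittest
from unittest.mock import MagicMock


class StoryGenerator:
    """Stand-in with an injected client; the real class may be wired differently."""

    def __init__(self, client):
        self.client = client

    def generate_story(self, description: str) -> list:
        # `complete` is a hypothetical method name for the text-model call.
        reply = self.client.complete(f"Describe panels for: {description}")
        return [line.strip() for line in reply.splitlines() if line.strip()]


class GenerateStoryTest(unittest.TestCase):
    def test_splits_reply_into_panels(self):
        client = MagicMock()
        client.complete.return_value = "Panel one\n\n  Panel two  \n"
        panels = StoryGenerator(client).generate_story("a cat finds a hat")
        # Blank lines are dropped and surrounding whitespace is trimmed.
        self.assertEqual(panels, ["Panel one", "Panel two"])
        client.complete.assert_called_once()
```

A test case written this way runs under `python -m unittest` and is also collected by `pytest`, so it does not lock the project into either runner.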
---
## 10. Troubleshooting & Common Issues
[TODO] This section should be populated as common issues are identified. Potential areas to document include:
* **API Key Errors:** Steps to verify that API keys are correctly configured and have the necessary permissions.
* **Incoherent Stories:** Guidance on how to write effective initial descriptions to improve narrative quality.
* **Poor Panel Splitting:** Troubleshooting steps for when Gemini Vision fails to detect the layout correctly (e.g., checking image complexity, trying a different art style).
* **Long Generation Times:** Explanation of typical performance and factors that can cause delays (e.g., API provider latency, number of panels).
---
## 11. TODOs / Future Work
Based on the project's focus areas, the following are key areas for future development and contribution:
* **Core Generation Logic:**
    * Improve character consistency across multiple pages.
    * Experiment with different AI models for potentially better visual or narrative results.
    * Add support for including text (dialogue, captions) as an optional feature.
* **UI/UX Enhancements:**
    * Develop a more interactive interface for viewing and arranging panels.
    * Allow users to regenerate individual panels without restarting the entire process.
    * Add an option to export the final comic as a PDF or other formats.
* **Accessibility Improvements:**
    * Further refine the "Accessible" art style based on user feedback.
    * Implement ARIA attributes and ensure full keyboard navigability for the web interface.
    * Add an "image description" feature where a text-to-speech engine can describe the generated panels.
* **Documentation:**
    * Create a detailed API reference for developers looking to build on the platform.
    * Write user-facing guides on how to get the best results from the generator.
---
## 12. Contact / Ownership Info
* **Source Code:** [https://github.com/yourusername/Comic-Story-Generator](https://github.com/yourusername/Comic-Story-Generator)
* **License:** This project is licensed under the **MIT License**. For full details, see the `LICENSE` file in the repository.
* **Primary Contact:** [Info Needed: Add primary maintainer's name and contact information (e.g., GitHub handle or email).]