Comic Story Generator: Code Handover Document

Date: 2025-07-22

Document Purpose: This document provides a comprehensive technical handover for the Comic Story Generator project. It is intended for developers and future maintainers responsible for the deployment, maintenance, and extension of the application.

1. Project Overview

The Comic Story Generator is a web application that automatically creates multi-page, textless comic stories from a user-provided description. The application leverages generative AI to produce visually coherent narratives, focusing on character consistency, expressive emotion, and logical panel sequencing.

1.1. Core Functionality

The application is designed to translate a textual story concept into a purely visual comic strip. Key characteristics include:

AI-Powered Narrative: Uses Google's Gemini to interpret the user's concept and break it down into a structured, panel-by-panel narrative.
Visual Generation: Uses the GPT-Image model to render complete comic pages from the AI-generated narrative structure.
Intelligent Panel Detection: Uses Gemini Vision to analyze the generated full-page image and detect the boundaries of each panel, so the page is split along the detected edges rather than an assumed grid.
Customization: Offers users control over the output, including:
- Layout: Choice of panel count (from 4 to 24).
- Length: Generation of 1 to 10 pages.
- Art Style: A selection of visual styles, including "Classic Comic," "Manga," "Cartoon," "Digital Paint," and a high-contrast "Accessible" style designed for users with special needs.
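The option ranges listed above could be validated before any API call is made. The following is a minimal sketch, not the project's actual code: the `ComicSettings` class and its field names are hypothetical, and the source does not say whether the panel count applies per page or per story.

```python
from dataclasses import dataclass

# Style names come from the list above; the constant name is illustrative.
ART_STYLES = ("Classic Comic", "Manga", "Cartoon", "Digital Paint", "Accessible")


@dataclass(frozen=True)
class ComicSettings:
    """Hypothetical container for the user-facing generation options."""
    panel_count: int = 4
    pages: int = 1
    art_style: str = "Classic Comic"

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges as early as possible.
        if not 4 <= self.panel_count <= 24:
            raise ValueError("panel_count must be between 4 and 24")
        if not 1 <= self.pages <= 10:
            raise ValueError("pages must be between 1 and 10")
        if self.art_style not in ART_STYLES:
            raise ValueError(f"unknown art style: {self.art_style!r}")


settings = ComicSettings(panel_count=6, pages=2, art_style="Manga")
```

Failing fast here avoids spending paid API calls on a request that cannot succeed.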

1.2. High-Level Workflow

The generation process follows a clear, multi-step pipeline:

User Input: The user submits a short description of the desired story.
Story Generation: The StoryGenerator component uses Gemini to create a detailed, scene-by-scene description for each comic panel.
Page Generation: The ComicGenerator takes the panel descriptions and instructs the GPT-Image model to generate a single, composite image representing a full comic page with panels arranged in a grid.
Layout Analysis: The generated page is passed to the GeminiVision component, which analyzes the image to identify the precise coordinates and boundaries of each panel.
Panel Splitting: The application uses the coordinates from the vision analysis to accurately split the composite image into individual panel images.
Final Output: The processed panels are presented to the user as a complete, multi-page visual story.
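The workflow above can be sketched as a single function that chains the stages. This is an illustration under assumptions: the stage callables stand in for the real `StoryGenerator`, `ComicGenerator`, and `GeminiVision` methods, and splitting the story into pages by a fixed `panels_per_page` is a guess at how multi-page output is organized.

```python
from typing import Callable, List, Sequence


def run_pipeline(
    description: str,
    generate_story: Callable[[str], List[str]],
    generate_page: Callable[[Sequence[str]], object],
    analyze_layout: Callable[[object], dict],
    split_panels: Callable[[object, dict], list],
    panels_per_page: int,
) -> List[list]:
    """Chain the stages: story -> page image -> layout -> split panels."""
    panel_descriptions = generate_story(description)
    pages = []
    # Assumption: each page receives a consecutive slice of the panel list.
    for start in range(0, len(panel_descriptions), panels_per_page):
        chunk = panel_descriptions[start:start + panels_per_page]
        page_image = generate_page(chunk)
        layout = analyze_layout(page_image)
        pages.append(split_panels(page_image, layout))
    return pages


# Stub stages so the sketch runs without any API calls.
demo = run_pipeline(
    "a cat learns to fly",
    generate_story=lambda d: [f"{d}: panel {i}" for i in range(1, 9)],
    generate_page=lambda chunk: {"panels": list(chunk)},
    analyze_layout=lambda page: {"count": len(page["panels"])},
    split_panels=lambda page, layout: page["panels"][: layout["count"]],
    panels_per_page=4,
)
```

Passing the stages in as callables keeps the orchestration testable without network access.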

2. System Architecture

The application is built on a modular architecture composed of three primary classes, each responsible for a distinct part of the generation pipeline.

2.1. System Diagram

classDiagram
    class StoryGenerator{
        +generate_story(description: string) : list[string]
        +enhance_visuals(panel_descriptions: list) : list[string]
    }
    class ComicGenerator{
        +generate_page(panel_descriptions: list) : Image
        +split_panels(page_image: Image, grid_layout: dict) : list[Image]
    }
    class GeminiVision{
        +analyze_layout(page_image: Image) : dict
    }
    
    StoryGenerator "1" -- "1" ComicGenerator : Provides panel descriptions
    ComicGenerator "1" -- "1" GeminiVision : Uses for layout analysis

2.2. Data Flow

The end-to-end data flow illustrates the interaction between the user, the application, and the underlying AI models.

sequenceDiagram
    participant User
    participant App
    participant Gemini as Gemini (Text/Story)
    participant GPTImage as GPT-Image (Visuals)
    participant GeminiVision as Gemini Vision (Analysis)

    User->>+App: Submits story description
    App->>+Gemini: Requests story structure from description
    Gemini-->>-App: Returns panel-by-panel text descriptions
    App->>+GPTImage: Requests comic page generation from descriptions
    GPTImage-->>-App: Returns single full-page image
    App->>+GeminiVision: Requests layout analysis of the image
    GeminiVision-->>-App: Returns coordinates of each panel
    App-->>-User: Displays final, split-panel comic

3. Setup & Installation

3.1. Prerequisites

Python: Version 3.9 or higher.
API Keys:
- An active OpenAI API key.
- An active Google API key with access to the Gemini family of models.

3.2. Installation Steps

Clone the Repository:

git clone https://github.com/yourusername/Comic-Story-Generator.git
cd Comic-Story-Generator

Create and Activate a Virtual Environment:

# Create the environment
python -m venv venv

# Activate the environment (macOS/Linux)
source venv/bin/activate

# Or, activate on Windows
# venv\Scripts\activate

Install Dependencies:
```
pip install -r requirements.txt
```
Configure Environment Variables: Create a .env file in the project root and add your API keys.
```
echo "OPENAI_API_KEY=your_openai_key" > .env
echo "GOOGLE_API_KEY=your_google_key" >> .env
```
Note: Ensure the .env file is added to your .gitignore file to prevent committing secrets.

4. Environment Variables / Secrets

The application requires the following environment variables to be set in a .env file at the project's root.

| Variable | Description | Required | Example |
| --- | --- | --- | --- |
| `OPENAI_API_KEY` | API key for the OpenAI service, used for GPT-Image generation. | Yes | `sk-xxxxxxxxxxxxxxxxxxxxxxxx` |
| `GOOGLE_API_KEY` | API key for Google AI services, used for Gemini (story structure) and Gemini Vision (layout analysis). | Yes | `AIzaSyxxxxxxxxxxxxxxxxxxxxx` |
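A startup check for the two required variables might look like the sketch below. The helper name `missing_keys` is hypothetical. In the application it would presumably run after python-dotenv's `load_dotenv()` has copied the `.env` values into `os.environ`.

```python
import os
from typing import List, Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "sk-test"}))  # ['GOOGLE_API_KEY']
```

Reporting every missing key at once is friendlier than failing on the first API call that needs one.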

5. How to Run

After completing the setup and installation steps, launch the application with the following command from the project's root directory:

python app.py

The application will start a local web server, and the interface will be accessible at the URL provided in the console (typically http://127.0.0.1:7860).

6. Deployment Instructions

[TODO] This section requires documentation for deploying the application to a production environment. Steps should include:

Recommended hosting provider (e.g., AWS, Heroku, DigitalOcean).
Instructions for setting up a production-grade web server (e.g., Gunicorn).
Configuration of a reverse proxy (e.g., Nginx).
Management of production environment variables/secrets.
Process management (e.g., using systemd).

7. Core Components & Logic

The application logic is encapsulated in three main classes.

7.1. `StoryGenerator`

Responsibility: Handles the narrative creation phase.
generate_story(): Takes the raw user description as input. It constructs a prompt for the Gemini model to elicit a structured response containing a list of detailed text descriptions, one for each comic panel.
enhance_visuals(): Processes the panel descriptions to add specific visual cues and optimizations, particularly for the "Accessible" style, ensuring high contrast and simplified object representation.
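As an illustration of the prompt-and-parse step described for `generate_story()`, the sketch below asks for a JSON array and validates the reply. The prompt wording, the JSON response format, and both helper names are assumptions; the actual prompt and parsing live in the project source.

```python
import json
from typing import List

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in


def build_story_prompt(description: str, panel_count: int) -> str:
    """Ask for exactly `panel_count` textless panel descriptions as JSON."""
    return (
        "You are writing a textless comic.\n"
        f"Story concept: {description}\n"
        f"Return a JSON array of exactly {panel_count} strings, "
        "one visual scene description per panel, with no dialogue or captions."
    )


def parse_panel_descriptions(response_text: str, panel_count: int) -> List[str]:
    """Parse the model's reply, tolerating a surrounding Markdown fence."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and everything from the closing fence on.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    panels = json.loads(text)
    if not isinstance(panels, list) or len(panels) != panel_count:
        raise ValueError(f"expected {panel_count} panel descriptions")
    return [str(p).strip() for p in panels]
```

Checking the panel count at this point catches a short or malformed reply before any image generation is requested.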

7.2. `ComicGenerator`

Responsibility: Manages the visual generation and processing of the comic page.
generate_page(): Aggregates the panel descriptions from StoryGenerator into a single, complex prompt for the GPT-Image model. This prompt instructs the AI to create one composite image with all panels laid out in a grid.
split_panels(): Receives the generated page image and the layout data from GeminiVision. It uses this data to crop the page into individual panel images with high precision.
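The cropping step in `split_panels()` reduces to turning layout entries into crop boxes. The sketch below assumes a layout dictionary of the form `{"panels": [{"x", "y", "width", "height"}, ...]}` in pixels; that schema and the `panel_boxes` helper are hypothetical, since the handover does not specify the dictionary's keys.

```python
from typing import Dict, List, Tuple

# (left, upper, right, lower): the box order Pillow's Image.crop expects.
Box = Tuple[int, int, int, int]


def panel_boxes(layout: Dict, page_width: int, page_height: int) -> List[Box]:
    """Convert layout entries to crop boxes clamped to the page bounds."""
    boxes = []
    for panel in layout.get("panels", []):
        left = max(0, int(panel["x"]))
        upper = max(0, int(panel["y"]))
        right = min(page_width, int(panel["x"] + panel["width"]))
        lower = min(page_height, int(panel["y"] + panel["height"]))
        if right > left and lower > upper:  # skip empty or inverted boxes
            boxes.append((left, upper, right, lower))
    return boxes


layout = {"panels": [
    {"x": 0, "y": 0, "width": 512, "height": 512},
    {"x": 512, "y": 0, "width": 600, "height": 512},  # overshoots the right edge
]}
boxes = panel_boxes(layout, page_width=1024, page_height=1024)
# With Pillow: panels = [page_image.crop(box) for box in boxes]
```

Clamping matters because model-reported coordinates can fall slightly outside the image.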

7.3. `GeminiVision`

Responsibility: Performs visual analysis on the generated comic page.
analyze_layout(): This is the core of the intelligent panel-splitting feature. It takes the full-page image as input and uses the Gemini Vision model to visually identify the boundaries of each panel. It returns a dictionary containing the coordinates and dimensions of the detected grid, which is more robust than assuming a fixed grid layout.
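Because the layout comes from a model reply, it is worth validating before use. The sketch below parses a reply into a layout dictionary and, as one possible fallback when detection fails, builds a uniform grid. The `{"panels": [{"x", "y", "width", "height"}]}` schema, the helper names, and the fallback itself are assumptions and not confirmed behavior of `analyze_layout()`.

```python
import json
import math
from typing import Dict, Optional

REQUIRED_FIELDS = {"x", "y", "width", "height"}


def parse_layout(response_text: str, expected_panels: int) -> Optional[Dict]:
    """Parse the outermost {...} span of the reply; None if unusable."""
    start, end = response_text.find("{"), response_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        layout = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    panels = layout.get("panels") if isinstance(layout, dict) else None
    if not isinstance(panels, list) or len(panels) != expected_panels:
        return None
    if not all(isinstance(p, dict) and REQUIRED_FIELDS.issubset(p) for p in panels):
        return None
    return layout


def uniform_grid(panel_count: int, page_width: int, page_height: int) -> Dict:
    """Fallback layout: a near-square grid of equal cells."""
    cols = math.ceil(math.sqrt(panel_count))
    rows = math.ceil(panel_count / cols)
    cell_w, cell_h = page_width // cols, page_height // rows
    panels = [
        {"x": (i % cols) * cell_w, "y": (i // cols) * cell_h,
         "width": cell_w, "height": cell_h}
        for i in range(panel_count)
    ]
    return {"panels": panels}
```

A caller could use `parse_layout(reply, n) or uniform_grid(n, width, height)` so that a bad reply degrades to an approximate split instead of an error.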

8. Third-party Dependencies

The complete list of Python packages is specified in requirements.txt. Key dependencies include:

openai: Python client for the OpenAI API.
google-generativeai: Python client for the Google AI (Gemini) API.
python-dotenv: For loading environment variables from the .env file.
Pillow: For image manipulation (cropping and saving).
[Info Needed]: The web framework used to build app.py (e.g., gradio, flask, fastapi).

9. Testing Instructions

[TODO] A testing framework has not been established for this project. Future work should include:

Test Suite Setup: Choose and configure a testing framework (e.g., pytest).
Unit Tests: Create unit tests for individual methods in StoryGenerator, ComicGenerator, and GeminiVision. This should involve mocking the API calls to AI services to test the data processing logic in isolation.
Integration Tests: Develop tests for the entire generation pipeline, from user input to final split panels.
Continuous Integration: Set up a CI pipeline (e.g., using GitHub Actions) to run tests automatically on pull requests.
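A unit test with a mocked AI client could start from the sketch below. The `StoryGenerator` shown here is a simplified stand-in with an injected client, and `client.generate` is a hypothetical method name; the real class may be constructed differently. The test is a plain function with bare assertions, which pytest collects as-is.

```python
from unittest.mock import Mock


class StoryGenerator:
    """Stand-in for the real class, with the AI client injected."""

    def __init__(self, client):
        self.client = client

    def generate_story(self, description: str) -> list:
        reply = self.client.generate(f"Break this story into panels: {description}")
        # One panel description per non-empty line of the reply.
        return [line.strip() for line in reply.splitlines() if line.strip()]


def test_generate_story_splits_reply_into_panels():
    client = Mock()
    client.generate.return_value = "A cat wakes up.\n\n  The cat finds wings.  \n"

    panels = StoryGenerator(client).generate_story("a cat learns to fly")

    assert panels == ["A cat wakes up.", "The cat finds wings."]
    client.generate.assert_called_once()
```

Injecting the client keeps the data-processing logic testable without network access or API keys.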

10. Troubleshooting & Common Issues

[TODO] This section should be populated as common issues are identified. Potential areas to document include:

API Key Errors: Steps to verify that API keys are correctly configured and have the necessary permissions.
Incoherent Stories: Guidance on how to write effective initial descriptions to improve narrative quality.
Poor Panel Splitting: Troubleshooting steps for when Gemini Vision fails to detect the layout correctly (e.g., checking image complexity, trying a different art style).
Long Generation Times: Explanation of typical performance and factors that can cause delays (e.g., API provider latency, number of panels).

11. TODOs / Future Work

Based on the project's focus areas, the following are key areas for future development and contribution:

Core Generation Logic:
- Improve character consistency across multiple pages.
- Experiment with different AI models for potentially better visual or narrative results.
- Add support for including text (dialogue, captions) as an optional feature.
UI/UX Enhancements:
- Develop a more interactive interface for viewing and arranging panels.
- Allow users to regenerate individual panels without restarting the entire process.
- Add an option to export the final comic as a PDF or other formats.
Accessibility Improvements:
- Further refine the "Accessible" art style based on user feedback.
- Implement ARIA attributes and ensure full keyboard navigability for the web interface.
- Add an "image description" feature in which text descriptions of the generated panels are read aloud by a text-to-speech engine.
Documentation:
- Create a detailed API reference for developers looking to build on the platform.
- Write user-facing guides on how to get the best results from the generator.

12. Contact / Ownership Info

Source Code: https://github.com/yourusername/Comic-Story-Generator
License: This project is licensed under the MIT License. For full details, see the LICENSE file in the repository.
Primary Contact: [Info Needed: Add primary maintainer's name and contact information (e.g., GitHub handle or email).]