Comic Story Generator: Code Handover Document
Date: 2025-7-22
Document Purpose: This document provides a comprehensive technical handover for the Comic Story Generator project. It is intended for developers and future maintainers responsible for the deployment, maintenance, and extension of the application.
1. Project Overview
The Comic Story Generator is a web application that automatically creates multi-page, textless comic stories from a user-provided description. The application leverages generative AI to produce visually coherent narratives, focusing on character consistency, expressive emotion, and logical panel sequencing.
1.1. Core Functionality
The application is designed to translate a textual story concept into a purely visual comic strip. Key characteristics include:
- AI-Powered Narrative: Utilizes Google's Gemini to interpret the user's concept and break it down into a structured, panel-by-panel narrative.
- Visual Generation: Employs OpenAI's GPT-Image model to render complete comic pages based on the AI-generated narrative structure.
- Intelligent Panel Detection: Uses Gemini Vision to analyze the generated full-page image and accurately detect the boundaries of each panel, ensuring precise splitting.
- Customization: Offers users control over the output, including:
  - Layout: Choice of panel count (from 4 to 24).
  - Length: Generation of 1 to 10 pages.
  - Art Style: A selection of visual styles, including "Classic Comic," "Manga," "Cartoon," "Digital Paint," and a high-contrast "Accessible" style designed for users with special needs.
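The customization limits above could be enforced with a small validated settings object. This is a minimal sketch: the `ComicOptions` class and its field names are hypothetical and do not come from the codebase; only the numeric ranges and style names are taken from the list above.

```python
from dataclasses import dataclass

# Ranges and style names as stated in the feature list above.
MIN_PANELS, MAX_PANELS = 4, 24
MIN_PAGES, MAX_PAGES = 1, 10
ART_STYLES = ("Classic Comic", "Manga", "Cartoon", "Digital Paint", "Accessible")


@dataclass(frozen=True)
class ComicOptions:
    """Hypothetical container for user-selected generation settings."""

    panel_count: int
    pages: int
    art_style: str

    def __post_init__(self) -> None:
        # Reject out-of-range values before any API call is made.
        if not MIN_PANELS <= self.panel_count <= MAX_PANELS:
            raise ValueError(
                f"panel_count must be between {MIN_PANELS} and {MAX_PANELS}"
            )
        if not MIN_PAGES <= self.pages <= MAX_PAGES:
            raise ValueError(f"pages must be between {MIN_PAGES} and {MAX_PAGES}")
        if self.art_style not in ART_STYLES:
            raise ValueError(f"art_style must be one of {ART_STYLES}")
```

Validating at construction time means a bad request fails immediately instead of after a slow and billable generation call.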
1.2. High-Level Workflow
The generation process follows a clear, multi-step pipeline:
- User Input: The user submits a short description of the desired story.
- Story Generation: The StoryGenerator component uses Gemini to create a detailed, scene-by-scene description for each comic panel.
- Page Generation: The ComicGenerator takes the panel descriptions and instructs the GPT-Image model to generate a single, composite image representing a full comic page with panels arranged in a grid.
- Layout Analysis: The generated page is passed to the GeminiVision component, which analyzes the image to identify the precise coordinates and boundaries of each panel.
- Panel Splitting: The application uses the coordinates from the vision analysis to accurately split the composite image into individual panel images.
- Final Output: The processed panels are presented to the user as a complete, multi-page visual story.
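The steps above can be sketched as a single function that chains the four stages for one page. The `run_pipeline` name is hypothetical, and the stages are passed in as plain callables (named after the methods in the class diagram) so the sketch runs without API access; the real application may wire the classes together differently.

```python
from typing import Any, Callable, Dict, List


def run_pipeline(
    description: str,
    generate_story: Callable[[str], List[str]],
    generate_page: Callable[[List[str]], Any],
    analyze_layout: Callable[[Any], Dict],
    split_panels: Callable[[Any, Dict], List[Any]],
) -> List[Any]:
    """Run one page through the generation steps in order."""
    panel_descriptions = generate_story(description)  # Story Generation
    page_image = generate_page(panel_descriptions)    # Page Generation
    layout = analyze_layout(page_image)               # Layout Analysis
    return split_panels(page_image, layout)           # Panel Splitting


# Stand-in callables (not the real components) to show the data flow.
panels = run_pipeline(
    "a cat's morning",
    generate_story=lambda d: [f"{d}: scene {i}" for i in range(1, 5)],
    generate_page=lambda descs: {"page": descs},
    analyze_layout=lambda page: {"panels": len(page["page"])},
    split_panels=lambda page, layout: page["page"][: layout["panels"]],
)
print(len(panels))
```

Passing the stages as arguments also makes it straightforward to substitute mocks for the AI services in tests.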
2. System Architecture
The application is built on a modular architecture composed of three primary classes, each responsible for a distinct part of the generation pipeline.
2.1. System Diagram
classDiagram
class StoryGenerator{
+generate_story(description: string) : list[string]
+enhance_visuals(panel_descriptions: list) : list[string]
}
class ComicGenerator{
+generate_page(panel_descriptions: list) : Image
+split_panels(page_image: Image, grid_layout: dict) : list[Image]
}
class GeminiVision{
+analyze_layout(page_image: Image) : dict
}
StoryGenerator "1" -- "1" ComicGenerator : Provides panel descriptions
ComicGenerator "1" -- "1" GeminiVision : Uses for layout analysis
2.2. Data Flow
The end-to-end data flow illustrates the interaction between the user, the application, and the underlying AI models.
sequenceDiagram
participant User
participant App
participant Gemini as Gemini (Text/Story)
participant GPTImage as GPT-Image (Visuals)
participant GeminiVision as Gemini Vision (Analysis)
User->>+App: Submits story description
App->>+Gemini: Requests story structure from description
Gemini-->>-App: Returns panel-by-panel text descriptions
App->>+GPTImage: Requests comic page generation from descriptions
GPTImage-->>-App: Returns single full-page image
App->>+GeminiVision: Requests layout analysis of the image
GeminiVision-->>-App: Returns coordinates of each panel
App->>User: Displays final, split-panel comic
3. Setup & Installation
3.1. Prerequisites
- Python: Version 3.9 or higher.
- API Keys:
- An active OpenAI API key.
- An active Google API key with access to the Gemini family of models.
3.2. Installation Steps
Clone the Repository:

    git clone https://github.com/yourusername/Comic-Story-Generator.git
    cd Comic-Story-Generator

Create and Activate a Virtual Environment:

    # Create the environment
    python -m venv venv
    # Activate the environment (macOS/Linux)
    source venv/bin/activate
    # Or, activate on Windows
    # venv\Scripts\activate

Install Dependencies:

    pip install -r requirements.txt

Configure Environment Variables: Create a .env file in the project root and add your API keys.

    echo "OPENAI_API_KEY=your_openai_key" > .env
    echo "GOOGLE_API_KEY=your_google_key" >> .env

Note: Ensure the .env file is added to your .gitignore file to prevent committing secrets.
4. Environment Variables / Secrets
The application requires the following environment variables to be set in a .env file at the project's root.
| Variable | Description | Required | Example |
|---|---|---|---|
| OPENAI_API_KEY | API key for the OpenAI service, used for GPT-Image generation. | Yes | sk-xxxxxxxxxxxxxxxxxxxxxxxx |
| GOOGLE_API_KEY | API key for Google AI services, used for Gemini (story structure) and Gemini Vision (layout analysis). | Yes | AIzaSyxxxxxxxxxxxxxxxxxxxxx |
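A startup check for these two variables might look like the sketch below. The helper names `missing_keys` and `require_keys` are hypothetical; whether app.py performs such a check is not documented here. The sketch assumes the .env file has already been loaded into the process environment, for example with python-dotenv's `load_dotenv()`.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def require_keys(env: Optional[Mapping[str, str]] = None) -> None:
    """Raise a clear error at startup instead of failing mid-generation."""
    missing = missing_keys(env)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling `require_keys()` once before the web server starts would surface a misconfigured environment immediately.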
5. How to Run
After completing the setup and installation steps, launch the application with the following command from the project's root directory:
python app.py
The application will start a local web server, and the interface will be accessible at the URL provided in the console (typically http://127.0.0.1:7860).
6. Deployment Instructions
[TODO] This section requires documentation for deploying the application to a production environment. Steps should include:
- Recommended hosting provider (e.g., AWS, Heroku, DigitalOcean).
- Instructions for setting up a production-grade web server (e.g., Gunicorn).
- Configuration of a reverse proxy (e.g., Nginx).
- Management of production environment variables/secrets.
- Process management (e.g., using systemd).
7. Core Components & Logic
The application logic is encapsulated in three main classes.
7.1. StoryGenerator
- Responsibility: Handles the narrative creation phase.
- generate_story(): Takes the raw user description as input. It constructs a prompt for the Gemini model to elicit a structured response containing a list of detailed text descriptions, one for each comic panel.
- enhance_visuals(): Processes the panel descriptions to add specific visual cues and optimizations, particularly for the "Accessible" style, ensuring high contrast and simplified object representation.
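To illustrate the prompt-and-parse pattern described for generate_story(), here is a minimal sketch. The prompt wording, the JSON-array response format, and both function names are assumptions made for illustration; the actual prompt and response handling in StoryGenerator may differ.

```python
import json
from typing import List


def build_story_prompt(description: str, panel_count: int) -> str:
    """Build a prompt asking for one visual description per panel (assumed wording)."""
    return (
        "You are planning a textless comic.\n"
        f"Story concept: {description.strip()}\n"
        f"Return a JSON array of exactly {panel_count} strings, "
        "one visual scene description per panel, with no dialogue or captions."
    )


def parse_panel_descriptions(response_text: str, panel_count: int) -> List[str]:
    """Validate the model's reply, assuming it is a JSON array of strings."""
    panels = json.loads(response_text)
    if not isinstance(panels, list) or len(panels) != panel_count:
        raise ValueError(f"expected a list of {panel_count} panel descriptions")
    if not all(isinstance(p, str) and p.strip() for p in panels):
        raise ValueError("every panel description must be a non-empty string")
    return [p.strip() for p in panels]
```

Requesting a strict format and validating the reply keeps malformed model output from reaching the image generation step.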
7.2. ComicGenerator
- Responsibility: Manages the visual generation and processing of the comic page.
- generate_page(): Aggregates the panel descriptions from StoryGenerator into a single, complex prompt for the GPT-Image model. This prompt instructs the AI to create one composite image with all panels laid out in a grid.
- split_panels(): Receives the generated page image and the layout data from GeminiVision. It uses this data to crop the page into individual panel images with high precision.
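The cropping step in split_panels() can be sketched as a conversion from layout data to crop boxes. The layout schema used here (a "panels" list of dicts with pixel "x", "y", "width", and "height") and the function name are assumptions; the real dictionary returned by GeminiVision is not documented in this handover.

```python
from typing import Dict, List, Tuple

# (left, upper, right, lower) in pixels, the box order Pillow's Image.crop expects.
Box = Tuple[int, int, int, int]


def layout_to_crop_boxes(layout: Dict, page_width: int, page_height: int) -> List[Box]:
    """Turn an assumed layout dict into crop boxes clamped to the page bounds."""
    boxes: List[Box] = []
    for panel in layout["panels"]:
        left = max(0, int(panel["x"]))
        upper = max(0, int(panel["y"]))
        right = min(page_width, int(panel["x"] + panel["width"]))
        lower = min(page_height, int(panel["y"] + panel["height"]))
        if right <= left or lower <= upper:
            continue  # skip empty or inverted boxes rather than crop nothing
        boxes.append((left, upper, right, lower))
    return boxes
```

With Pillow, each box could then be applied as `page_image.crop(box)`. Clamping guards against coordinates that extend slightly past the image edge.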
7.3. GeminiVision
- Responsibility: Performs visual analysis on the generated comic page.
- analyze_layout(): This is the core of the intelligent panel-splitting feature. It takes the full-page image as input and uses the Gemini Vision model to visually identify the boundaries of each panel. It returns a dictionary containing the coordinates and dimensions of the detected grid, which is more robust than assuming a fixed grid layout.
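Because the layout comes from a model response, a defensive caller may want a fallback when the response cannot be used. The sketch below assumes the response is JSON text with a "panels" list and falls back to an evenly divided grid; the function names, the schema, and the fallback itself are illustrative assumptions, not documented behavior of analyze_layout().

```python
import json
from typing import Dict


def uniform_grid(rows: int, cols: int, page_width: int, page_height: int) -> Dict:
    """Divide the page into equal cells, ordered left to right, top to bottom."""
    cell_w = page_width // cols
    cell_h = page_height // rows
    return {
        "panels": [
            {"x": c * cell_w, "y": r * cell_h, "width": cell_w, "height": cell_h}
            for r in range(rows)
            for c in range(cols)
        ]
    }


def layout_or_fallback(
    response_text: str, rows: int, cols: int, page_width: int, page_height: int
) -> Dict:
    """Use the detected layout if it parses and has the expected panel count."""
    try:
        layout = json.loads(response_text)
        panels = layout["panels"]
        if isinstance(panels, list) and len(panels) == rows * cols:
            return layout
    except (json.JSONDecodeError, KeyError, TypeError):
        pass  # unparseable or unexpected shape: fall through to the grid
    return uniform_grid(rows, cols, page_width, page_height)
```

The fixed grid is less precise than the detected boundaries, so it is best treated as a last resort that still yields usable panels.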
8. Third-party Dependencies
The complete list of Python packages is specified in requirements.txt. Key dependencies include:
- openai: Python client for the OpenAI API.
- google-generativeai: Python client for the Google AI (Gemini) API.
- python-dotenv: For loading environment variables from the .env file.
- Pillow: For image manipulation (cropping and saving).
- [Info Needed]: The web framework used to build app.py (e.g., gradio, flask, fastapi).
9. Testing Instructions
[TODO] A testing framework has not been established for this project. Future work should include:
- Test Suite Setup: Choose and configure a testing framework (e.g., pytest).
- Unit Tests: Create unit tests for individual methods in StoryGenerator, ComicGenerator, and GeminiVision. This should involve mocking the API calls to AI services to test the data processing logic in isolation.
- Integration Tests: Develop tests for the entire generation pipeline, from user input to final split panels.
- Continuous Integration: Set up a CI pipeline (e.g., using GitHub Actions) to run tests automatically on pull requests.
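As a starting point for the unit tests suggested above, the sketch below mocks the AI call with `unittest.mock`. `StoryGeneratorStandIn` is a hypothetical stand-in written for this example; the real StoryGenerator's constructor and the way it calls Gemini are assumptions and will need adapting.

```python
import json
from typing import List
from unittest.mock import Mock


class StoryGeneratorStandIn:
    """Hypothetical stand-in; the real StoryGenerator may be structured differently."""

    def __init__(self, model) -> None:
        # Assumed interface: an object exposing generate(prompt) -> str.
        self.model = model

    def generate_story(self, description: str) -> List[str]:
        raw = self.model.generate(f"Describe comic panels for: {description}")
        return json.loads(raw)


def test_generate_story_parses_model_reply_without_network() -> None:
    # The mock replaces the AI service, so only the parsing logic is exercised.
    model = Mock()
    model.generate.return_value = '["A cat wakes up", "The cat finds a fish"]'

    panels = StoryGeneratorStandIn(model).generate_story("a cat's morning")

    assert panels == ["A cat wakes up", "The cat finds a fish"]
    model.generate.assert_called_once()
```

A function named `test_...` with plain assertions is collected by pytest without further setup, and injecting the model object keeps the test fast and free of API costs.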
10. Troubleshooting & Common Issues
[TODO] This section should be populated as common issues are identified. Potential areas to document include:
- API Key Errors: Steps to verify that API keys are correctly configured and have the necessary permissions.
- Incoherent Stories: Guidance on how to write effective initial descriptions to improve narrative quality.
- Poor Panel Splitting: Troubleshooting steps for when Gemini Vision fails to detect the layout correctly (e.g., checking image complexity, trying a different art style).
- Long Generation Times: Explanation of typical performance and factors that can cause delays (e.g., API provider latency, number of panels).
11. TODOs / Future Work
The following are the key areas for future development and contribution, grouped by the project's focus areas:
- Core Generation Logic:
- Improve character consistency across multiple pages.
- Experiment with different AI models for potentially better visual or narrative results.
- Add support for including text (dialogue, captions) as an optional feature.
- UI/UX Enhancements:
- Develop a more interactive interface for viewing and arranging panels.
- Allow users to regenerate individual panels without restarting the entire process.
- Add an option to export the final comic as a PDF or other formats.
- Accessibility Improvements:
- Further refine the "Accessible" art style based on user feedback.
- Implement ARIA attributes and ensure full keyboard navigability for the web interface.
- Add an "image description" feature where a text-to-speech engine can describe the generated panels.
- Documentation:
- Create a detailed API reference for developers looking to build on the platform.
- Write user-facing guides on how to get the best results from the generator.
12. Contact / Ownership Info
- Source Code: https://github.com/yourusername/Comic-Story-Generator
- License: This project is licensed under the MIT License. For full details, see the LICENSE file in the repository.
- Primary Contact: [Info Needed: Add primary maintainer's name and contact information (e.g., GitHub handle or email).]