| # Comic Story Generator: Code Handover Document | |
| **Date:** 2025-07-22 | |
| **Document Purpose:** This document provides a comprehensive technical handover for the Comic Story Generator project. It is intended for developers and future maintainers responsible for the deployment, maintenance, and extension of the application. | |
| --- | |
| ## 1. Project Overview | |
| The Comic Story Generator is a web application that automatically creates multi-page, textless comic stories from a user-provided description. The application leverages generative AI to produce visually coherent narratives, focusing on character consistency, expressive emotion, and logical panel sequencing. | |
| ### 1.1. Core Functionality | |
| The application is designed to translate a textual story concept into a purely visual comic strip. Key characteristics include: | |
| * **AI-Powered Narrative:** Utilizes Google's Gemini to interpret the user's concept and break it down into a structured, panel-by-panel narrative. | |
| * **Visual Generation:** Employs OpenAI's GPT-Image model to render complete comic pages based on the AI-generated narrative structure. | |
| * **Intelligent Panel Detection:** Uses Gemini Vision to analyze the generated full-page image and accurately detect the boundaries of each panel, ensuring precise splitting. | |
| * **Customization:** Offers users control over the output, including: | |
|   * **Layout:** Choice of panel count (from 4 to 24). | |
|   * **Length:** Generation of 1 to 10 pages. | |
|   * **Art Style:** A selection of visual styles, including "Classic Comic," "Manga," "Cartoon," "Digital Paint," and a high-contrast "Accessible" style designed for users with special needs. | |
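The limits listed above can be enforced before any API call is made. The following is a minimal sketch, not the project's actual code: the `ComicSettings` class and its field names are hypothetical, and only the numeric ranges and style names come from this document.

```python
from dataclasses import dataclass

# Style names as listed in this document; the constant name is illustrative.
ART_STYLES = ("Classic Comic", "Manga", "Cartoon", "Digital Paint", "Accessible")


@dataclass(frozen=True)
class ComicSettings:
    """Hypothetical container for the user-facing generation options."""

    panel_count: int = 4
    pages: int = 1
    art_style: str = "Classic Comic"

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges as early as possible.
        if not 4 <= self.panel_count <= 24:
            raise ValueError("panel_count must be between 4 and 24")
        if not 1 <= self.pages <= 10:
            raise ValueError("pages must be between 1 and 10")
        if self.art_style not in ART_STYLES:
            raise ValueError(f"unknown art style: {self.art_style!r}")
```

Validating at construction time means a bad request fails immediately instead of after a slow, billable model call.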
| ### 1.2. High-Level Workflow | |
| The generation process follows a clear, multi-step pipeline: | |
| 1. **User Input:** The user submits a short description of the desired story. | |
| 2. **Story Generation:** The `StoryGenerator` component uses Gemini to create a detailed, scene-by-scene description for each comic panel. | |
| 3. **Page Generation:** The `ComicGenerator` takes the panel descriptions and instructs the GPT-Image model to generate a single, composite image representing a full comic page with panels arranged in a grid. | |
| 4. **Layout Analysis:** The generated page is passed to the `GeminiVision` component, which analyzes the image to identify the precise coordinates and boundaries of each panel. | |
| 5. **Panel Splitting:** The application uses the coordinates from the vision analysis to accurately split the composite image into individual panel images. | |
| 6. **Final Output:** The processed panels are presented to the user as a complete, multi-page visual story. | |
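Steps 2 through 5 can be read as a chain of four calls. The sketch below assumes each stage is passed in as a plain callable so the flow can be exercised without network access; `run_pipeline` is a hypothetical helper, and the stand-in lambdas only illustrate the shape of the data, not the real model output.

```python
from typing import Callable, Dict, List


def run_pipeline(
    description: str,
    generate_story: Callable[[str], List[str]],
    generate_page: Callable[[List[str]], object],
    analyze_layout: Callable[[object], Dict],
    split_panels: Callable[[object, Dict], List[object]],
) -> List[object]:
    """Chain the four processing stages for a single page."""
    panel_descriptions = generate_story(description)  # step 2: story generation
    page_image = generate_page(panel_descriptions)    # step 3: page generation
    layout = analyze_layout(page_image)               # step 4: layout analysis
    return split_panels(page_image, layout)           # step 5: panel splitting


if __name__ == "__main__":
    # Stand-in stages; the real ones would call Gemini and GPT-Image.
    panels = run_pipeline(
        "A cat learns to fly",
        generate_story=lambda d: [f"{d}: panel {i}" for i in range(1, 5)],
        generate_page=lambda descs: {"panel_total": len(descs)},
        analyze_layout=lambda page: {"panels": list(range(page["panel_total"]))},
        split_panels=lambda page, layout: [f"panel-{i}" for i in layout["panels"]],
    )
    print(panels)
```

Passing the stages in as arguments is a design choice made for this sketch; it keeps the orchestration testable with fakes.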
| --- | |
| ## 2. System Architecture | |
| The application is built on a modular architecture composed of three primary classes, each responsible for a distinct part of the generation pipeline. | |
| ### 2.1. System Diagram | |
| ```mermaid | |
| classDiagram | |
| class StoryGenerator{ | |
| +generate_story(description: string) : list[string] | |
| +enhance_visuals(panel_descriptions: list) : list[string] | |
| } | |
| class ComicGenerator{ | |
| +generate_page(panel_descriptions: list) : Image | |
| +split_panels(page_image: Image, grid_layout: dict) : list[Image] | |
| } | |
| class GeminiVision{ | |
| +analyze_layout(page_image: Image) : dict | |
| } | |
| StoryGenerator "1" -- "1" ComicGenerator : Provides panel descriptions | |
| ComicGenerator "1" -- "1" GeminiVision : Uses for layout analysis | |
| ``` | |
| ### 2.2. Data Flow | |
| The end-to-end data flow illustrates the interaction between the user, the application, and the underlying AI models. | |
| ```mermaid | |
| sequenceDiagram | |
| participant User | |
| participant App | |
| participant Gemini as Gemini (Text/Story) | |
| participant GPTImage as GPT-Image (Visuals) | |
| participant GeminiVision as Gemini Vision (Analysis) | |
| User->>+App: Submits story description | |
| App->>+Gemini: Requests story structure from description | |
| Gemini-->>-App: Returns panel-by-panel text descriptions | |
| App->>+GPTImage: Requests comic page generation from descriptions | |
| GPTImage-->>-App: Returns single full-page image | |
| App->>+GeminiVision: Requests layout analysis of the image | |
| GeminiVision-->>-App: Returns coordinates of each panel | |
| App-->>-User: Displays final, split-panel comic | |
| ``` | |
| --- | |
| ## 3. Setup & Installation | |
| ### 3.1. Prerequisites | |
| * **Python:** Version 3.9 or higher. | |
| * **API Keys:** | |
|   * An active OpenAI API key. | |
|   * An active Google API key with access to the Gemini family of models. | |
| ### 3.2. Installation Steps | |
| 1. **Clone the Repository:** | |
| ```bash | |
| git clone https://github.com/yourusername/Comic-Story-Generator.git | |
| cd Comic-Story-Generator | |
| ``` | |
| 2. **Create and Activate a Virtual Environment:** | |
| ```bash | |
| # Create the environment | |
| python -m venv venv | |
| # Activate the environment (macOS/Linux) | |
| source venv/bin/activate | |
| # Or, activate on Windows | |
| # venv\Scripts\activate | |
| ``` | |
| 3. **Install Dependencies:** | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 4. **Configure Environment Variables:** | |
| Create a `.env` file in the project root and add your API keys. | |
| ```bash | |
| # Note: the first command overwrites any existing .env file | |
| echo "OPENAI_API_KEY=your_openai_key" > .env | |
| echo "GOOGLE_API_KEY=your_google_key" >> .env | |
| ``` | |
| *Note: Ensure the `.env` file is added to your `.gitignore` file to prevent committing secrets.* | |
| --- | |
| ## 4. Environment Variables / Secrets | |
| The application requires the following environment variables to be set in a `.env` file at the project's root. | |
| | Variable | Description | Required | Example | | |
| | :--- | :--- | :--- | :--- | | |
| | `OPENAI_API_KEY` | API key for the OpenAI service, used for GPT-Image generation. | Yes | `sk-xxxxxxxxxxxxxxxxxxxxxxxx` | | |
| | `GOOGLE_API_KEY` | API key for Google AI services, used for Gemini (story structure) and Gemini Vision (layout analysis). | Yes | `AIzaSyxxxxxxxxxxxxxxxxxxxxx` | | |
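A startup check can report missing keys before the first API call fails. This is a minimal sketch using only the standard library: `missing_keys` is a hypothetical helper, and it assumes the `.env` file has already been loaded into the process environment (for example with `load_dotenv()` from `python-dotenv`).

```python
import os
from typing import List, Mapping, Optional

# Variable names from the table above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All required API keys are set.")
```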
| --- | |
| ## 5. How to Run | |
| After completing the setup and installation steps, launch the application with the following command from the project's root directory: | |
| ```bash | |
| python app.py | |
| ``` | |
| The application will start a local web server, and the interface will be accessible at the URL provided in the console (typically `http://127.0.0.1:7860`). | |
| --- | |
| ## 6. Deployment Instructions | |
| [TODO] This section requires documentation for deploying the application to a production environment. Steps should include: | |
| * Recommended hosting provider (e.g., AWS, Heroku, DigitalOcean). | |
| * Instructions for setting up a production-grade web server (e.g., Gunicorn). | |
| * Configuration of a reverse proxy (e.g., Nginx). | |
| * Management of production environment variables/secrets. | |
| * Process management (e.g., using `systemd`). | |
| --- | |
| ## 7. Core Components & Logic | |
| The application logic is encapsulated in three main classes. | |
| ### 7.1. `StoryGenerator` | |
| * **Responsibility:** Handles the narrative creation phase. | |
| * **`generate_story()`:** Takes the raw user description as input. It constructs a prompt for the Gemini model to elicit a structured response containing a list of detailed text descriptions, one for each comic panel. | |
| * **`enhance_visuals()`:** Processes the panel descriptions to add specific visual cues and optimizations, particularly for the "Accessible" style, ensuring high contrast and simplified object representation. | |
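The prompt-and-parse pattern behind `generate_story()` can be sketched as follows. The prompt wording, the JSON-array response format, and both helper names are assumptions made for illustration; the actual prompt and parsing logic live in the repository and may differ.

```python
import json
from typing import List


def build_story_prompt(description: str, panel_count: int) -> str:
    """Build a prompt asking for one visual description per panel."""
    return (
        f"Break the following story concept into exactly {panel_count} comic panels. "
        "The comic is textless, so describe only what is visible. "
        "Respond with a JSON array of strings, one description per panel.\n\n"
        f"Concept: {description}"
    )


def parse_panel_descriptions(raw: str, panel_count: int) -> List[str]:
    """Extract the JSON array from a model reply and validate its shape."""
    # Tolerate extra prose around the array by slicing from '[' to the last ']'.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model response")
    panels = json.loads(raw[start:end + 1])
    if not isinstance(panels, list) or not all(isinstance(p, str) for p in panels):
        raise ValueError("expected a JSON array of strings")
    if len(panels) != panel_count:
        raise ValueError(f"expected {panel_count} panels, got {len(panels)}")
    return [p.strip() for p in panels]
```

Checking the panel count at parse time catches a short or overlong reply before it reaches image generation.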
| ### 7.2. `ComicGenerator` | |
| * **Responsibility:** Manages the visual generation and processing of the comic page. | |
| * **`generate_page()`:** Aggregates the panel descriptions from `StoryGenerator` into a single, complex prompt for the GPT-Image model. This prompt instructs the AI to create one composite image with all panels laid out in a grid. | |
| * **`split_panels()`:** Receives the generated page image and the layout data from `GeminiVision`. It uses this data to crop the page into individual panel images with high precision. | |
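The cropping step reduces to converting layout data into pixel boxes. The sketch below assumes a layout dictionary of the form `{"panels": [{"x", "y", "width", "height"}, ...]}` with values normalized to the 0 to 1 range; that schema and the `panel_boxes` helper are hypothetical, since this document does not specify the format returned by `analyze_layout()`.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def panel_boxes(layout: Dict, page_width: int, page_height: int) -> List[Box]:
    """Convert normalized panel rectangles into clamped pixel crop boxes."""
    boxes: List[Box] = []
    for panel in layout["panels"]:
        left = round(panel["x"] * page_width)
        upper = round(panel["y"] * page_height)
        right = round((panel["x"] + panel["width"]) * page_width)
        lower = round((panel["y"] + panel["height"]) * page_height)
        # Clamp to the page so a slightly oversized detection cannot overflow.
        left, right = max(0, left), min(page_width, right)
        upper, lower = max(0, upper), min(page_height, lower)
        # Skip degenerate boxes that would produce an empty crop.
        if right > left and lower > upper:
            boxes.append((left, upper, right, lower))
    return boxes
```

Each resulting tuple matches the `(left, upper, right, lower)` box that Pillow's `Image.crop()` accepts, so the panels could then be produced with `[page_image.crop(box) for box in boxes]`.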
| ### 7.3. `GeminiVision` | |
| * **Responsibility:** Performs visual analysis on the generated comic page. | |
| * **`analyze_layout()`:** This is the core of the intelligent panel-splitting feature. It takes the full-page image as input and uses the Gemini Vision model to visually identify the boundaries of each panel. It returns a dictionary containing the coordinates and dimensions of the detected grid. Detecting the boundaries from the image is more robust than assuming a fixed grid layout. | |
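Because the layout comes from a model, its output is worth validating before it is used for cropping. The sketch below checks a detected layout and falls back to an evenly divided grid when the result is unusable. The fallback behavior, the normalized `{"panels": [...]}` schema, and all function names are assumptions for illustration, not a description of the existing implementation.

```python
from typing import Dict, Optional

EPS = 1e-6  # tolerance for floating-point rounding at the page edge


def uniform_grid_layout(rows: int, cols: int) -> Dict:
    """Evenly divided grid in normalized (0 to 1) coordinates."""
    w, h = 1.0 / cols, 1.0 / rows
    return {"panels": [
        {"x": c * w, "y": r * h, "width": w, "height": h}
        for r in range(rows) for c in range(cols)
    ]}


def _is_valid_panel(panel: object) -> bool:
    """True if the panel has numeric fields that fit inside the page."""
    if not isinstance(panel, dict):
        return False
    try:
        x, y = float(panel["x"]), float(panel["y"])
        w, h = float(panel["width"]), float(panel["height"])
    except (KeyError, TypeError, ValueError):
        return False
    return x >= 0 and y >= 0 and w > 0 and h > 0 and x + w <= 1 + EPS and y + h <= 1 + EPS


def layout_or_fallback(detected: Optional[Dict], expected_panels: int,
                       rows: int, cols: int) -> Dict:
    """Return the detected layout if it looks sane, else a uniform grid."""
    panels = detected.get("panels") if isinstance(detected, dict) else None
    if (isinstance(panels, list) and len(panels) == expected_panels
            and all(_is_valid_panel(p) for p in panels)):
        return detected
    return uniform_grid_layout(rows, cols)
```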
| --- | |
| ## 8. Third-party Dependencies | |
| The complete list of Python packages is specified in `requirements.txt`. Key dependencies include: | |
| * **`openai`**: Python client for the OpenAI API. | |
| * **`google-generativeai`**: Python client for the Google AI (Gemini) API. | |
| * **`python-dotenv`**: For loading environment variables from the `.env` file. | |
| * **`Pillow`**: For image manipulation (cropping and saving). | |
| * **[Info Needed]**: The web framework used to build `app.py` (e.g., `gradio`, `flask`, `fastapi`). | |
| --- | |
| ## 9. Testing Instructions | |
| [TODO] A testing framework has not been established for this project. Future work should include: | |
| * **Test Suite Setup:** Choose and configure a testing framework (e.g., `pytest`). | |
| * **Unit Tests:** Create unit tests for individual methods in `StoryGenerator`, `ComicGenerator`, and `GeminiVision`. This should involve mocking the API calls to AI services to test the data processing logic in isolation. | |
| * **Integration Tests:** Develop tests for the entire generation pipeline, from user input to final split panels. | |
| * **Continuous Integration:** Set up a CI pipeline (e.g., using GitHub Actions) to run tests automatically on pull requests. | |
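As a starting point for the unit tests, the sketch below shows the mocking pattern with `unittest.mock`, which also works under `pytest`. The `StoryGenerator` here is a simplified stand-in with an injected, hypothetical `client.generate()` method; the real class may construct its API client differently, in which case `unittest.mock.patch` would be the tool to reach for.

```python
from typing import List
from unittest.mock import MagicMock


class StoryGenerator:
    """Simplified stand-in: the injected client and its API are hypothetical."""

    def __init__(self, client) -> None:
        self._client = client

    def generate_story(self, description: str) -> List[str]:
        raw = self._client.generate(f"Describe panels for: {description}")
        # One panel description per non-empty line of the reply.
        return [line.strip() for line in raw.splitlines() if line.strip()]


def test_generate_story_splits_response_into_panels() -> None:
    # Replace the AI client with a mock so no network call is made.
    client = MagicMock()
    client.generate.return_value = "Panel one\n\n  Panel two  \n"

    panels = StoryGenerator(client).generate_story("a cat learns to fly")

    assert panels == ["Panel one", "Panel two"]
    client.generate.assert_called_once()
```

Saved in a file such as `test_story_generator.py` (a hypothetical name), this function would be collected automatically by `pytest`.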
| --- | |
| ## 10. Troubleshooting & Common Issues | |
| [TODO] This section should be populated as common issues are identified. Potential areas to document include: | |
| * **API Key Errors:** Steps to verify that API keys are correctly configured and have the necessary permissions. | |
| * **Incoherent Stories:** Guidance on how to write effective initial descriptions to improve narrative quality. | |
| * **Poor Panel Splitting:** Troubleshooting steps for when Gemini Vision fails to detect the layout correctly (e.g., checking image complexity, trying a different art style). | |
| * **Long Generation Times:** Explanation of typical performance and factors that can cause delays (e.g., API provider latency, number of panels). | |
| --- | |
| ## 11. TODOs / Future Work | |
| The following are key areas for future development and contribution: | |
| * **Core Generation Logic:** | |
|   * Improve character consistency across multiple pages. | |
|   * Experiment with different AI models for potentially better visual or narrative results. | |
|   * Add support for including text (dialogue, captions) as an optional feature. | |
| * **UI/UX Enhancements:** | |
|   * Develop a more interactive interface for viewing and arranging panels. | |
|   * Allow users to regenerate individual panels without restarting the entire process. | |
|   * Add an option to export the final comic as a PDF or other formats. | |
| * **Accessibility Improvements:** | |
|   * Further refine the "Accessible" art style based on user feedback. | |
|   * Implement ARIA attributes and ensure full keyboard navigability for the web interface. | |
|   * Add an "image description" feature that reads a generated description of each panel aloud through a text-to-speech engine. | |
| * **Documentation:** | |
|   * Create a detailed API reference for developers looking to build on the platform. | |
|   * Write user-facing guides on how to get the best results from the generator. | |
| --- | |
| ## 12. Contact / Ownership Info | |
| * **Source Code:** [https://github.com/yourusername/Comic-Story-Generator](https://github.com/yourusername/Comic-Story-Generator) | |
| * **License:** This project is licensed under the **MIT License**. For full details, see the `LICENSE` file in the repository. | |
| * **Primary Contact:** [Info Needed: Add primary maintainer's name and contact information (e.g., GitHub handle or email).] |