	# Comic Story Generator: Code Handover Document

	Date: 2025-07-22
	Document Purpose: This document provides a comprehensive technical handover for the Comic Story Generator project. It is intended for developers and future maintainers responsible for the deployment, maintenance, and extension of the application.

	---

	## 1. Project Overview

	The Comic Story Generator is a web application that automatically creates multi-page, textless comic stories from a user-provided description. The application leverages generative AI to produce visually coherent narratives, focusing on character consistency, expressive emotion, and logical panel sequencing.

	### 1.1. Core Functionality

	The application is designed to translate a textual story concept into a purely visual comic strip. Key characteristics include:

	* AI-Powered Narrative: Utilizes Google's Gemini to interpret the user's concept and break it down into a structured, panel-by-panel narrative.
	* Visual Generation: Employs a GPT-based image model to render complete comic pages based on the AI-generated narrative structure.
	* Intelligent Panel Detection: Uses Gemini Vision to analyze the generated full-page image and accurately detect the boundaries of each panel, ensuring precise splitting.
	* Customization: Offers users control over the output, including:
	    * Layout: Choice of panel count (from 4 to 24).
	    * Length: Generation of 1 to 10 pages.
	    * Art Style: A selection of visual styles, including "Classic Comic," "Manga," "Cartoon," "Digital Paint," and a high-contrast "Accessible" style designed for users with special needs.
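
The limits listed above can be checked before any API call is made. The following is a minimal sketch, not the project's actual code: `ComicOptions`, its field names, and its default values are hypothetical, while the ranges and style names come from this section.

```python
from dataclasses import dataclass

# Style names as listed in this document; the tuple itself is illustrative.
ART_STYLES = ("Classic Comic", "Manga", "Cartoon", "Digital Paint", "Accessible")


@dataclass(frozen=True)
class ComicOptions:
    """Hypothetical container for the user-selectable settings."""

    panel_count: int = 4              # assumed default
    page_count: int = 1               # assumed default
    art_style: str = "Classic Comic"  # assumed default

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges.
        if not 4 <= self.panel_count <= 24:
            raise ValueError("panel_count must be between 4 and 24")
        if not 1 <= self.page_count <= 10:
            raise ValueError("page_count must be between 1 and 10")
        if self.art_style not in ART_STYLES:
            raise ValueError(f"unknown art style: {self.art_style!r}")
```

Validating at construction time means an invalid request fails immediately instead of after a slow, billable generation call.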

	### 1.2. High-Level Workflow

	The generation process follows a clear, multi-step pipeline:

	1. User Input: The user submits a short description of the desired story.
	2. Story Generation: The `StoryGenerator` component uses Gemini to create a detailed, scene-by-scene description for each comic panel.
	3. Page Generation: The `ComicGenerator` takes the panel descriptions and instructs the GPT-Image model to generate a single, composite image representing a full comic page with panels arranged in a grid.
	4. Layout Analysis: The generated page is passed to the `GeminiVision` component, which analyzes the image to identify the precise coordinates and boundaries of each panel.
	5. Panel Splitting: The application uses the coordinates from the vision analysis to accurately split the composite image into individual panel images.
	6. Final Output: The processed panels are presented to the user as a complete, multi-page visual story.
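
Steps 2 to 5 can be sketched as a single function. This is an illustration under assumptions, not the code in `app.py`: the function name `run_pipeline` is hypothetical, the components are passed in as arguments so the flow can be exercised with stand-ins, and `enhance_visuals()` is left out because the workflow above does not say where it runs. The method names follow the class diagram in Section 2.1.

```python
from typing import Any


def run_pipeline(description: str, story: Any, comic: Any, vision: Any) -> list:
    """Run one page through the documented steps and return its panels.

    `story`, `comic`, and `vision` are expected to expose the methods
    shown in the class diagram; any object with those methods will do.
    """
    # Step 2: turn the user's description into per-panel descriptions.
    panel_descriptions = story.generate_story(description)

    # Step 3: render one composite page image containing every panel.
    page_image = comic.generate_page(panel_descriptions)

    # Step 4: ask the vision component where the panel borders are.
    layout = vision.analyze_layout(page_image)

    # Step 5: crop the page into individual panels using that layout.
    return comic.split_panels(page_image, layout)
```

Passing the components in, rather than constructing them inside the function, keeps the orchestration testable without network access.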

	---

	## 2. System Architecture

	The application is built on a modular architecture composed of three primary classes, each responsible for a distinct part of the generation pipeline.

	### 2.1. System Diagram

	```mermaid
	classDiagram
	class StoryGenerator{
	+generate_story(description: string) : list[string]
	+enhance_visuals(panel_descriptions: list) : list[string]
	}
	class ComicGenerator{
	+generate_page(panel_descriptions: list) : Image
	+split_panels(page_image: Image, grid_layout: dict) : list[Image]
	}
	class GeminiVision{
	+analyze_layout(page_image: Image) : dict
	}

	StoryGenerator "1" -- "1" ComicGenerator : Provides panel descriptions
	ComicGenerator "1" -- "1" GeminiVision : Uses for layout analysis
	```

	### 2.2. Data Flow

	The end-to-end data flow illustrates the interaction between the user, the application, and the underlying AI models.

	```mermaid
	sequenceDiagram
	participant User
	participant App
	participant Gemini as Gemini (Text/Story)
	participant GPTImage as GPT-Image (Visuals)
	participant GeminiVision as Gemini Vision (Analysis)

	User->>+App: Submits story description
	App->>+Gemini: Requests story structure from description
	Gemini-->>-App: Returns panel-by-panel text descriptions
	App->>+GPTImage: Requests comic page generation from descriptions
	GPTImage-->>-App: Returns single full-page image
	App->>+GeminiVision: Requests layout analysis of the image
	GeminiVision-->>-App: Returns coordinates of each panel
	App-->>-User: Displays final, split-panel comic
	```

	---

	## 3. Setup & Installation

	### 3.1. Prerequisites

	* Python: Version 3.9 or higher.
	* API Keys:
	    * An active OpenAI API key.
	    * An active Google API key with access to the Gemini family of models.

	### 3.2. Installation Steps

	1. Clone the Repository:
	```bash
	git clone https://github.com/yourusername/Comic-Story-Generator.git
	cd Comic-Story-Generator
	```

	2. Create and Activate a Virtual Environment:
	```bash
	# Create the environment
	python -m venv venv

	# Activate the environment (macOS/Linux)
	source venv/bin/activate

	# Or, activate on Windows
	# venv\Scripts\activate
	```

	3. Install Dependencies:
	```bash
	pip install -r requirements.txt
	```

	4. Configure Environment Variables:
	Create a `.env` file in the project root and add your API keys.
	```bash
	echo "OPENAI_API_KEY=your_openai_key" > .env
	echo "GOOGLE_API_KEY=your_google_key" >> .env
	```
	Note: Ensure the `.env` file is added to your `.gitignore` file to prevent committing secrets.

	---

	## 4. Environment Variables / Secrets

	The application requires the following environment variables to be set in a `.env` file at the project's root.

	| Variable | Description | Required | Example |
	| :--- | :--- | :--- | :--- |
	| `OPENAI_API_KEY` | API key for the OpenAI service, used for GPT-Image generation. | Yes | `sk-xxxxxxxxxxxxxxxxxxxxxxxx` |
	| `GOOGLE_API_KEY` | API key for Google AI services, used for Gemini (story structure) and Gemini Vision (layout analysis). | Yes | `AIzaSyxxxxxxxxxxxxxxxxxxxxx` |
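
A startup check for these two variables might look like the sketch below. The function name `load_api_keys` and the error message are hypothetical, and how `app.py` actually reads its configuration is not documented here. The sketch uses `load_dotenv()` from `python-dotenv` when that package is installed and otherwise falls back to variables already present in the process environment.

```python
import os
from typing import Mapping, Optional

REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def load_api_keys(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the required keys, or raise if any is missing or blank."""
    if env is None:
        try:
            # python-dotenv is listed among this project's dependencies.
            from dotenv import load_dotenv
            load_dotenv()  # loads values from a .env file into os.environ
        except ImportError:
            pass  # rely on variables exported in the shell instead
        env = os.environ

    missing = [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Calling this once at startup surfaces a missing key as a clear error rather than as an authentication failure in the middle of a generation run.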

	---

	## 5. How to Run

	After completing the setup and installation steps, launch the application with the following command from the project's root directory:

	```bash
	python app.py
	```

	The application will start a local web server, and the interface will be accessible at the URL provided in the console (typically `http://127.0.0.1:7860`).

	---

	## 6. Deployment Instructions

	[TODO] This section requires documentation for deploying the application to a production environment. Steps should include:
	* Recommended hosting provider (e.g., AWS, Heroku, DigitalOcean).
	* Instructions for setting up a production-grade web server (e.g., Gunicorn).
	* Configuration of a reverse proxy (e.g., Nginx).
	* Management of production environment variables/secrets.
	* Process management (e.g., using `systemd`).

	---

	## 7. Core Components & Logic

	The application logic is encapsulated in three main classes.

	### 7.1. `StoryGenerator`

	* Responsibility: Handles the narrative creation phase.
	* `generate_story()`: Takes the raw user description as input. It constructs a prompt for the Gemini model to elicit a structured response containing a list of detailed text descriptions, one for each comic panel.
	* `enhance_visuals()`: Processes the panel descriptions to add specific visual cues and optimizations, particularly for the "Accessible" style, ensuring high contrast and simplified object representation.
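
One way to obtain a structured response is to ask the model for JSON and validate what comes back. The sketch below is an assumption about how that could work, not the project's prompt: both function names, the prompt wording, and the choice of a JSON array as the reply format are hypothetical. The actual call to Gemini is left out.

```python
import json


def build_story_prompt(description: str, panel_count: int) -> str:
    """Ask for exactly `panel_count` textless panel descriptions as JSON."""
    return (
        f"Break the following story into exactly {panel_count} comic panels.\n"
        "Describe each panel visually, with no dialogue or captions.\n"
        "Reply with a JSON array of strings and nothing else.\n\n"
        f"Story: {description.strip()}"
    )


def parse_panel_descriptions(reply: str, panel_count: int) -> list:
    """Turn the model's reply into a validated list of panel descriptions."""
    panels = json.loads(reply)  # raises a ValueError subclass on invalid JSON
    if not isinstance(panels, list) or not all(isinstance(p, str) for p in panels):
        raise ValueError("expected a JSON array of strings")
    if len(panels) != panel_count:
        raise ValueError(f"expected {panel_count} panels, got {len(panels)}")
    return [p.strip() for p in panels]
```

Checking the panel count here catches a short or over-long reply before it reaches image generation.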

	### 7.2. `ComicGenerator`

	* Responsibility: Manages the visual generation and processing of the comic page.
	* `generate_page()`: Aggregates the panel descriptions from `StoryGenerator` into a single, complex prompt for the GPT-Image model. This prompt instructs the AI to create one composite image with all panels laid out in a grid.
	* `split_panels()`: Receives the generated page image and the layout data from `GeminiVision`. It uses this data to crop the page into individual panel images with high precision.
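
The cropping step can be illustrated with plain coordinate arithmetic. The structure of the layout dictionary is not specified in this document, so the sketch assumes a hypothetical schema: a `"panels"` list whose entries carry `x`, `y`, `width`, and `height` in pixels. The helper name `panel_boxes` is also hypothetical.

```python
from typing import Any

# (left, upper, right, lower): the box format accepted by Pillow's Image.crop.
Box = tuple


def panel_boxes(layout: dict, page_size: tuple) -> list:
    """Convert detected panels into crop boxes clamped to the page bounds."""
    page_width, page_height = page_size
    boxes = []
    for panel in layout["panels"]:
        left = max(0, int(panel["x"]))
        upper = max(0, int(panel["y"]))
        right = min(page_width, int(panel["x"] + panel["width"]))
        lower = min(page_height, int(panel["y"] + panel["height"]))
        # Skip boxes that end up empty after clamping.
        if right > left and lower > upper:
            boxes.append((left, upper, right, lower))
    return boxes
```

With Pillow, each resulting box could then be passed to `page_image.crop(box)` to obtain one panel image. Clamping guards against detections that extend slightly past the image edge.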

	### 7.3. `GeminiVision`

	* Responsibility: Performs visual analysis on the generated comic page.
	* `analyze_layout()`: This is the core of the intelligent panel-splitting feature. It takes the full-page image as input and uses the Gemini Vision model to identify the boundaries of each panel, then returns a dictionary containing the coordinates and dimensions of the detected grid. Detecting the boundaries from the image itself is more robust than assuming a fixed grid layout, because the generated page does not always follow an exact grid.
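
Because the layout comes from a model reply, it is worth validating before it is used for cropping. The following sketch assumes, hypothetically, that the model is asked to answer with a JSON object of the form `{"panels": [{"x": ..., "y": ..., "width": ..., "height": ...}]}`; that schema and the function name `parse_layout_reply` are illustrative and are not taken from the project's code.

```python
import json
import re


def parse_layout_reply(reply: str) -> dict:
    """Extract and validate a layout object from a model's text reply."""
    # Replies may carry extra text around the JSON, so take the span from
    # the first opening brace to the last closing brace.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")

    layout = json.loads(match.group(0))
    panels = layout.get("panels")
    if not isinstance(panels, list) or not panels:
        raise ValueError("layout contains no panels")

    for panel in panels:
        if not isinstance(panel, dict):
            raise ValueError("each panel must be a JSON object")
        for key in ("x", "y", "width", "height"):
            if not isinstance(panel.get(key), (int, float)):
                raise ValueError(f"panel is missing numeric {key!r}")
    return layout
```

A caller could catch `ValueError` and retry the analysis or report the failure to the user.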

	---

	## 8. Third-party Dependencies

	The complete list of Python packages is specified in `requirements.txt`. Key dependencies include:

	* `openai`: Python client for the OpenAI API.
	* `google-generativeai`: Python client for the Google AI (Gemini) API.
	* `python-dotenv`: For loading environment variables from the `.env` file.
	* `Pillow`: For image manipulation (cropping and saving).
	* [Info Needed]: The web framework used to build `app.py` (e.g., `gradio`, `flask`, `fastapi`).

	---

	## 9. Testing Instructions

	[TODO] A testing framework has not been established for this project. Future work should include:
	* Test Suite Setup: Choose and configure a testing framework (e.g., `pytest`).
	* Unit Tests: Create unit tests for individual methods in `StoryGenerator`, `ComicGenerator`, and `GeminiVision`. This should involve mocking the API calls to AI services to test the data processing logic in isolation.
	* Integration Tests: Develop tests for the entire generation pipeline, from user input to final split panels.
	* Continuous Integration: Set up a CI pipeline (e.g., using GitHub Actions) to run tests automatically on pull requests.
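
As a starting point for the unit tests, the sketch below shows the mocking approach with `unittest.mock` from the standard library, written as a plain test function that `pytest` can collect. `StoryGeneratorStandIn`, its injected `client`, and the `client.generate()` method are hypothetical stand-ins: the real `StoryGenerator` constructor and the way it calls Gemini are not documented here.

```python
from unittest.mock import Mock


class StoryGeneratorStandIn:
    """Hypothetical stand-in that receives its model client as an argument."""

    def __init__(self, client):
        self.client = client

    def generate_story(self, description: str) -> list:
        reply = self.client.generate(description)  # hypothetical client method
        # One panel description per non-empty line of the reply.
        return [line.strip() for line in reply.splitlines() if line.strip()]


def test_generate_story_splits_reply_into_panels():
    # The mock replaces the network call, so only the parsing logic is tested.
    client = Mock()
    client.generate.return_value = "Panel one\n\n  Panel two  \n"

    panels = StoryGeneratorStandIn(client).generate_story("a cat learns to fly")

    assert panels == ["Panel one", "Panel two"]
    client.generate.assert_called_once_with("a cat learns to fly")
```

Injecting the client keeps the test fast and free of API costs. If the real classes create their clients internally, `unittest.mock.patch` is the usual alternative.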

	---

	## 10. Troubleshooting & Common Issues

	[TODO] This section should be populated as common issues are identified. Potential areas to document include:
	* API Key Errors: Steps to verify that API keys are correctly configured and have the necessary permissions.
	* Incoherent Stories: Guidance on how to write effective initial descriptions to improve narrative quality.
	* Poor Panel Splitting: Troubleshooting steps for when Gemini Vision fails to detect the layout correctly (e.g., checking image complexity, trying a different art style).
	* Long Generation Times: Explanation of typical performance and factors that can cause delays (e.g., API provider latency, number of panels).

	---

	## 11. TODOs / Future Work

	Based on the project's focus areas, the following are key areas for future development and contribution:

	* Core Generation Logic:
	    * Improve character consistency across multiple pages.
	    * Experiment with different AI models for potentially better visual or narrative results.
	    * Add support for including text (dialogue, captions) as an optional feature.
	* UI/UX Enhancements:
	    * Develop a more interactive interface for viewing and arranging panels.
	    * Allow users to regenerate individual panels without restarting the entire process.
	    * Add an option to export the final comic as a PDF or other formats.
	* Accessibility Improvements:
	    * Further refine the "Accessible" art style based on user feedback.
	    * Implement ARIA attributes and ensure full keyboard navigability for the web interface.
	    * Add an "image description" feature where a text-to-speech engine can describe the generated panels.
	* Documentation:
	    * Create a detailed API reference for developers looking to build on the platform.
	    * Write user-facing guides on how to get the best results from the generator.

	---

	## 12. Contact / Ownership Info

	* Source Code: [https://github.com/yourusername/Comic-Story-Generator](https://github.com/yourusername/Comic-Story-Generator)
	* License: This project is licensed under the MIT License. For full details, see the `LICENSE` file in the repository.
	* Primary Contact: [Info Needed: Add primary maintainer's name and contact information (e.g., GitHub handle or email).]