AI Engineering Project
Project Overview
For this project, you will design, build, and evaluate a Retrieval-Augmented
Generation (RAG) LLM-based application that answers user questions about a corpus of
company policies & procedures. You will then deploy the application to a free-tier host
(e.g., Render, Railway) with a basic CI/CD pipeline (e.g., GitHub Actions) that triggers
deployment on push/PR when the app builds successfully. Finally, you will demonstrate
the system via a screen-share video showing key features of your deployed application
and a quick walkthrough of your design, evaluation, and CI/CD run. You can complete this
project either individually or in a group of no more than three people.
While you can fully hand-code this project if you wish, you are highly encouraged to
use leading AI code-generation models, AI IDEs, or async agents to help you rapidly
produce your solution, being sure to describe in broad terms how you made use of
them. Here are some examples of very useful AI tools you may wish to consider. You will
be graded on the quality and functionality of the application and how well it meets the
project requirements; no given proportion of the code is required to be hand-coded.
Learning Outcomes
When completed successfully, this project will enable you to:
● Demonstrate excellent AI engineering skills
● Demonstrate the ability to select appropriate AI application design and
architecture
● Implement a working LLM-based application including RAG
● Evaluate the performance of an LLM-based application
● Utilize AI tooling as appropriate
Project Description
First, assemble a small but coherent corpus of documents outlining company policies &
procedures: about 5–20 short markdown/HTML/PDF/TXT files totaling 30–120 pages.
You may author them yourself (with AI assistance) or use policies from your own
organization that may be used for this assignment. Students must use a
corpus they can legally include in the repo or load at runtime (e.g., your own synthetic
policies, your organization’s employee policy documents, etc.); no private or paid data is
required. Additionally, you should define success metrics for your application (see the
“Evaluation” step below), including at least one information-quality metric (e.g.,
groundedness or citation accuracy) and one system metric (e.g., latency).
Use free or zero-cost options when possible, e.g., OpenRouter’s free tier
(https://openrouter.ai/docs/api-reference/limits), Groq
(https://console.groq.com/docs/rate-limits), or your own paid API keys if you have them.
For embedding models, free-tier options are available from Cohere, Voyage,
HuggingFace, and others.
Complete the following steps to fully develop, deploy, and evaluate your application:
Environment and Reproducibility
○ Create a virtual environment (e.g., venv, conda).
○ List dependencies in requirements.txt (or environment.yml).
○ Provide a README.md with setup and run instructions.
○ Set fixed seeds where applicable (for deterministic chunking or
evaluation sampling); see the sketch below.
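For the seeding item above, here is a minimal sketch, assuming the evaluation questions live in a JSON file (the path and sample size are illustrative):

```python
# Deterministic evaluation sampling: fixing the seed makes every rerun
# draw the same subset of questions.
import json
import random

random.seed(42)  # fixed seed for reproducibility

with open("eval/questions.json", encoding="utf-8") as f:
    questions = json.load(f)  # assumed: a JSON list of question strings

sample = random.sample(questions, k=10)  # same 10 questions every run
```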
Ingestion and Indexing
○ Parse & clean documents (handle PDFs/HTML/md/txt).
○ Chunk documents (e.g., by headings or token windows with overlap).
○ Embed chunks with a free embedding model or a free-tier API.
○ Store the embedded chunks in a local or lightweight vector database
(e.g., Chroma) or a cloud-hosted vector store (e.g., Pinecone). A minimal
ingestion sketch follows this list.
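One way this could look end to end, assuming a local Chroma index and a free sentence-transformers embedding model; the corpus path, chunk sizes, and collection name are illustrative:

```python
# Minimal ingestion sketch: chunk markdown files, embed, store in Chroma.
# Assumes `pip install chromadb sentence-transformers`.
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

CHUNK_SIZE, OVERLAP = 800, 100  # characters; tune for your corpus

def chunk(text: str) -> list[str]:
    """Split text into fixed-size windows with overlap."""
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # free local embedding model
client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection("policies")

for doc in Path("corpus").glob("*.md"):  # illustrative corpus directory
    chunks = chunk(doc.read_text(encoding="utf-8"))
    if not chunks:
        continue
    collection.add(
        ids=[f"{doc.stem}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=model.encode(chunks).tolist(),
        metadatas=[{"source": doc.name}] * len(chunks),
    )
```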
Retrieval and Generation (RAG)
○ To build your RAG pipeline you may use frameworks such as LangChain to
handle retrieval, prompt chaining, and API calls, or implement these
manually.
○ Implement top-k retrieval with optional re-ranking.
○ Build a prompting strategy that injects retrieved chunks (and
citations/sources) into the LLM context.
○ Add basic guardrails (see the sketch after this list):
■ Refuse to answer outside the corpus (“I can only answer about our
policies”),
■ Limit output length,
■ Always cite source doc IDs/titles for answers.
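A framework-free sketch of top-k retrieval, prompt assembly, and the guardrails above, reusing `model` and `collection` from the ingestion sketch; the distance threshold, word limit, and the `call_llm` helper are illustrative assumptions (wire `call_llm` to your chat-completion provider, e.g., OpenRouter or Groq):

```python
# Top-k retrieval + prompt assembly with refusal, length, and citation guardrails.
def answer(question: str, k: int = 4) -> str:
    q_emb = model.encode([question]).tolist()
    hits = collection.query(query_embeddings=q_emb, n_results=k)
    docs = hits["documents"][0]
    metas = hits["metadatas"][0]
    dists = hits["distances"][0]

    # Guardrail: refuse when nothing in the corpus is close enough.
    if not docs or min(dists) > 0.8:  # threshold is corpus-dependent
        return "I can only answer questions about our company policies."

    # Inject retrieved chunks with bracketed source tags for citation.
    context = "\n\n".join(f"[{m['source']}] {d}" for d, m in zip(docs, metas))
    prompt = (
        "Answer using ONLY the context below. Cite the bracketed source "
        "for every claim, and keep the answer under 200 words.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # hypothetical helper wrapping your LLM API
```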
Web Application
○ Students can use Flask, Streamlit, or an alternative for the web app. LangChain
is recommended for orchestration, but is optional.
○ Endpoints/UI (see the Flask sketch below):
■ / - Web chat interface (text box for user input)
■ /chat - API endpoint that receives user questions (POST) and returns
model-generated answers with citations and snippets (link to source
and show snippet).
■ /health - returns a simple status as JSON.
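A minimal Flask sketch of the three endpoints, assuming the `answer()` function from the retrieval sketch and a simple `chat.html` template (both names are illustrative):

```python
# Minimal Flask app wiring the three required endpoints.
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Simple chat page with a text box; chat.html is an assumed template.
    return render_template("chat.html")

@app.route("/chat", methods=["POST"])
def chat():
    question = request.get_json(force=True)["question"]
    # answer() is the RAG function sketched above; its output should
    # already carry citations and snippets.
    return jsonify({"answer": answer(question)})

@app.route("/health")
def health():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True)
```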
Deployment
○ For production hosting, use the Render or Railway free tier; students may
alternatively use any other free-tier provider of their choice.
○ Configure environment variables (e.g., API keys, model endpoints, database
settings); see the sketch below.
○ Ensure the app is publicly accessible at a shareable URL.
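Reading configuration from environment variables might look like this (the variable names and the default path are illustrative assumptions):

```python
import os

# Fail fast if a required secret is missing; never hard-code keys in the repo.
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]  # assumed variable name
# Optional settings can fall back to sensible defaults.
CHROMA_PATH = os.environ.get("CHROMA_PATH", "./index")
```

Hosts such as Render and Railway let you set these values in the service dashboard, so the same code runs locally and in production.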
CI/CD
○ Minimal automated testing is sufficient for this assignment (a build/run
check, optional smoke test).
○ Create a GitHub Actions workflow that, on push/PR (see the sketch after
this list):
■ Installs dependencies,
■ Runs a build/start check (e.g., python -m pip install -r
requirements.txt and python -c "import app", or pytest -q if you add
tests),
■ On success on main, deploys to your host (Render/Railway action or
via webhook/API).
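A minimal sketch of such a workflow (in YAML, since GitHub Actions workflows are defined in YAML); the Python version and the RENDER_DEPLOY_HOOK secret name are assumptions, with Render's deploy-hook URL stored as a repository secret:

```yaml
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python -m pip install -r requirements.txt
      - run: python -c "import app"  # build/start smoke check

  deploy:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      # Trigger the host's deploy hook only after the build check passes.
      # RENDER_DEPLOY_HOOK is an assumed secret holding your deploy-hook URL.
      - run: curl -fsS -X POST "${{ secrets.RENDER_DEPLOY_HOOK }}"
```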
Evaluation of the LLM Application
○ Provide a small evaluation set of 15–30 questions covering various policy
topics (PTO, security, expenses, remote work, holidays, etc.). Report the
following (a minimal measurement sketch follows this list):
■ Answer Quality (required):
1. Groundedness: % of answers whose content is factually
consistent with and fully supported by the retrieved
evidence, i.e., the answer contains no information that is
absent from or contradicted by the context.
2. Citation Accuracy: % of answers whose listed citations
correctly point to the specific passage(s) that support the
information stated, i.e., the attribution is correct and not
misleading.
3. Exact/Partial Match (optional): % of answers that exactly or
partially match a short gold answer you provide.
■ System Metrics (required):
1. Latency (p50/p95) from request to answer for 10–20 queries.
■ Ablations (optional): compare retrieval k, chunk size, or prompt
variants.
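A minimal sketch for the required system metric plus an aggregate quality rate, assuming the `answer()` function from the retrieval sketch and hand-labelled 0/1 judgements per answer (the labelling itself can be manual or LLM-assisted):

```python
# Latency p50/p95 over a batch of queries, plus a percentage rate from
# per-answer 0/1 labels (1 = the answer passed the check).
import statistics
import time

def latency_stats(questions: list[str]) -> tuple[float, float]:
    times = []
    for q in questions:
        start = time.perf_counter()
        answer(q)  # the RAG function sketched earlier
        times.append(time.perf_counter() - start)
    cuts = statistics.quantiles(times, n=20)  # 19 cut points at 5% steps
    return statistics.median(times), cuts[18]  # (p50, p95) in seconds

def metric_rate(labels: list[int]) -> float:
    """Percentage of answers labelled 1; usable for groundedness or
    citation accuracy alike."""
    return 100 * sum(labels) / len(labels)
```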
Design Documentation
○ Briefly justify your design choices (embedding model, chunking, k, prompt
format, vector store).
Submission Guidelines
Your final submission should consist of two links:
● A link to an accessible software repository (a GitHub repo) containing all your
developed code. You must share your repository with the GitHub account
quantic-grader.
○ The GitHub repository should include a link to the deployed version of
your RAG LLM-based application (in the file deployed.md).
○ The GitHub repository must include a README.md file with setup and
run instructions.
○ The GitHub repository must also include a brief design and evaluation
document (design-and-evaluation.md) listing and explaining:
i) the design and architecture decisions made, and why they were made,
including technology choices
ii) a summary of your evaluation of your RAG system
● A link to a recorded screen-share demonstration video of the working RAG
LLM-based application, showing it in use via screen capture, with voiceover.
○ All group members must speak and be present on camera.
○ All group members must show their government ID.
○ The demonstration/presentation should be between 5 and 10 minutes long.
To submit your project, please click on the "Submit Project" button on your dashboard
and follow the steps provided. If you are submitting your project as a group, please
ensure only ONE member submits on behalf of the group. Please reach out to
[email protected] if you have any questions. Project grading typically takes
about 3–4 weeks to complete after the submission due date. There is no score penalty
for projects submitted after the due date; however, grading may be delayed.
Plagiarism Policy
Here at Quantic, we believe that learning is best accomplished by “doing”. This ethos
underpinned the design of our active learning platform, and it likewise informs our
approach to the completion of projects and presentations for our degree programs. We
expect that all of our graduates will be able to deploy the concepts and skills they’ve
learned over the course of their degree, whether in the workplace or in pursuit of
personal goals, and so it is in our students’ best interest that these assignments be
completed solely through their own efforts, with academic integrity.
Quantic takes academic integrity very seriously. We define plagiarism as: “Knowingly
representing the work of others as one’s own, engaging in any acts of plagiarism, or
referencing the works of others without appropriate citation.” This includes both misusing
or omitting proper citations for the works referenced and submitting someone else’s
work as your own. Quantic monitors all submissions for instances of plagiarism, and all
plagiarism, even unintentional, is considered a conduct violation. If you’re still not sure
what constitutes plagiarism, check out this two-minute presentation by our
librarian, Kristina. It is important to be conscientious when citing your sources. When in
doubt, cite! Kristina outlines the basics of best citation practices in this one-minute video.
You can also find more about our plagiarism policy here.
Project Rubric
Scores of 2 and above are considered passing. Students who receive a 1 or 0 will not get
credit for the assignment and must revise and resubmit to receive a passing grade.
Score 5
● Addresses ALL of the project requirements, including but not limited to:
○ Outstanding RAG application with correct responses and matching
citations; ingestion and indexing work
○ Excellent, well-structured application architecture
○ Public deployment on Render, Railway, or equivalent, fully functional
○ CI/CD runs on push/PR and deploys on success
○ Excellent documentation of design choices
○ Excellent evaluation results, including groundedness, citation
accuracy, and latency
○ Excellent, clear demo of features, design, and evaluation
Score 4
● Addresses MOST of the project requirements, including but not limited to:
○ Excellent RAG application with correct responses and generally
matching citations; ingestion and indexing work
○ Very good, well-structured application architecture
○ Public deployment on Render, Railway, or equivalent, almost fully
functional
○ CI/CD runs on push/PR and deploys on success
○ Very good documentation of design choices
○ Very good evaluation results, including groundedness, citation
accuracy, and latency
○ Very good, clear demo of features, design, and evaluation
Score 3
● Addresses SOME of the project requirements, including but not limited to:
○ Very good RAG application with mainly correct responses and
generally matching citations; ingestion and indexing work
○ Good, well-structured application architecture
○ Public deployment on Render, Railway, or equivalent, almost fully
functional
○ CI/CD runs on push/PR and deploys on success
○ Good documentation of design choices
○ Good evaluation results, including most of groundedness,
citation accuracy, and latency
○ Good, clear demo of features, design, and evaluation
Score 2
● Addresses FEW of the project requirements, including but not limited to:
○ Passable RAG application with limited correct responses and few
matching citations; ingestion and indexing work partially
○ Passable application architecture
○ Public deployment on Render, Railway, or equivalent, not fully
functional
○ CI/CD runs on push/PR and deploys on success
○ Passable documentation of design choices
○ Passable evaluation results, including only some of
groundedness, citation accuracy, and latency
○ Passable demo of features, design, and evaluation
Score 1
● Addresses the project, but MOST of the project requirements are missing,
including but not limited to:
○ Incomplete app; not deployed
○ No CI/CD
○ No or very limited evaluation
○ No design documentation
○ No demo of the application
Score 0
● The student either did not complete the assignment, plagiarized all or part
of the assignment, or completely failed to address the project requirements.