Athena CTF A Modular Framework for Instructional Capture-the-Flag Challenges
Zach Frank · Supervisors: Randy Fortier, Mariana Shimabukuro
Athena CTF in Brief
An open-source modular framework for the rapid creation, containerization, and deployment of web-based CTF challenges — with assessment-ready per-user parametrization and constrained LLM-assisted hints.
Lightweight, Modular Authoring
Each level is a self-contained Page object. Authors supply an instructions endpoint and a verify function — the framework wires routing, templates, cookies and verification.
Assessment-Oriented
Deterministic per-user or per-team parametrization reduces trivial answer sharing; LMS-exported rosters (CSV/JSON) provision users by a single athenaId field — no other PII is stored.
Constrained LLM Hints
One-shot, non-conversational hints grounded in the user's interaction history and a creator-authored solution path. Supports commercial APIs or locally hosted models.
Containerized Delivery
Docker images or source-repo spawning scripts. Runs on Unix or Windows, on a local classroom network or behind a reverse proxy — no VM required for learners.
The Problem With Current CTFs
- 01
Levels can be difficult to build
Authoring a level often means plumbing boilerplate, managing session state, and hand-rolling templates.
- 02
Difficult to deploy and manage
Instructors juggle Docker, database migrations and networking across classroom infrastructure.
- 03
Static flags lead to cheating
Once a flag leaks on a Discord server the challenge effectively becomes a free-for-all.
- 04
Static or no hints hurt UX
A stuck student with no scaffolding disengages and walks away — the opposite of the intended lesson.
- 05
Learning ends with the competition
Once the event closes, the challenges disappear. Students lose the chance to revisit and reflect.
Web-Based, Modular Architecture
Athena is an entirely web-native stack built on FastAPI and Jinja2 templates, with an optional MongoDB layer for tracking and a separately-deployed admin container.
End-to-End, Optional Where It Counts
Challenges can be distributed as Docker images through registries like Docker Hub, or as source repositories with provided spawning scripts. Both Unix and Windows hosts are supported. Learners interact through their native browser — no VM, no extensions, no preconfigured environment.
The optional database layer enables learner tracking, LMS-based provisioning, and LLM-assisted hints. Disabling it yields a fully standalone deployment suitable for demos or offline workshops.
Architecture Overview
Levels are Page objects registered at import time. A shared application core handles routing, sessions, templates and the verification pipeline; the database connector and LLM connector are both optional and swap-in/swap-out at deploy time.
In a 15-challenge demo, the average user record was ~115 bytes with an index of roughly 1 MB across ~11,000 users — small enough for free-tier cloud Mongo or a single classroom VM.
Constrained, Context-Driven Hints
One-shot plain-text hints generated on explicit user request — not an open-ended chat agent. The output space is bounded to reduce prompt injection, over-scaffolding, and solution leakage.
Grounded in Two Controlled Sources
- User interaction history — prior requests and failed submissions, pulled from the centralized DB when enabled.
- Creator-authored solution path — a write-up or executable script shipped with the level, loaded into memory at runtime.
The model is guided toward the intended solving strategy and prevented from inventing alternative paths.
Non-Conversational & Stateless
Each request is independent — no dialogue memory, no chained turns. This keeps inference costs predictable for classroom-scale deployments and sharply limits the attack surface for prompt injection.
Admins choose the model, the context volume, or disable LLM hints entirely for assessments.
Modular Level Creation
Each level is a self-contained Page object. Authors provide an instructions endpoint (what the learner sees) and a verify endpoint (boolean correctness, optional error code) — the framework does the rest.
- Jinja2 shared templates — a small set of layouts every level draws from, so styling stays consistent and white-labelling is trivial.
- Server-side helper components —
generate_button,FormGroup,FormData, tables, accordions; authors compose the UI in Python. - Safe-by-default rendering — injected content is sanitized automatically; authors can opt out only where a vulnerability genuinely requires raw HTML.
- Auto-registration — Page objects register with the app at import time, giving every level uniform ordering, nav, and verification semantics.
- Optional DB hookup — levels run standalone; enabling the centralized Mongo store unlocks tracking, hints, and LMS export with no code changes.
from app.utils.page import Page
from app.utils.extensions import templates
from config import get_config
config = get_config()
tiny = Page("Tiny Level")
async def verify(request):
data = await request.json()
return {"success": data.get("flag") == "HELLO_WORLD"}
async def instructions(request):
return templates.TemplateResponse(request=request,
name=config.TEMPLATE,
context={"text": "Submit the flag HELLO_WORLD"})
tiny.set_functions(verify=verify, instructions=instructions)
Deterministic Parametrization
Each level carries a level-specific secret generated at creation time. At verify time, that secret is combined with a user identifier — either an anonymous cookie or an LMS-provisioned athenaId — to compute a unique expected solution per participant.
Reproducible for the Learner
The same user gets the same flag on every attempt — revisiting the level, restarting the browser, or retrying after a break all produce identical solutions. Nothing needs to be memorized across sessions.
Invalid When Shared
A flag pasted into a Discord server won't verify for anyone else — it was deterministically derived from their cookie plus the level secret, and the next participant gets a different expected value.
Uniform Code Path
The parametrization function runs on every verification, even when disabled. In static mode it simply returns the level-wide solution. No divergent branches, no duplicated logic, no drift between assessment and demo deployments.
Admin-Configurable
Disable parametrization per deployment (for in-class demos and collaborative exercises) or selectively enable it on specific levels for graded assignments — all without modifying challenge code.
Verification Flow
User cookie + level code → deterministic expected flag → comparison with submitted flag.
Ease of Assessment
The administration interface is a separately deployed container. Assessment workflows are configured independently of challenge logic — the same challenge image serves demos, collaborative labs, and graded assignments.
Containerized & Reproducible
- Challenges ship as Docker images via Docker Hub, or as source repos with spawning scripts
- Consistent behaviour across Unix and Windows hosts — no VM required for learners
- Local network, classroom LAN, or public via reverse proxy — same image
- Kubernetes-based per-user isolation is a planned extension
LMS-Based Provisioning
- Default: anonymous cookie-based identifier, no PII
- Formal mode: upload CSV or JSON from your LMS — only the
athenaIdfield is stored - Same identifier can be shared across a team for collective progress
- Export interaction data and rejoin with the LMS export at grading time
Expert Evaluation
Positive Comments
- The framework was simple to use
- Yielded clean and useable results
- Provided adequate tools for production use
Suggestions
- Docstrings on framework internals COMPLETE
- Restructure framework for easier level addition COMPLETE
- Add more helper functions for things like logins PARTIAL
In-Lab Study
Builder
TA built 3 levels of varying difficulty. Documentation was provided — no pair programming this time.
Positive Comments
- Level building was simple and fast
- Documentation, including docstrings, was good
Students
Students completed pre- and post-surveys in the final week of labs.
Positive Comments
- Overall, students preferred Athena to other platforms
- More engaging and easier to follow
- Hints provided a good way to get help
Suggestions
- Hints could be more specific
- Hint load times were too long
Designed for Responsible Use
Teaching exploitation means exposing learners to real techniques. Athena's defaults are deliberately conservative.
Responsible Disclosure
All publicly released sample challenges are scoped to already-documented vulnerability classes. No zero-days, no pending disclosures — the platform never contributes to the spread of unreported flaws.
Academic Integrity
Per-user and per-team parametrized flags prevent trivial answer sharing. Instructors may selectively enable or disable this per level to fit collaborative labs versus graded work.
Constrained LLM Assistance
Hints are non-conversational, context-constrained, and grounded in creator-authored solution paths — reducing the risk of hallucinated guidance, full-answer leakage, and prompt-injection abuse.
Privacy by Default
Stored records contain only an anonymized identifier, challenge interaction metadata, and limited request history needed for hints or assessment. No demographic or personally identifying data is required.
Future Work
More LLM options
Extend the hint class to support additional providers — starting with Gemini.
More helper functions
Grow the authoring library so common patterns like auth and paging are a one-liner.
Creating more levels
Expand the shipped catalog so instructors can run a full semester out-of-the-box.
Better hint prompts
Iterate on prompt engineering to make hints sharper and more useful for stuck students.
Kubernetes deployments
Add Kubernetes manifests for safer, more scalable classroom and event deployments.