Ontario Tech Science vialab · Visualization for Information Analysis

Athena CTF A Modular Framework for Instructional Capture-the-Flag Challenges

Zach Frank · Supervisors: Randy Fortier, Mariana Shimabukuro

View the Demo View Code Thesis Slides Conference Submission Demo

Scroll

Abstract

Athena CTF in Brief

An open-source modular framework for the rapid creation, containerization, and deployment of web-based CTF challenges — with assessment-ready per-user parametrization and constrained LLM-assisted hints.

10lines

A fully functional level in as few as ten lines of Python

115bytes

Average per-user record size in a 15-challenge demo deployment

0PII

No personally identifying information stored by default

01

Lightweight, Modular Authoring

Each level is a self-contained Page object. Authors supply an instructions endpoint and a verify function — the framework wires routing, templates, cookies and verification.

02

Assessment-Oriented

Deterministic per-user or per-team parametrization reduces trivial answer sharing; LMS-exported rosters (CSV/JSON) provision users by a single athenaId field — no other PII is stored.

03

Constrained LLM Hints

One-shot, non-conversational hints grounded in the user's interaction history and a creator-authored solution path. Supports commercial APIs or locally hosted models.

04

Containerized Delivery

Docker images or source-repo spawning scripts. Runs on Unix or Windows, on a local classroom network or behind a reverse proxy — no VM required for learners.

Introduction

The Problem With Current CTFs

01
Levels can be difficult to build
Authoring a level often means plumbing boilerplate, managing session state, and hand-rolling templates.
02
Difficult to deploy and manage
Instructors juggle Docker, database migrations and networking across classroom infrastructure.
03
Static flags lead to cheating
Once a flag leaks on a Discord server the challenge effectively becomes a free-for-all.
04
Static or no hints hurt UX
A stuck student with no scaffolding disengages and walks away — the opposite of the intended lesson.
05
Learning ends with the competition
Once the event closes, the challenges disappear. Students lose the chance to revisit and reflect.

Related Work

Where Athena Sits

Existing platforms cluster into deployment frameworks, self-directed learning platforms, and static wargames. Athena targets the gap between "too much to configure" and "too static to assess".

Deployment Frameworks

CTFd / RootTheBox — front-end scoreboards; challenge delivery is external.
kCTF / EDURange — Kubernetes automation, but authors must write full Dockerfiles.
Labtainers — broad VM exercises, no live progress tracking.

Athena removes the custom Dockerfile step and tracks progress through an optional NoSQL store.

Learning Platforms

PicoCTF — public challenges, static hints, shared flags make grading hard.
pwn.college DOJO — in-browser VM, heavy on network and compute for web tasks.
SEED Labs — strong Linux exploitation, VM-first for even intro web labs.

Athena runs in the learner's native browser, with per-user parametrized flags for formal coursework.

Wargame Platforms

bWAPP / DVWA — no centralized tracking; one-user-per-instance at best.
OWASP Juice Shop — tracks progress, but adding new levels requires deep code changes.

Athena's Page abstraction lets instructors author and register a new challenge without touching core logic.

Framework

Web-Based, Modular Architecture

Athena is an entirely web-native stack built on FastAPI and Jinja2 templates, with an optional MongoDB layer for tracking and a separately-deployed admin container.

End-to-End, Optional Where It Counts

Challenges can be distributed as Docker images through registries like Docker Hub, or as source repositories with provided spawning scripts. Both Unix and Windows hosts are supported. Learners interact through their native browser — no VM, no extensions, no preconfigured environment.

The optional database layer enables learner tracking, LMS-based provisioning, and LLM-assisted hints. Disabling it yields a fully standalone deployment suitable for demos or offline workshops.

Architecture Overview

Levels are Page objects registered at import time. A shared application core handles routing, sessions, templates and the verification pipeline; the database connector and LLM connector are both optional and swap-in/swap-out at deploy time.

In a 15-challenge demo, the average user record was ~115 bytes with an index of roughly 1 MB across ~11,000 users — small enough for free-tier cloud Mongo or a single classroom VM.

Framework

Constrained, Context-Driven Hints

One-shot plain-text hints generated on explicit user request — not an open-ended chat agent. The output space is bounded to reduce prompt injection, over-scaffolding, and solution leakage.

Grounded in Two Controlled Sources

User interaction history — prior requests and failed submissions, pulled from the centralized DB when enabled.
Creator-authored solution path — a write-up or executable script shipped with the level, loaded into memory at runtime.

The model is guided toward the intended solving strategy and prevented from inventing alternative paths.

Non-Conversational & Stateless

Each request is independent — no dialogue memory, no chained turns. This keeps inference costs predictable for classroom-scale deployments and sharply limits the attack surface for prompt injection.

Commercial APIs (Claude · ChatGPT) Local Models (Ollama, any) Static Hints (no LLM)

Admins choose the model, the context volume, or disable LLM hints entirely for assessments.

Framework

Modular Level Creation

Each level is a self-contained Page object. Authors provide an instructions endpoint (what the learner sees) and a verify endpoint (boolean correctness, optional error code) — the framework does the rest.

Jinja2 shared templates — a small set of layouts every level draws from, so styling stays consistent and white-labelling is trivial.
Server-side helper components — generate_button, FormGroup, FormData, tables, accordions; authors compose the UI in Python.
Safe-by-default rendering — injected content is sanitized automatically; authors can opt out only where a vulnerability genuinely requires raw HTML.
Auto-registration — Page objects register with the app at import time, giving every level uniform ordering, nav, and verification semantics.
Optional DB hookup — levels run standalone; enabling the centralized Mongo store unlocks tracking, hints, and LMS export with no code changes.

minimal_level.py · a full level in ~10 lines

from app.utils.page import Page
from app.utils.extensions import templates
from config import get_config

config = get_config()
tiny = Page("Tiny Level")

async def verify(request):
    data = await request.json()
    return {"success": data.get("flag") == "HELLO_WORLD"}

async def instructions(request):
    return templates.TemplateResponse(request=request,
        name=config.TEMPLATE,
        context={"text": "Submit the flag HELLO_WORLD"})

tiny.set_functions(verify=verify, instructions=instructions)

Framework

Deterministic Parametrization

Each level carries a level-specific secret generated at creation time. At verify time, that secret is combined with a user identifier — either an anonymous cookie or an LMS-provisioned athenaId — to compute a unique expected solution per participant.

Reproducible for the Learner

The same user gets the same flag on every attempt — revisiting the level, restarting the browser, or retrying after a break all produce identical solutions. Nothing needs to be memorized across sessions.

Invalid When Shared

A flag pasted into a Discord server won't verify for anyone else — it was deterministically derived from their cookie plus the level secret, and the next participant gets a different expected value.

Uniform Code Path

The parametrization function runs on every verification, even when disabled. In static mode it simply returns the level-wide solution. No divergent branches, no duplicated logic, no drift between assessment and demo deployments.

Admin-Configurable

Disable parametrization per deployment (for in-class demos and collaborative exercises) or selectively enable it on specific levels for graded assignments — all without modifying challenge code.

Verification Flow

User cookie + level code → deterministic expected flag → comparison with submitted flag.

Framework

Ease of Assessment

The administration interface is a separately deployed container. Assessment workflows are configured independently of challenge logic — the same challenge image serves demos, collaborative labs, and graded assignments.

Containerized & Reproducible

Challenges ship as Docker images via Docker Hub, or as source repos with spawning scripts
Consistent behaviour across Unix and Windows hosts — no VM required for learners
Local network, classroom LAN, or public via reverse proxy — same image
Kubernetes-based per-user isolation is a planned extension

LMS-Based Provisioning

Default: anonymous cookie-based identifier, no PII
Formal mode: upload CSV or JSON from your LMS — only the athenaId field is stored
Same identifier can be shared across a team for collective progress
Export interaction data and rejoin with the LMS export at grading time

https://athena-ctf.local/admin

Athena CTF admin interface — user provisioning and result export

Evaluation

Expert Evaluation

45min

Pair-programming session

0docs

Participants read no documentation beforehand

2experts

One TA and one professor

Positive Comments

The framework was simple to use
Yielded clean and useable results
Provided adequate tools for production use

Suggestions

Docstrings on framework internals COMPLETE
Restructure framework for easier level addition COMPLETE
Add more helper functions for things like logins PARTIAL

Evaluation

In-Lab Study

Builder

TA built 3 levels of varying difficulty. Documentation was provided — no pair programming this time.

Positive Comments

Level building was simple and fast
Documentation, including docstrings, was good

Students

Students completed pre- and post-surveys in the final week of labs.

Positive Comments

Overall, students preferred Athena to other platforms
More engaging and easier to follow
Hints provided a good way to get help

Suggestions

Hints could be more specific
Hint load times were too long

Ethical Considerations

Designed for Responsible Use

Teaching exploitation means exposing learners to real techniques. Athena's defaults are deliberately conservative.

Responsible Disclosure

All publicly released sample challenges are scoped to already-documented vulnerability classes. No zero-days, no pending disclosures — the platform never contributes to the spread of unreported flaws.

Academic Integrity

Per-user and per-team parametrized flags prevent trivial answer sharing. Instructors may selectively enable or disable this per level to fit collaborative labs versus graded work.

Constrained LLM Assistance

Hints are non-conversational, context-constrained, and grounded in creator-authored solution paths — reducing the risk of hallucinated guidance, full-answer leakage, and prompt-injection abuse.

Privacy by Default

Stored records contain only an anonymized identifier, challenge interaction metadata, and limited request history needed for hints or assessment. No demographic or personally identifying data is required.

Evaluation

Future Work

More LLM options

Extend the hint class to support additional providers — starting with Gemini.

More helper functions

Grow the authoring library so common patterns like auth and paging are a one-liner.

Creating more levels

Expand the shipped catalog so instructors can run a full semester out-of-the-box.

Better hint prompts

Iterate on prompt engineering to make hints sharper and more useful for stuck students.

Kubernetes deployments

Add Kubernetes manifests for safer, more scalable classroom and event deployments.

Athena CTF A Modular Framework for Instructional Capture-the-Flag Challenges

Athena CTF in Brief

Lightweight, Modular Authoring

Assessment-Oriented

Constrained LLM Hints

Containerized Delivery

The Problem With Current CTFs

Levels can be difficult to build

Difficult to deploy and manage

Static flags lead to cheating

Static or no hints hurt UX

Learning ends with the competition

Where Athena Sits

Deployment Frameworks

Learning Platforms

Wargame Platforms

Web-Based, Modular Architecture

End-to-End, Optional Where It Counts

Architecture Overview

Constrained, Context-Driven Hints

Grounded in Two Controlled Sources

Non-Conversational & Stateless

Modular Level Creation

Deterministic Parametrization

Reproducible for the Learner

Invalid When Shared

Uniform Code Path

Admin-Configurable

Verification Flow

Ease of Assessment

Containerized & Reproducible

LMS-Based Provisioning

Expert Evaluation

Positive Comments

Suggestions

In-Lab Study

Builder

Positive Comments

Students

Positive Comments

Suggestions

Designed for Responsible Use

Responsible Disclosure

Academic Integrity

Constrained LLM Assistance

Privacy by Default

Future Work

More LLM options

More helper functions

Creating more levels

Better hint prompts

Kubernetes deployments