
Building the Digital Fluency Platform

Technical approach

Purpose of this document

This document explains how the Digital Fluency platform is built and why each architectural choice was made. The intended reader is a collaborator, contractor, or technical reviewer who wants to understand what we are committing to before writing code.

The architecture is constrained by three things, in this order:

  1. The pedagogy (pedagogy.md) — the five design moves and the productive-struggle / metacognitive-overload findings dictate what the system must observe, what it must respond to, and how fast.
  2. The current state of AI agent technology (mid-2026) — what frontier vision-based "computer-use" agents can and cannot do, and where they will plausibly be in 12 months.
  3. Realistic build cost — what's available open-source, what must be built, what should be deferred to v2.

This is a living document. Field research with digital-skills providers (fieldwork.md) will reshape several sections, particularly the AI co-pilot's intervention specifics and the curriculum-content assumptions.


1. Why simulated, not vision-agent overlay

The single biggest architectural decision is whether to build a simulated environment (we own the canvas, we own the state) or a vision-agent overlay (a Claude / Operator / Gemini computer-use agent watches the user navigate real websites and apps).

We commit to the simulated environment. The case rests on four findings.

1.1 Vision-based computer-use agents cannot meet the latency budget for real-time coaching

Current vision agents take 2–7 seconds per step (screenshot → API call → inference → action). For a real-time tutoring use case — "user makes a mistake, coach intervenes within 1–2 seconds" — that is 4–8x too slow. By the time the agent has analyzed the screen, the user has already moved on or compounded the error.

An instrumented sandbox responds in <10ms because we already know the state — every input field, every focus event, every keystroke pause is local. There is no inference step.

For the Bastani-style engagement-mediated learning that our pedagogy depends on, low-latency feedback is not a nice-to-have. The mechanism is "watch the user attempt, intervene when productive struggle becomes distress." A 5-second latency means the intervention lands after the moment is over.

1.2 The benchmark numbers do not transfer to production

The research synthesis (saved at research/summaries/instrumented-vs-vision-telemetry.md and the computer-use-reliability findings in chat) shows a sharp gap between benchmark scores and production-site reliability.

The benchmark-vs-production gap is not narrowing because production sites adversarially evolve against scrapers and agents (CAPTCHAs, bot detection, dynamic DOM, geo-fencing). Benchmarks don't have those defenses; the real web does.

A learning product whose coaching layer breaks on a meaningful fraction of real sites is not a learning product.

1.3 Vision agents miss the cognitive signals our pedagogy depends on

Even when vision agents correctly identify on-screen state, they cannot see the behavioral signals behind the screen: hesitation before a click, keystroke pauses mid-composition, partial input abandoned for a window switch, repeated undo loops.

These are exactly the signals Bastani's RL-driven tutor used to build a richer "knowledge state" than the binary correct/incorrect signals BKT can offer. They are invisible to vision. They are trivial to capture in an instrumented sandbox.

1.4 Build-on-top libraries close most of the gap on the simulated side

The library-landscape research (saved at research/summaries/browser-desktop-simulation-ecosystem.md) found that ~50–60% of the simulated-environment build can be assembled from MIT-licensed open-source projects; the daedalOS-derived desktop shell and the ZenFS virtual filesystem shown in the §2 architecture are the anchor pieces.

The remaining 40–50% is custom — the email client mock, the AI tutor integration layer, and the pedagogical telemetry — but that work is unavoidable in either architecture.

Conclusion

Vision-agent overlay would lock us into a slow, unreliable, telemetry-poor coaching layer running against a defended, adversarial web. A simulated sandbox lets us own the latency budget, the signal richness, and the curriculum surface. The 12-month forecast for vision agents (per the agent-reliability research) is incremental improvement on benchmarks but flat on production-site reliability — the gap is structural, not temporary.

This decision is revisable. If, by 2027, frontier computer-use agents reach <500ms latency on real sites with >90% reliability, we should re-evaluate. We do not believe that is plausible on this horizon.


2. Architecture overview

[Figure: platform architecture. Browser (single page): a daedalOS-derived simulated desktop shell (Browser, Email, Docs, Forms, Files) on a ZenFS virtual filesystem, plus the AI co-pilot side panel (hint button, pattern name, debrief questions, demo as a rare last resort), both feeding the telemetry layer (§3: keystroke timing, dwell, focus, undo, paste, partial input, hover, scroll, attempt count). A structured event stream goes to the backend: a task engine (state machine, far-transfer selection, contrasting cases), a tutor orchestrator (struggle detector, intervention rules, prompt construction) calling the Anthropic API (Sonnet for reasoning, Haiku for nudges), and a per-user knowledge state store (pattern mastery, near/far transfer scores, engagement metrics).]
System architecture. Browser hosts the simulated desktop and AI co-pilot; both surface every interaction through a shared telemetry layer. The backend owns task selection, tutor orchestration, and the per-user knowledge state.

What lives in the browser

What lives on the backend

What we deliberately don't build


3. Instrumentation / telemetry layer

The telemetry layer is the architectural choice that makes the rest of the system possible. It is also the largest piece of custom build effort.

What we capture

For every interactive element (input, button, link, draggable, focusable), we capture the raw event classes shown in the §2 diagram: keystroke timing, dwell, focus changes, undo, paste (with origin), partial input, hover, scroll, and attempt count.

Why each signal matters

Each signal maps to a pedagogical observation:

| Signal | What it tells us |
|---|---|
| Inter-key pauses > 2s | Cognitive load — composing a thought, not typing it |
| Dwell-before-click > 3s | Uncertainty — considering an option, not committing |
| Hover-then-no-click | Considered and rejected; useful for contrasting-case design |
| Partial input + window switch | Hit a roadblock that needs information from elsewhere |
| Repeated undo on same edit | Mental model not yet stable; correct → revert → correct loop |
| Paste origin = AI panel | User is offloading to AI rather than thinking |
| Time-since-last-action > 30s | Stuck; threshold for considering intervention |
| Repeated wrong-target click | Possibly UI confusion, possibly conceptual confusion — the orchestrator must distinguish |

These signals are precisely the ones Bastani et al. used to build their RL knowledge-state estimator — and the ones invisible to vision agents.
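
To keep the signal definitions concrete, here is a minimal sketch of an event shape and two derived signals from the table above. The type, field names, and thresholds are illustrative, not a committed schema.

```typescript
// Illustrative only: event shape and two derived signals from the table above.
// Field names and thresholds are placeholders, not a committed schema.
interface TelemetryEvent {
  type: "keydown" | "focus" | "blur" | "click" | "hover" | "paste" | "undo" | "scroll";
  target: string;                  // stable id of the interactive element
  timestamp: number;               // ms since task start
  meta?: Record<string, unknown>;  // e.g. paste origin, partial input length
}

// Inter-key pause > 2s on the same field: composing a thought, not typing it.
function interKeyPauses(events: TelemetryEvent[], thresholdMs = 2000): number {
  const keydowns = events.filter(e => e.type === "keydown");
  let pauses = 0;
  for (let i = 1; i < keydowns.length; i++) {
    if (keydowns[i].target === keydowns[i - 1].target &&
        keydowns[i].timestamp - keydowns[i - 1].timestamp > thresholdMs) {
      pauses++;
    }
  }
  return pauses;
}

// Time since last action > 30s is the "stuck" threshold from the table.
function isStuck(events: TelemetryEvent[], now: number, thresholdMs = 30_000): boolean {
  const last = events[events.length - 1];
  return last !== undefined && now - last.timestamp > thresholdMs;
}
```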

Storage and privacy

Events stream to the backend in batches (~5s windows) and persist as a per-user event log. PII is minimal — no real names, no real email content, no real-world account information; the simulated environment never connects to the real web. Event logs are retained for the duration of an active learning relationship plus a research-purpose retention window to be defined in the privacy policy.
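
A sketch of that batching behavior in the browser, reusing the TelemetryEvent type from the sketch above; the endpoint path, payload shape, and retry handling are assumptions.

```typescript
// Illustrative batching loop: buffer events locally, flush every ~5 seconds.
// The /telemetry endpoint and payload shape are assumptions, not a fixed API.
const buffer: TelemetryEvent[] = [];

function record(event: TelemetryEvent): void {
  buffer.push(event);
}

setInterval(async () => {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // drain the buffer
  try {
    await fetch("/telemetry", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ sessionId: "anon-session", events: batch }),
    });
  } catch {
    buffer.unshift(...batch); // keep events for the next window on failure
  }
}, 5_000);
```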

What we don't capture


4. AI co-pilot integration

The co-pilot is where the pedagogy meets the LLM API. Its design is constrained by three findings:

  1. Bastani 2026: Engagement-mediated gains came from a chatbot prompted to refuse direct answers until students demonstrated substantial effort. Our co-pilot must do the same.
  2. McCarthy 2018 "metacognitive overload": Reflective prompts that work for confident learners hurt low-confidence learners when mis-timed. Our population is low-confidence by definition.
  3. Schwartz & Bransford 1998: Telling without readiness produces memorization, not transfer. Intervention timing is everything.

Intervention rules (v1)

The orchestrator decides whether to intervene based on signals from the telemetry layer. The default is silence — the co-pilot does not speak unless a trigger fires; the starting trigger set is derived from the §3 thresholds (for example, time-since-last-action > 30s or repeated wrong-target clicks).

When a trigger fires, the orchestrator decides what to do — see prompt-construction below.
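
The trigger set itself will be tightened against the field research (see "How this connects back"). As a placeholder, reusing the TelemetryEvent type and isStuck helper from the §3 sketch, a first-pass evaluation over the recent event window might look like this; trigger names and repeat counts are assumptions.

```typescript
// Placeholder trigger evaluation over the recent event window.
// Trigger names and repeat counts are illustrative; the real set is TBD per fieldwork.
type Trigger = "stuck" | "repeated-wrong-target" | "repeated-undo" | null;

function evaluateTriggers(events: TelemetryEvent[], now: number): Trigger {
  if (isStuck(events, now)) return "stuck"; // >30s since last action (§3)

  const wrongClicks = events.filter(
    e => e.type === "click" && e.meta?.["wrongTarget"] === true
  );
  if (wrongClicks.length >= 3) return "repeated-wrong-target";

  const undos = events.filter(e => e.type === "undo");
  if (undos.length >= 3) return "repeated-undo";

  return null; // default is silence
}
```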

When the user explicitly asks for help (clicks the help button), the co-pilot acknowledges but does not immediately provide an answer. It offers a graduated escalation: re-state the goal → ask what the user has tried → ask what they think the next step might be → offer a hint that names a relevant pattern → demonstrate (rare, last resort).
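
That ladder is easiest to keep honest as an explicit ordered structure rather than prose buried in the system prompt. A minimal sketch, with illustrative level names:

```typescript
// The graduated escalation from the paragraph above, as an explicit ordered list.
// Names are illustrative; the ladder only ever moves forward within a help request.
const escalationLadder = [
  "restate-goal",       // re-state what the task is asking for
  "ask-what-tried",     // ask what the user has tried so far
  "ask-next-step",      // ask what they think the next step might be
  "hint-name-pattern",  // offer a hint that names a relevant pattern
  "demonstrate",        // rare, last resort
] as const;

type EscalationLevel = (typeof escalationLadder)[number];

function nextLevel(current: EscalationLevel | null): EscalationLevel {
  if (current === null) return escalationLadder[0];
  const i = escalationLadder.indexOf(current);
  return escalationLadder[Math.min(i + 1, escalationLadder.length - 1)];
}
```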

Prompt construction

Every co-pilot response is built from three components, sent to the model with prompt caching enabled (the system prompt is the cacheable portion):

  1. System prompt (cached): the co-pilot's identity, voice, the refuse-until-effort rule, the pattern vocabulary, the list of forbidden moves (no "simply", no "just", no "obviously"), and the response-format schema.
  2. User context (mostly cached, refreshed per session): the user's known mastered patterns, current curriculum level, any relevant struggle history, and the current task description.
  3. Current event window (uncached): the last ~30 seconds of telemetry events plus the trigger that fired.
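
A minimal sketch of that three-component call using the Anthropic TypeScript SDK, with cache_control marking the cacheable blocks. The model id, variable names, token budget, and prompt contents are placeholders, not commitments.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Placeholders: in practice these come from the prompt store and the telemetry layer.
const COPILOT_SYSTEM_PROMPT = "..."; // component 1: identity, voice, refuse-until-effort rule, pattern vocabulary
const userContextBlock = "...";      // component 2: mastered patterns, level, struggle history, current task
const eventWindowBlock = "...";      // component 3: last ~30s of telemetry events plus the trigger that fired

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-0", // placeholder model id
  max_tokens: 500,
  system: [
    // Component 1 is the large, stable block, so it carries the cache marker.
    { type: "text", text: COPILOT_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ],
  messages: [
    {
      role: "user",
      content: [
        // Component 2 changes per session, not per call, so it is also marked cacheable.
        { type: "text", text: userContextBlock, cache_control: { type: "ephemeral" } },
        // Component 3 is fresh every call and stays uncached.
        { type: "text", text: eventWindowBlock },
      ],
    },
  ],
});
```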

The model is asked to produce a response in a structured schema; an illustrative sketch follows below.
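
Field names here are assumptions (reusing the EscalationLevel type from the escalation sketch above); the real contract should be pinned down alongside the intervention rules.

```typescript
// Hypothetical response schema; field names are assumptions, not the committed contract.
interface CoPilotResponse {
  intervention: EscalationLevel | "silent"; // which rung of the ladder, or stay quiet
  patternName?: string;                     // named pattern, when a hint names one
  message: string;                          // text rendered in the co-pilot side panel
  debriefQuestions?: string[];              // optional post-task reflection prompts
}
```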

Model choice

Forbidden moves (the "what the co-pilot must never do" list)

Drawn from pedagogy.md and from what we expect the field research (fieldwork.md) to reinforce:


5. State estimation for adaptive sequencing

Bastani et al.'s system uses a particle-filtering knowledge-state estimator with model-predictive control to select the next problem. That's a research-grade approach. For our v1, a simpler estimator is sufficient.

v1: LLM-as-judge over interaction history

This is a heuristic system, not an optimal one. Its main virtue is that it's debuggable — every selection decision can be traced to a specific judgment.
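
A first-pass sketch of the estimator: summarize recent attempts from the telemetry log, ask the model for a per-pattern mastery estimate with a one-sentence rationale, and store the rationale so each selection decision stays traceable. Function names, the 0–1 scale, and the prompt wording are assumptions.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Illustrative v1 estimator: ask the model to judge mastery from interaction history.
// Prompt wording, field names, and the 0–1 mastery scale are assumptions.
interface MasteryEstimate {
  pattern: string;   // e.g. "search-refinement" (hypothetical pattern name)
  mastery: number;   // coarse 0–1 estimate
  rationale: string; // one sentence explaining the judgment (keeps decisions debuggable)
}

async function estimateKnowledgeState(
  client: Anthropic,
  attemptSummaries: string[], // per-task summaries built from the telemetry log
): Promise<MasteryEstimate[]> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-0", // placeholder model id
    max_tokens: 800,
    system: "You are grading mastery of named digital-skills patterns. Return JSON only.",
    messages: [{ role: "user", content: attemptSummaries.join("\n\n") }],
  });
  const block = response.content[0];
  const text = block.type === "text" ? block.text : "[]";
  return JSON.parse(text) as MasteryEstimate[];
}
```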

v2: Bayesian Knowledge Tracing or Deep Knowledge Tracing

Once we have telemetry on enough users (~500–1,000 active), we can fit a proper knowledge-tracing model. The classic options are Bayesian Knowledge Tracing (BKT) and Deep Knowledge Tracing (DKT).

The choice between these is empirical and should be made when we have data — premature commitment is wasted effort.
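
For orientation, the core BKT update is compact enough to sketch. The parameters (prior, transit, slip, guess) are exactly what needs fitting once the telemetry exists; any values used before that are placeholders.

```typescript
// Classic Bayesian Knowledge Tracing update for one skill/pattern.
// Parameter values must be fit from real telemetry; nothing here is pre-committed.
interface BktParams {
  pInit: number;    // P(L0): prior probability the pattern is already mastered
  pTransit: number; // P(T): probability of learning after an opportunity
  pSlip: number;    // P(S): probability of an error despite mastery
  pGuess: number;   // P(G): probability of success without mastery
}

function bktUpdate(pMastery: number, correct: boolean, p: BktParams): number {
  // Posterior P(L | observation) via Bayes' rule.
  const posterior = correct
    ? (pMastery * (1 - p.pSlip)) /
      (pMastery * (1 - p.pSlip) + (1 - pMastery) * p.pGuess)
    : (pMastery * p.pSlip) /
      (pMastery * p.pSlip + (1 - pMastery) * (1 - p.pGuess));
  // Learning transition after the opportunity.
  return posterior + (1 - posterior) * p.pTransit;
}
```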

Why not Bastani-style RL at v1

Particle-filtering + MPC is real engineering work and requires:

We get there in v2 if and only if the heuristic system shows clear evidence of being a bottleneck. Most likely it will not be the bottleneck — content quality and intervention design will be.


6. Cost model

Honest answer: uncertain, but tractable. The dominant cost per active user is LLM inference; everything else (hosting, storage, telemetry) is small.

Token-cost estimate (v1, per active user-hour)

Assumptions:

If we move high-frequency low-stakes calls to Haiku (10x cheaper for those), per-hour cost drops to **$0.05–0.10**.

For a 5-month learning relationship at 2 hours/week: 40 hours × $0.10 ≈ $4 per user. This is roughly an order of magnitude lower than human-tutored alternatives and, per session, a factor of 5–10 cheaper than the OpenAI Operator coaching scenario the agent-reliability research priced out ($1.25–10 per session).
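
The same arithmetic, parameterized so the sensitivity to session length and per-hour rate is easy to re-check as the assumptions move.

```typescript
// Per-user LLM cost over a learning relationship, using the assumptions above.
// costPerHour reflects the $0.05–0.10 range after moving low-stakes calls to Haiku.
function perUserCost(weeks: number, hoursPerWeek: number, costPerHour: number): number {
  return weeks * hoursPerWeek * costPerHour;
}

// ~5 months ≈ 20 weeks at 2 hours/week and $0.10/hour ≈ $4 per user.
const estimate = perUserCost(20, 2, 0.10);
```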

Where uncertainty lives

Where caching matters most

The system prompt (co-pilot identity, voice, rules, pattern taxonomy) is the largest cacheable block. Stabilizing it is high-priority engineering work — every change to the system prompt invalidates the cache for active users.


7. Build phases

v0: proof of concept (4–6 weeks, single engineer)

Goal: validate that we can instrument a sandbox, route telemetry to an LLM-driven co-pilot, and produce a coherent pattern-naming intervention. Not a learning product.

Success criterion: a developer (not a target user) can run the task, get plausible co-pilot interventions at appropriate moments, and the telemetry log captures the events that justified those interventions.

v1: MVP (4–6 months, small team)

Goal: a deployable learning product for a constrained pilot cohort.

Success criterion: defined by the partner-pilot deal terms (TBD per fieldwork.md Phase 3 outcomes).

v2: adaptive (deferred — 6+ months post-v1)

Goal: replace heuristic state estimation with a learned model; expand curriculum.


8. Open technical questions

These are the questions we don't yet have a confident answer to. They are deferred, not ignored.

8.1 Sandbox-to-real transfer measurement

The pedagogy commits to far-transfer assessment in novel contexts. The question is: do those novel contexts have to include real applications (real Gmail, real Google Docs), or can they be sufficiently varied within the simulated environment to constitute genuine transfer?

The aviation/medical-simulator literature suggests functional fidelity > physical fidelity — a flight simulator that captures the right decision points produces better real-world transfer than a high-fidelity replica that doesn't. Plausibly the same holds for our domain. But it is empirical and we have not tested it. The v2 question is whether to add a real-app transfer instrument (perhaps via vision agent at that stage), or whether varied surface forms within the sim are sufficient.

8.2 Cold-start state estimation

A new user has no telemetry history. The LLM-as-judge estimator can only judge after at least one task. How do we pick the first task? Likely: a brief diagnostic micro-task (~3 minutes) with predetermined difficulty calibration that produces a coarse initial mastery vector. The diagnostic itself is a content design problem, not a technical one.

8.3 Multilingual scope

Our v1 target is English-language users. The Urban Institute brief flags non-English speakers as a population providers struggle to serve. Adding multilingual support to the simulated apps is moderate engineering effort; adding it to the co-pilot is essentially free at the LLM layer (Claude's multilingual performance is strong). The hard part is the curriculum — task design that works across languages and reading levels. Defer to v2 unless field research surfaces it as a v1 distribution requirement.

8.4 The "third-level digital divide" structural problem

Per research/summaries/adult-ct-and-digital-skills-transfer.md: training transfer is diluted when learners can't practice at home (no internet, no device). Our product cannot solve this from inside the browser. The v2 question is whether to partner with hardware-and-internet distribution programs (KEYSPOT, hotspot loans through libraries), or to optimize for the case where the user does have access. v1 assumes access; v2 may need to address it.

8.5 Engagement architecture

Pure individual? Cohort-based? Hybrid? The Bastani RCT was effectively individual-with-instructor-context (the Python course had teachers). Whether sustained adult engagement requires a peer or instructor layer is exactly the field-research question (fieldwork.md Q3). Architecture differs significantly between the options. Defer the choice until field research reports.


How this connects back

This doc will need substantial revision after Phase 2 of the field-research program. The intervention timing rules in §4, in particular, are educated guesses — they should be tightened against observed instructor behavior before any code commits to them.