Audio summary · companion notes

Architecting the GRAPE MARS AI roadmap

A reference companion to the recorded discussion — what was covered, section by section.

GRAPE MARS is a hosted web platform built on Cloudflare. Researchers log in, create projects, and upload video; that video is stored in Cloudflare R2 and the project data and annotations live in a Cloudflare D1 database, with the interface served from Cloudflare Pages and background tasks running on Cloudflare Workers. One AI capability is live today — Whisper transcription — and it runs server-side on Cloudflare Workers AI. The annotation editor itself runs in the browser, with real-time collaboration.

This recording is a roughly twenty-minute conversation between two hosts about what GRAPE MARS could add next: which forms of AI annotation to integrate, in what order, and (the question they keep returning to) whether each should run on Cloudflare's servers or in the researcher's own browser. It describes the choices the team faces; it does not present conclusions.

Use this page while you listen. The contents below mirror the order of the conversation; each entry carries the timestamp where that part begins.

00:00

The stakes: a confidently wrong assistant

The discussion opens with an image: a polished, confident research paper whose underlying dataset turns out to have been fabricated by someone who simply didn't want to leave blanks on a spreadsheet. That, the hosts suggest, is the risk GRAPE MARS takes on as it adds AI — a model can behave like an over-confident assistant that would rather invent an answer than admit it doesn't know. The recording is framed directly to the research team building and using the platform, and sets out to examine the foundational choices ahead before any new capability is switched on.

00:54

Where the platform stands today

Before the choices, the hosts establish the baseline, and they are explicit that this is not a prototype: real research is already being done on real recordings. The platform is built entirely on Cloudflare. Uploaded video lives in Cloudflare R2 storage; project data and annotations live in a relational database, Cloudflare D1; the interface is served from Cloudflare Pages; and background tasks run on Cloudflare Workers. The single AI capability turned on today is Whisper transcription, and it runs server-side on Cloudflare's Workers AI — the only thing happening locally in the browser is the responsive, collaborative annotation editor. Their summary of the current reality: the video, the data, and the processing all live on Cloudflare.

02:37

The architectural crossroads

With the platform proven, the conversation turns to expansion: gesture tracking, gaze analysis, facial expression, and more. The hosts are clear that the question is no longer whether to add capabilities but which ones, in what sequence, and (the point they treat as most consequential) where the computation should actually run. Because everything today relies on server-side processing, the easy assumption is to keep adding new models in the cloud. The recording flags this as the single most important decision facing the team, because it shapes cost, privacy, and the feel of the tool all at once.

03:25

The case for running in the browser

The surprise the hosts highlight: roughly half of the capabilities the team wants could be built to run client-side — in the browser, on the researcher's own laptop — using technologies that let modern browsers reach the machine's graphics processor. They describe three benefits. Cost: work that runs locally adds nothing to the server bill, so hundreds of hours could be analysed for effectively zero marginal cost. Immediacy: instead of sending a clip to a queue and waiting, analysis could update live as a researcher scrubs the timeline. Privacy: for that locally-run work, the video would never leave the researcher's machine — a meaningful protection for recordings of vulnerable subjects.

"This local, client-side processing is a choice ahead of you — a potential direction to build toward. It is absolutely not how the platform operates or protects your data today. Cloudflare is the reality today."

The hosts stop to make this precise: client-side processing is an option on the menu for the future, not a current feature. Today, as established, the data is on Cloudflare.

06:33

Sprinters and heavyweights

The hosts sort the wished-for capabilities into two groups by what a browser can physically handle. The "sprinters" are light enough to run locally; the "heavyweights" are too mathematically dense and need server hardware. Tracking a hand across a screen, they note, is close to a reflex — mapping two-dimensional points frame by frame. Understanding that a hand waved, set down a cup, and came to rest over a continuous span requires holding a great deal of context over time, which is a different order of computation.

Could run client-side (light enough for the browser):

Speech work
Prosody: rhythm, stress, pitch of the voice
Basic audio handling: separating silence from speech
Foundational pose, hand and face tracking

Require the server (too dense for the browser):

Action recognition over long stretches of time
Multi-camera tracking of the same person
Overlapping sound-event detection in noisy rooms
Gaze tracking: where a subject is looking in 3-D

09:35

The rollout, and the budget underneath it

Because the sprinters are essentially free to run on local hardware while the heavyweights are billed by the millisecond of server time, the hosts describe the roadmap as a budget question as much as a technical one. The proposed sequence deliberately front-loads the cheap, private, foundational work and defers the expensive models — partly to do valuable research immediately without opening the cloud chequebook, and partly because the cost of the heavy models tends to fall year on year, so patience is itself a saving.

1. Pose, hand and face tracking: foundational to most gesture research, cheap to run, and high-privacy if built client-side.
2. Speaker diarisation: layered onto the existing server-side Whisper transcript to identify who spoke when.
3. Facial expression analysis: moving into the more expensive territory.
4. Gaze tracking: a heavyweight, deferred until its value is clear.
5. Shot detection: automatic segmentation of long footage.

The hosts stress this order is a draft, not a directive: research priorities should set it, and a team with a funded project blocked on gaze should say so and reorder.
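The budget reasoning in this section can be made concrete with a back-of-envelope sketch. The Python below is illustrative only: the per-millisecond rate, the compute ratio, and the names `Capability` and `marginal_server_cost` are hypothetical placeholders, not Cloudflare pricing and not figures from the recording. It shows only the shape of the argument: client-side work adds nothing to the server bill, while server-side work grows with the hours of footage.

```python
from dataclasses import dataclass

# All numbers below are hypothetical placeholders for illustration;
# they are NOT real Cloudflare prices or measurements from GRAPE MARS.
HYPOTHETICAL_RATE_PER_MS = 0.000002   # currency units per millisecond of server time
HYPOTHETICAL_COMPUTE_RATIO = 0.5      # server-seconds needed per second of footage


@dataclass(frozen=True)
class Capability:
    name: str
    runs_client_side: bool  # True for a "sprinter", False for a "heavyweight"


def marginal_server_cost(cap: Capability, footage_hours: float) -> float:
    """Estimate the added server bill for analysing `footage_hours` of video."""
    if cap.runs_client_side:
        # Work done in the researcher's browser adds nothing to the server bill.
        return 0.0
    footage_ms = footage_hours * 3600 * 1000
    server_ms = footage_ms * HYPOTHETICAL_COMPUTE_RATIO
    return server_ms * HYPOTHETICAL_RATE_PER_MS


pose = Capability("pose tracking", runs_client_side=True)
gaze = Capability("gaze tracking", runs_client_side=False)

for cap in (pose, gaze):
    print(cap.name, round(marginal_server_cost(cap, footage_hours=200), 2))
```

Swapping in measured rates and compute ratios, once the team has them, would turn this from an illustration into an actual estimate.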

10:41

The "94% accuracy" trap

Whether a model is free or expensive, the hosts argue, none of it matters if the output is quietly wrong — which leads to their warning about benchmark numbers. A figure like "94% accuracy" should be read as a theoretical ceiling reached under pristine lab conditions: clean audio, studio lighting, still subjects facing the camera, and training data drawn almost entirely from adult English speakers. The footage GRAPE MARS actually handles is the opposite — overlapping Valencian speech in echoing rooms, spontaneous gestures, people turning from the camera, and, crucially, children, who move unpredictably and speak at different pitches. Dropped into that, they note, a model's accuracy doesn't gently dip; its comprehension can collapse.
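One way to see why a single headline figure can mislead is to break accuracy down by recording condition. The sketch below is a minimal Python illustration; the condition labels and the handful of sample results are invented for the example and are not measurements from GRAPE MARS or from any real model.

```python
from collections import defaultdict

# Invented sample results: (recording condition, was the model's label correct?).
# These are illustrative only, not real evaluation data.
results = [
    ("studio_adult", True), ("studio_adult", True), ("studio_adult", True),
    ("studio_adult", True), ("studio_adult", True), ("studio_adult", True),
    ("classroom_child", True), ("classroom_child", False),
    ("classroom_child", False), ("classroom_child", False),
]


def accuracy_by_condition(rows):
    """Return {condition: fraction correct} so weak strata are not averaged away."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, is_correct in rows:
        total[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}


overall = sum(ok for _, ok in results) / len(results)
per_condition = accuracy_by_condition(results)

print(f"overall: {overall:.2f}")
for condition, acc in per_condition.items():
    print(f"{condition}: {acc:.2f}")
```

In this made-up sample the overall figure sits well above the accuracy on the classroom footage, because the aggregate is dominated by the easy condition. That is the pattern the hosts warn about.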

12:33

Confident failure, and the validation mandate

The danger the hosts emphasise is not the model that fails loudly. One that crashes and says "I don't know what I'm looking at" is annoying but safe, because it forces a human to step in. The dangerous model is like the student who, rather than leaving the page blank, writes a confident, beautifully formatted essay that is simply wrong: it makes a hallucinated guess, stamps it with a high confidence score, and outputs a clean spreadsheet that looks empirical. Pasted into a study, those invisible errors contaminate the dataset. The stated defence is a non-negotiable rule: no new capability is turned on by default until a specific, named member of the team has tested it against the platform's own messy, real-world Valencian footage.
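The rule described above (nothing on by default until a named person has tested it on the platform's own footage) could be enforced in code rather than left to convention. The following is a hypothetical Python sketch; `ValidationRecord`, `can_enable_by_default`, the field names, and the sample entries are assumptions made for illustration and do not describe existing GRAPE MARS code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ValidationRecord:
    """Hypothetical sign-off for one capability; all field names are assumptions."""
    capability: str
    owner: str                   # a specific, named team member
    tested_on_own_footage: bool  # tested on the platform's real recordings
    signed_off: Optional[date]   # None until the owner completes the test


def can_enable_by_default(capability: str, records: List[ValidationRecord]) -> bool:
    """Allow default-on only if a named owner signed off a test on real footage."""
    for rec in records:
        if (
            rec.capability == capability
            and rec.owner.strip()
            and rec.tested_on_own_footage
            and rec.signed_off is not None
        ):
            return True
    return False


records = [
    # "A. Researcher" and the date are placeholders, not real team data.
    ValidationRecord("pose_tracking", "A. Researcher", True, date(2025, 1, 15)),
    ValidationRecord("gaze_tracking", "", False, None),
]

print(can_enable_by_default("pose_tracking", records))
print(can_enable_by_default("gaze_tracking", records))
print(can_enable_by_default("shot_detection", records))
```

A capability with no record at all is treated the same as one that failed validation: it stays off by default.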

14:16

Feature-rich versus honest

That risk pushes the hosts to a question about what kind of tool GRAPE MARS should be: feature-rich or honest. A feature-rich tool promises total automation and asks the researcher to think less. An honest tool behaves with humility — it treats every automated output as a rough draft, openly demands that the researcher verify it, and flags its own low-confidence moments rather than smoothing them over behind a clean interface. The argument rests on credibility: in academia, a tool caught hallucinating data into a published paper loses trust permanently. An honest platform, they add, treats biometric-adjacent inferences — emotion, gaze, internal states — as highly sensitive, which is part of the case for leaning into client-side processing where it is possible.
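The idea of a tool that flags its own low-confidence moments can be sketched as a simple review filter. This is a minimal Python illustration under stated assumptions: the `Annotation` shape, the `REVIEW_THRESHOLD` value, and the function name are hypothetical, and the confidence score is treated only as a hint, since a model can be confidently wrong.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cut-off; a real value would come from validation on the team's footage.
REVIEW_THRESHOLD = 0.8


@dataclass
class Annotation:
    start_s: float
    end_s: float
    label: str
    confidence: float      # model-reported score in [0, 1]
    status: str = "draft"  # every automated output starts as a draft


def flag_for_review(annotations: List[Annotation]) -> List[Annotation]:
    """Mark low-confidence drafts explicitly instead of hiding the uncertainty."""
    for ann in annotations:
        if ann.confidence < REVIEW_THRESHOLD:
            ann.status = "draft_low_confidence"
    return annotations


drafts = flag_for_review([
    Annotation(0.0, 1.2, "wave", 0.95),
    Annotation(1.2, 2.0, "point", 0.41),
])
for ann in drafts:
    print(ann.label, ann.status)
```

Even the high-confidence item stays a draft in this sketch: nothing is marked as verified until a researcher has checked it.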

15:47

The six decisions the team is asked to make

The substantive part of the recording closes by distilling everything into six choices that, the hosts say, the researchers (not the developers) have to drive. They are presented here as a set you could print and bring to a meeting.

Decision 1

Licensing and data ownership

Use the highest-accuracy, state-of-the-art models that may carry restrictive commercial licences dictating how data is used — or accept a slightly lower accuracy ceiling for open models that keep data fully under the team's control.

At stake: control over how participant data may be used and stored.

Decision 2

Age and stability of models

Accept "frozen" models — stable and functional today but abandoned by their original developers and receiving no updates — or mandate only actively-maintained models, which narrows the options but keeps the tech evolving.

At stake: who maintains a dependency if it breaks later.

Decision 3

Prioritising the 18 capability areas

The sprinters are cheaper and easier, but the science should decide. Rank the eighteen identified capabilities by genuine, immediate research need rather than by what is technically convenient.

At stake: whether the roadmap follows the research or the engineering.

Decision 4

Locking the integration sequence

Formally agree the order — foundational pose and face tracking first to buy time before the heavyweights — so engineering effort is focused rather than split across local and server builds at once.

At stake: focused effort versus scattered half-builds.

Decision 5

Ownership of validation

Assign a specific named owner and a hard deadline for testing each model against real, unscripted footage of children. Without a name and a date, validation stays a good idea that never happens.

At stake: whether the validation mandate is real.

Decision 6

Consent and data classification

Set a formal stance on biometric-adjacent data before any participant video meets a new capability — including whether the university's data-protection office treats mapped facial geometry as personal data, and how inferred states are presented.

At stake: the ethical and legal foundation, set before code is written.

18:30

A position of strength — and a closing question

The hosts return to where the team is starting from: a solid, secure, working foundation on Cloudflare, which means these choices are being made from strength rather than to fix something broken. The task they describe is to bring specific project needs to the table, since the developers cannot guess what the science requires. They end on a reflection rather than an answer — if GRAPE MARS were built as a genuinely honest tool, one that refuses to simply hand over answers and instead forces its drafts to be checked, that friction might make for sharper, more sceptical researchers. The tool, in their framing, does not replace the researcher; it demands more of them.

Accuracy note: Cross-checked against the team briefing (grape-mars-where-we-are.md), the English episode's claims align with the current architecture: it states plainly that client-side processing is a future option and that data is on Cloudflare today. No substantive contradictions were found; spoken mispronunciations of the platform name in the audio are not reflected here.