Live broadcasting hasn't really changed since 1936.
Two people in a small room, looking at a pitch through a window, talking into a microphone for ninety minutes. The microphones got better. The cameras multiplied. The graphics learned to fly. But the booth, the actual room where the match becomes a story, has remained almost exactly the same shape it had the year Berlin lit the Olympic torch and the BBC sent its first television signal from Alexandra Palace to a few hundred sets.
That's not a complaint. The booth works. It's the highest-leverage piece of any live event: a single pair of voices turning motion into meaning at the speed of attention. The reason it hasn't changed is that nothing has been good enough to change it. Until now.
"Live broadcasting is the last great unautomated craft in media. It is also, by some distance, the most loved." (From the desk)
This is a manifesto for what happens when there is, finally, something good enough. Not to replace the booth. To fill the booths that have never had anyone in them. The 96th-minute Bundesliga 2 fixture nobody could afford to call. The college handball semifinal in Cantonese. The chess world cup in Portuguese. The earnings call narrated as it happens. The parliamentary floor in a language you don't speak. The booth that was always going to be empty, lit, and quiet. We are turning the lights on, starting with sport, because sport is the hardest thing to call well, but not stopping there.
An agent booth isn't a chatbot. It isn't a TTS engine. It is a working room.
The metaphor that has slowed AI for live events is the wrong one. The thing in the booth is not a "voice assistant," and it is not a script reader. It is a colleague, closer in shape to a producer, a play-by-play announcer, and a color analyst sharing a desk and an earpiece, than to anything you've talked to in a chat window.
An agent booth has four properties a chatbot does not. Persistent state: it remembers the second minute when it speaks in the eighty-ninth. Embodied roles: Alex calls; Maya analyzes; a director cuts between them; sometimes a third voice translates. Discretion: it knows when to shut up. And real-time grounding: every word it says is anchored to a specific event in a specific second of a specific match, traceable back to the data feed that authored it.
Put together: it is a booth. A small group of agents, with different jobs, sharing a live picture of the world, talking to each other and to you, on the air, in the moment.
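Grounding and discretion are the two properties that are easiest to show in code. The sketch below is illustrative only: `FeedEvent`, `Utterance`, `ground`, and `may_air` are hypothetical names, and the rule "an untraceable line is not spoken" is an assumption about how the two properties could meet, not a description of the production system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeedEvent:
    event_id: str      # hypothetical identifier assigned by the data feed
    match_second: int  # seconds since kick-off
    kind: str          # e.g. "shot", "goal", "foul"


@dataclass(frozen=True)
class Utterance:
    speaker: str          # role name, e.g. "Alex" or "Maya"
    text: str
    source_event_id: str  # the feed event this line is anchored to


def ground(utterance: Utterance, log: dict[str, FeedEvent]) -> Optional[FeedEvent]:
    """Return the feed event behind a line, or None if it cannot be traced."""
    return log.get(utterance.source_event_id)


def may_air(utterance: Utterance, log: dict[str, FeedEvent]) -> bool:
    """Discretion in its simplest form: an untraceable line is not spoken."""
    return ground(utterance, log) is not None


# A goal in the eighty-ninth minute (5340 seconds in), and two candidate lines.
log = {"e-101": FeedEvent("e-101", 5340, "goal")}
line = Utterance("Alex", "He's buried it!", "e-101")
stray = Utterance("Maya", "Unverified claim.", "e-999")
```

Here `may_air(line, log)` is true and `may_air(stray, log)` is false: the second line points at an event the feed never sent, so the booth stays quiet.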
From kick-off to your ear in 1.6 seconds.
Speed is the constraint. If a goal goes in and your booth describes it eight seconds later, the booth is broken. Not late, broken. So we built the entire system around a budget: 1.6 seconds, ball in net to first phoneme, ninety-fifth percentile.
That budget is split across five stages. Each has its own latency bound, its own fallback path, and its own correctness contract. None of them are allowed to be flaky. All of them are allowed to be wrong, briefly, in well-understood ways.
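One way to hold a budget like this is to give every stage its own bound and check the ninety-fifth percentile of its measured latency against it. The sketch below assumes a split across the five stages; the per-stage numbers are invented for illustration and only their 1,600 ms total comes from the text. `p95` and `over_budget` are hypothetical helper names.

```python
BUDGET_MS = 1600  # from the text: ball in net to first phoneme, 95th percentile

# Illustrative split only. The real per-stage bounds are not stated anywhere
# in this document; these values are assumptions chosen to sum to the budget.
STAGE_BOUNDS_MS = {
    "ingest": 300,
    "state": 100,
    "script": 500,
    "voice": 500,
    "embed": 200,
}


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def over_budget(stage_samples: dict[str, list[float]]) -> list[str]:
    """Names of the stages whose p95 latency exceeds their own bound."""
    return [
        name
        for name, samples in stage_samples.items()
        if p95(samples) > STAGE_BOUNDS_MS[name]
    ]
```

A per-stage check like this says which stage blew the budget, not just that the booth was late, which is what a fallback path needs to know.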
The five stages
The pipeline is, deliberately, boring. We chose dependable shapes for every step. The interesting part is what travels between them: a typed, versioned event stream that everything else in the system reads from.
- Ingest: partner data feeds and computer vision merged into a unified event stream. Events arrive irregularly; we de-jitter on a 200 ms window.
- State: a deterministic match model. Score, time, possession, formation, momentum, twenty other features. Reconstructable from logs at any point.
- Script: a planning agent decides what should be said next, by whom, with what emotion, in what language. Plans are short; they're allowed to be wrong; they regenerate every tick.
- Voice: neural synthesis with prosody primitives. Volume, urgency, breath, laughter, the rising shout of a goal. Streamed in chunks.
- Embed: a single <script> tag that lives on the publisher's page. Audio plays. Captions render. State updates in real time. Done.
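The first two stages can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the `Event` fields, the `SCHEMA_VERSION` tag, and the `DeJitter`, `MatchState`, `apply_event`, and `replay` names are all hypothetical, and the state tracks only three of the features listed above. What it does show is the shape: events are held for the 200 ms window and released in timestamp order, and the match state is a pure fold over the log, so replaying the same log always yields the same state.

```python
import heapq
from dataclasses import dataclass, field, replace

SCHEMA_VERSION = 1  # hypothetical version tag for the event schema
WINDOW_MS = 200     # from the text: the de-jitter window


@dataclass(frozen=True, order=True)
class Event:
    ts_ms: int                        # source timestamp, used for ordering
    seq: int                          # tie-breaker for equal timestamps
    kind: str = field(compare=False)  # "goal", "possession", ...
    team: str = field(compare=False)  # "home" or "away"
    version: int = field(default=SCHEMA_VERSION, compare=False)


class DeJitter:
    """Hold events briefly so that late arrivals can be put back in order."""

    def __init__(self, window_ms: int = WINDOW_MS) -> None:
        self.window_ms = window_ms
        self._heap: list[Event] = []

    def push(self, event: Event) -> None:
        heapq.heappush(self._heap, event)

    def release(self, now_ms: int) -> list[Event]:
        """Pop, in timestamp order, every event at least one window old."""
        out = []
        while self._heap and self._heap[0].ts_ms <= now_ms - self.window_ms:
            out.append(heapq.heappop(self._heap))
        return out


@dataclass(frozen=True)
class MatchState:
    home: int = 0
    away: int = 0
    possession: str = ""
    clock_ms: int = 0


def apply_event(state: MatchState, e: Event) -> MatchState:
    """Pure transition: the same state and event always give the same result."""
    state = replace(state, clock_ms=e.ts_ms)
    if e.kind == "goal":
        if e.team == "home":
            return replace(state, home=state.home + 1)
        return replace(state, away=state.away + 1)
    if e.kind == "possession":
        return replace(state, possession=e.team)
    return state


def replay(log: list[Event]) -> MatchState:
    """Rebuild the match state from scratch by folding over the event log."""
    state = MatchState()
    for e in log:
        state = apply_event(state, e)
    return state
```

In this sketch an event that arrives later than the window is still released on the next call, just out of order relative to what has already gone out; how the real pipeline treats that case is not covered here.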
Why we shipped two agents instead of one.
One voice is a podcast. Two voices is a broadcast. The difference is the size of the room.
A single narrator carries everything: facts, drama, context, emotion. They have to. They're alone. So they tend toward a flat, careful affect: the audiobook problem. Two narrators get to react to each other, which is what makes a booth feel like a place. Alex calls the play; Maya hears it and replies. The match becomes a conversation about a thing happening, instead of a monologue describing a thing happening.
"Two voices is not twice the talking. It is a different kind of listening." (Design principle)
We hand-tuned the relationship. Alex is faster, terser, more in the moment, closer to the play-by-play tradition. Maya is slower, drier, deeper, closer to the analyst's chair. They overlap in the right places (a goal: both shout). They give each other room in the right places (a slow build: Alex backs off, Maya threads). The director keeps the air moving and never lets either one drone.
A live broadcast has grammar. We wrote it down.
Watch ten minutes of any decent football call with a careful ear and you'll notice it: the booth doesn't speak in sentences, it speaks in moves. There's the build: short observations stitched into a rising line. The set piece: a controlled pause where everyone waits for a free kick to be taken. The lull: a deliberate decision to be quiet for fifteen seconds because the match has gone quiet. The break: the moment a counter-attack starts and Alex's voice pulls forward. The peak: a goal, a save, a red card.
Each move has a structure, a duration, a cadence, a distribution of who speaks. We wrote down a hundred and forty of them. The script agent doesn't compose lines from scratch; it picks moves, picks a target shape for each move, and improvises within it. Like jazz musicians playing standards.
A small taxonomy
- Build: 4–18 seconds. Alex leads, Maya threads context. Ends in a transition or a peak.
- Set piece: fixed duration. Alex names the players; Maya forecasts; both fall silent at the run-up.
- Lull: 6–30 seconds of intentional quiet. The booth resists the temptation to fill.
- Break: <3 seconds to engage. Voice pitches up, sentences shorten, Maya yields to Alex.
- Peak: a goal, a card, a save. Both voices, overlap allowed, prosody unlocked.
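The taxonomy above translates naturally into data. The sketch below encodes the five moves with the durations given in the list and leaves a field as `None` where the list gives no number. The `Move` fields, the trigger names `"counter_attack"` and `"free_kick"`, and the `pick_move` and `fit_duration` helpers are assumptions made for illustration; the real script agent's selection logic is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Move:
    name: str
    lead: str                                    # who carries the move
    duration_s: Optional[tuple[float, float]] = None  # (min, max) if stated
    engage_within_s: Optional[float] = None      # reaction deadline if stated
    overlap: bool = False                        # may both voices speak at once?


MOVES = {
    "build": Move("build", lead="Alex", duration_s=(4, 18)),
    "set_piece": Move("set_piece", lead="Alex"),  # "fixed duration", no number given
    "lull": Move("lull", lead="nobody", duration_s=(6, 30)),
    "break": Move("break", lead="Alex", engage_within_s=3),
    "peak": Move("peak", lead="both", overlap=True),
}

PEAK_KINDS = {"goal", "card", "save"}  # from the list's definition of a peak


def pick_move(event_kind: Optional[str], match_is_quiet: bool = False) -> Move:
    """Choose a move for the next stretch of air (hypothetical trigger names)."""
    if event_kind in PEAK_KINDS:
        return MOVES["peak"]
    if event_kind == "counter_attack":
        return MOVES["break"]
    if event_kind == "free_kick":
        return MOVES["set_piece"]
    if match_is_quiet:
        return MOVES["lull"]
    return MOVES["build"]


def fit_duration(move: Move, planned_s: float) -> float:
    """Clamp a planned duration into the move's target shape, if it has one."""
    if move.duration_s is None:
        return planned_s
    low, high = move.duration_s
    return max(low, min(high, planned_s))
```

The point of the split is the same as in the prose: the agent first picks a move, then improvises inside its bounds. A 25-second build gets trimmed to 18; a 2-second lull gets stretched to 6.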
The same match. Twelve booths. Twelve languages.
Once the booth is software, the question of "what language are we calling this match in?" stops being a budget question and starts being a feature flag. The state of the match is one thing. The voices are another. They run in parallel, on the same event stream, in any number of rooms.
Twelve simultaneous booths, each in a different language, each with its own pair of agents tuned for that audience, each shouting at the same goal at the same instant, and you, the listener, picking the one whose accent feels like home. This is, mostly, just a thing that ships, because we built it that way from day one.
One match. Twelve concurrent booths. Same state, different rooms. Listeners pick their voice the way they pick a seat.
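The "feature flag" framing can be made concrete with a small fan-out: one stream of match events, delivered unchanged to every enabled booth. This is a sketch under assumptions. `Booth`, `Match`, the voice naming scheme, and the language codes are hypothetical, and the loop delivers events one booth after another where a real system would presumably run the rooms concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class Booth:
    language: str            # e.g. "en", "pt", "de"
    voices: tuple[str, str]  # the pair of agents tuned for this audience
    heard: list[str] = field(default_factory=list)

    def on_event(self, match_second: int, kind: str) -> None:
        # A real booth would plan and voice a line here; this only records
        # that the event reached the room.
        self.heard.append(f"{match_second}:{kind}")


class Match:
    def __init__(self, enabled_languages: list[str]) -> None:
        # The list of enabled languages plays the role of the feature flag.
        self.booths = {
            lang: Booth(lang, (f"play_by_play_{lang}", f"analyst_{lang}"))
            for lang in enabled_languages
        }

    def publish(self, match_second: int, kind: str) -> None:
        """Deliver one event from the shared stream to every booth."""
        for booth in self.booths.values():
            booth.on_event(match_second, kind)
```

Adding a language is then a one-item change to the list passed to `Match`; the match state and the event stream do not change.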
Sport is the wedge. The booth is the product.
We started with football because football is the hardest live event in the world to call well: fluid, low-scoring, narratively dense, religiously watched, played in a hundred languages on the same Saturday. If we can do football, we can do anything that has a clock and a feed. And almost everything worth watching has a clock and a feed.
The mission is not to be a sports company. The mission is to be the booth. Sport is where we prove the booth can hold its own under the hardest live conditions humans have invented for it. After that, the booth follows the audience.
A non-exhaustive list
- Esports: every minor tournament that can't afford a casting team. The 4 a.m. Korean qualifier nobody watches because nobody's calling it.
- Markets: earnings calls narrated as they happen, an analyst voice on every ticker, dropping out and tuning in like radio.
- Politics: the live floor of a parliament whose language you don't speak, called by a booth that does.
- Conferences: keynotes with a second-screen booth pointing out, in real time, what's actually new.
- Auctions, debates, races, regattas, lectures. Anything where two people sitting somewhere watching could turn motion into meaning.
The first season is football. The seasons after that are everything else.
What we will never do.
This kind of system is easy to misuse. We've thought a lot about how. The list below is short on purpose. These are not gotchas; they are conditions of the work.
What we don't know yet.
The honest part. There are things we are confident in. There are things we are deliberately quiet on. There are things we genuinely cannot predict, and we'd rather say so than pretend.
- Will listeners come to love an agent voice the way they love a human one? We think yes. The early evidence is positive. We don't know the size of the effect.
- How does the booth feel during a tragedy? A serious injury, a fan incident, a moment of silence. The booth needs reflexes for these, and we are still tuning them.
- Will leagues see this as a partner or a threat? We are partnering with the ones who see it the first way. The conversation is real and ongoing.
- What happens to human broadcasters? Honest answer: the great ones are not threatened. The middle of the market is. The total amount of broadcast in the world goes up by orders of magnitude. We think this is a net good. We are watching it carefully.
If you have an answer to any of these, or a better question, write to us. The manifesto is a draft.