Nicolo' Brandizzi

Poolside's Model Factory: Honest Engineering, and the Data It Can't Show You

Nicolo' Brandizzi — Wed, 27 May 2026 00:00:00 GMT

TL;DR; A lab now headquartered in Europe (Poolside) shipped a seriously strong open model, XS.2, in five weeks, and unlike most “Teaser Papers,” they actually showed us the assembly line: the Model Factory, end-to-end lineage, hash-checked distributed training. This is the opposite of the open-washing I complained about last time. The one thing they don’t reveal, where the data comes from, isn’t them being cagey. It’s the visible scar of a regulatory regime that forces a European champion to choose between being commercially viable and being scientifically open. You can’t do both here. That’s the real story.

📍 Originally published at nicolobrandizzi.com.

A nice surprise, for once

A while back I wrote an angry post about “Teaser Papers”: gorgeous PDFs from Big Tech that read like science but behave like advertisements. Beautiful results, poetic intuition sections, and then a single line: “we trained on a high-quality dataset of N trillion tokens.” End of paragraph. I called it a denial-of-service attack on academic research, and I stand by every word.

So you can imagine my reaction when I sat down with the XS.2 technical report (the “Laguna” report) from Poolside¹, and found… the opposite. XS.2 is a Mixture-of-Experts model built for long-horizon agentic coding (33.4B parameters total, ~3B active), with open weights released under Apache 2.0². They built it in five weeks³ and then spent pages telling you how the machine works, not just how well it scored.

I want to talk about it, because it’s the kind of report I keep asking for. And because the one place where it goes quiet is, for once, not the lab’s fault, and that’s worth a whole section on its own.

(Full disclosure on bias: I’m an EU-based AI researcher, my PhD is in reinforcement learning and language modeling⁴, and I’m congenitally happy whenever a European lab does something good. Calibrate accordingly.)

The part European labs should photocopy: the Model Factory

The thesis of the report is almost boring in how sensible it is: treat foundation-model development as an industrial process. They call their stack the Model Factory: a set of components and pipelines whose entire job is to automate the plumbing so the researchers can spend their scarce hours on actual research questions instead of babysitting infrastructure.

Two things in here are, to me, the whole game.

1. End-to-end lineage. Every change (data, config, code) is tracked with full provenance. This sounds like a DevOps footnote. It is not. When you track everything, your ablations fall out of the experiment log instead of being a separate, soul-crushing campaign of re-runs. Even better: you can come back months later and ask questions you didn’t know to ask at the time, and answer them with proper statistical tools instead of squinting at a graveyard of spreadsheets. This is the difference between a lab that has data about itself and a lab that merely produces data and then forgets it. Honestly, if you take one thing from the report, take this.

2. They’re transparent about distributed training. Optimization is the part everyone does and almost nobody writes about; it usually gets dismissed as plumbing, beneath the real science. Poolside spends real ink on it, including a detail I loved: hash checks to catch silent (and non-silent) failures in large-scale runs.⁵ If you’ve ever had a multi-node run quietly corrupt a shard and poison a checkpoint without throwing a single error, you know this unglamorous paragraph is worth more than half the benchmark tables in the field. Their custom systems work has clear market value, and they shared it anyway.
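To make the hash-check idea concrete, here is a minimal sketch of one way such a check can work: every replica hashes its copy of a checkpoint shard, and the run refuses to continue if the digests disagree. This is not Poolside's implementation, which the report does not describe at this level of detail. The function names (`shard_digest`, `replicas_agree`) and the byte strings standing in for shards are assumptions made for illustration.

```python
import hashlib


def shard_digest(shard: bytes) -> str:
    """Return the SHA-256 hex digest of one serialized shard."""
    return hashlib.sha256(shard).hexdigest()


def replicas_agree(shards: list[bytes]) -> bool:
    """True when every replica holds a byte-identical copy of the shard."""
    digests = {shard_digest(s) for s in shards}
    return len(digests) == 1


# Hypothetical stand-ins for the same shard as seen by three replicas.
good = b"\x00\x01\x02\x03" * 1024
flipped = bytearray(good)
flipped[100] ^= 0x01  # one silently flipped bit; nothing raises an error

healthy_run = replicas_agree([good, good, good])
corrupted_run = replicas_agree([good, bytes(flipped), good])
```

The point of the design is that the corrupted replica never throws. The only signal is the digest mismatch, so the check has to be run deliberately, for example before a checkpoint is accepted.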

This is exactly what I begged for in the SOTA Trap post: mechanism, not just metrics. Recipe, not just the finished meal. Poolside delivered the recipe. Credit where it’s due.

The one thing they don’t share, and why I’m not mad about it

Here’s the gap. The report is detailed about the machinery of data curation, including a clean decomposition of document quality into independently learnable properties (a noise axis and an information axis) recombined into a composite score.⁶ But it stays quiet on where the training data actually comes from.

In my old post, that silence would have earned a full rant. Here it doesn’t, because the situation is different.

A US or Chinese lab that stays vague about data origins is usually choosing opacity as strategy: commoditize the complement, keep the recipe, win. A European lab faces a different incentive structure. Under the current European regime (the AI Act’s transparency duties for general-purpose models, plus a copyright framework built around text-and-data-mining opt-outs⁷), spelling out your sources mostly buys you copyright litigation. A lab established in the EU sits squarely in that blast radius; a competitor in San Francisco has a fair-use defense to fall back on and feels it far less, even though the Act reaches it too once it serves the EU market.

So Poolside did the responsible thing inside the box it’s stuck in: it opened up the process, which advances the science, and stayed careful about the provenance, which would invite a lawsuit. I can’t be angry at a company for refusing to fall on a sword only European companies are asked to fall on.

Which brings me to who I am a little angry at. But first, some gifts.

Ideas the report sparked

Reading a good report is generative; it makes you think. So here are a handful of directions XS.2 nudged me toward. None of these are criticisms; they’re the things I’d be excited to argue about over coffee.

Make the data mixture adaptive, not static. Poolside’s AutoMixer learns a surrogate model mapping data-mixture proportions to downstream capability metrics, then optimizes the mixture over that learned surface.⁸ It’s elegant. My one itch is that the mixture it produces is essentially static: solved once, decoupled from the model’s current knowledge state. But what a model needs from its data at 200B tokens isn’t what it needs at 2T tokens. I’d love to see the mixture recomputed periodically on cheap distilled proxy models during the run, treating data mixing as curriculum scheduling rather than a one-shot allocation. This connects to something the report itself observes: once tokens are abundant, the bottleneck shifts from “maximizing precision under scarcity” to “controlling repetition and diversity under long-horizon training.”⁹ When you have enough tokens, the point in training at which you feed something starts to matter as much as whether you include it at all. That’s a curriculum problem hiding in a data-mixing coat. Geometrically, AutoMixer optimizes a single point on the data-mixture simplex (x ∈ Δᵈ in the paper); a curriculum is a path across that simplex, which is what the figure below traces, with training time as the vertical axis.

[Interactive figure: the data-mixture simplex drawn as a triangle over three buckets (Web, Code, Education), with a Static / Adaptive toggle. In the Static view (AutoMixer), the mixture is a single point. In the Adaptive view, a training-time axis is extruded so the point lifts into 3D and travels along a path through the triangle over time.]

Illustrative: the corners are three example data buckets and the path shape is invented, not Poolside's published mixture.
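The point-versus-path distinction can also be written down in a few lines. The sketch below is purely illustrative: the three bucket names and every proportion are invented, and the "curriculum" is a plain linear interpolation between two mixtures rather than anything recomputed from proxy models, which is what an adaptive scheme would actually need.

```python
def is_on_simplex(weights: dict[str, float], tol: float = 1e-9) -> bool:
    """A mixture is valid when all proportions are non-negative and sum to 1."""
    total = sum(weights.values())
    return all(w >= 0 for w in weights.values()) and abs(total - 1.0) <= tol


def static_mixture(progress: float) -> dict[str, float]:
    """One fixed point on the simplex, identical at every training step."""
    return {"web": 0.6, "code": 0.3, "education": 0.1}  # invented numbers


def curriculum_mixture(progress: float) -> dict[str, float]:
    """A path across the simplex: interpolate between two mixtures.

    `progress` is the fraction of the token budget consumed, in [0, 1].
    A convex combination of two simplex points stays on the simplex.
    """
    start = {"web": 0.7, "code": 0.2, "education": 0.1}  # invented numbers
    end = {"web": 0.3, "code": 0.5, "education": 0.2}    # invented numbers
    t = min(max(progress, 0.0), 1.0)
    return {k: (1 - t) * start[k] + t * end[k] for k in start}


# Five evenly spaced points along the path, from 0% to 100% of training.
path = [curriculum_mixture(step / 4) for step in range(5)]
```

The static function ignores `progress` entirely, which is the whole difference: the curriculum version returns a different valid mixture at each point in training.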

Multi-objective RL might finally be production-ready. Their reward-design section made me sit up. I spent my PhD on reinforcement learning, and “multi-objective RL” has long lived in the “great in theory, cursed in practice” drawer. Seeing serious reward engineering at this scale makes me wonder if it’s time to drag it out of that drawer and treat multi-objective RL as something that genuinely works in production, not just in a paper with three toy environments.

Welcome back, centralized-vs-decentralized RL. I had to laugh at the trainer-to-inference weight sync machinery. We are, in a very real sense, back to the old centralized/decentralized RL debates, except this time the axis isn’t algorithmic elegance, it’s GPU efficiency. Everything old is new again, just more expensive.

And the one I can’t help wanting: theory of mind. This is my bias showing, but the half of my thesis that wasn’t RL was theory of mind: a model’s ability to represent and match the understanding and context of the human it’s working with.¹⁰ For a model whose whole purpose is collaborating with developers, I’d have loved to see even a gesture toward ToM in the agent design. Matching the agent’s model of the task to the human’s model of the task is, I’d argue, a defining quality of a genuinely useful agent, not just a capable one. Maybe next report.

(Poolside, if you’re reading this: those are free. You know where to find me. 😉)

The real subject: the false choice

Now, the part I’m actually annoyed about, and notice it isn’t pointed at Poolside.

European researchers and European companies are being forced onto two diverging tracks, moving at completely different speeds:

The open track: document your sources in enough detail to expose yourself, honor every machine-readable opt-out, and absorb the copyright risk that comes with it. You can mitigate it (filtering, takedown pipelines, opt-out detection), but that’s months of work and lawyers a startup racing to ship in five weeks does not have.
The viable track: stay quiet about provenance, ship, survive, and accept that your “science” is now only partly falsifiable from the outside.

The asymmetry comes down to one thing: a fair-use defense. In the US, a lab can argue that training on copyrighted material is fair use. That argument is unsettled and being fought hard right now (billion-dollar settlements, the New York Times suing OpenAI),¹¹ but it is a real legal shield, and Europe offers no equivalent. Under EU copyright law there is no fair-use escape hatch: if a rightsholder reserves their work you have to honor it, and the AI Act makes you publish a summary of what you trained on.

Those disclosure rules do reach any lab that sells a model into the EU, so this is not Europe binding only its own. But a European lab is the most exposed of all: based where enforcement is closest, open to local copyright suits, and with no fair-use defense to fall back on. That pressure splits European AI in two: publicly funded efforts that can afford to be fully open, and commercial labs that stay closed to survive. I keep coming back to that divide in these posts, where the two sides lean in different directions even while sharing the same public compute.

Let me be fair to the EU here. GDPR, the AI Act’s transparency rules, and the copyright opt-out all come from a defensible place: protecting people’s privacy, giving creators a real say over whether their work trains a model, and pulling AI out of the black box so the public and rightsholders can see what went in. The goal is to protect citizens and creators, and that goal is right. My argument is narrower: the net effect on European builders cuts against the very competitiveness Europe keeps saying it wants.

You cannot fault a company for choosing to survive. You can fault a structure that makes openness commercially suicidal for its own champions, and then wonders why those champions go quiet, or get acquired, or relocate.

I want more European labs like Poolside, not fewer. That is why the regulatory environment around them needs to stop forcing this trade-off.¹² We are building out AI Factories to give Europe compute. Good. Now we need a data-and-disclosure regime that lets a European lab be both fast and open without betting the company on it. Otherwise we will keep producing exactly this: excellent work, transparently engineered, with a polite silence where the most scientifically interesting paragraph should be, and we’ll keep blaming the labs for a choice we made for them.

Closing

The XS.2 report is one of the more honestly engineered reports I’ve read in a while. It gave me the mechanism I keep demanding from the field, it sparked half a dozen research arguments I’d happily lose, and it was transparent about everything its lawyers would allow.

To be fair about the word “honest,” though: the model that impressed me most on openness was Apertus, the Swiss LLM from EPFL, ETH Zurich and CSCS that released its training data and full recipe, not just its weights. That contrast is the whole point, and no knock on Poolside. Apertus could open its data because it did the slow compliance work a publicly-funded academic project has time for: filtering to respect opt-outs, stripping personal data, documenting everything. A company shipping in five weeks does not have that runway. Same continent, two sets of rules, and only one of them gets to be fully open.

Fix that asymmetry, and Europe might have a shot.

As always: if I got something wrong, or you want to fight me about adaptive data mixtures, theory of mind, or the AI Act, tell me. I can’t get better without feedback.

Notes

1. Poolside was founded in 2023 in San Francisco by Jason Warner (ex-GitHub CTO) and Eiso Kant; it later relocated its headquarters to Paris and is backed by, among others, French investor Xavier Niel. So “European” here means established and operating in the EU, which (as we’ll see) is exactly what puts it in the regulatory crosshairs, not “European-born.”
2. XS.2 weights are on Hugging Face under Apache 2.0. The report positions it as “competitive with state-of-the-art open models in their respective weight classes”, so “strong open model,” not “GPT-5 at home.” Credit to them for shipping the weights, not just the benchmarks.
3. Their words: “Building XS.2 from inception to delivery in five weeks was only possible because we treat foundation model development as an industrial process.” Five weeks. I’ve spent longer than that fighting a single SLURM queue.
4. For the curious: Conversational agents in human-machine interaction: reinforcement learning and theory of mind in language modeling. Yes, that “theory of mind” bit becomes relevant later.
5. “Critically, we found hash checks important to prevent silent and non-silent failures of large-scale training runs.” The word “silent” is carrying a lot of trauma.
6. A “noise axis” capturing whether a document is mostly junk, and an “information axis” capturing educational / informational / pre-training value, recombined into a composite contribution score. Clean idea. I’d still love to see the labeling distribution (maybe a v2?).
7. The relevant pieces are the EU AI Act’s transparency requirements for general-purpose AI (including a public summary of training content via the Commission’s template, with GPAI obligations applying from August 2025) and the DSM Directive’s (2019/790, Art. 4) text-and-data-mining exception with rights-holder opt-out. One honest caveat: the training-summary duty is extraterritorial; it applies to anyone placing a GPAI model on the EU market, not only EU providers. So the asymmetry I’m describing is really about copyright-litigation exposure and enforcement reach landing hardest on the company with a European address, not a clean EU-only obligation. I’m compressing a lot of nuance here.
8. Formally: learn a surrogate ℳ from data mixtures to downstream evaluation metrics, then optimize the mixture against the learned surrogate. The static-vs-adaptive trade-off is real (recomputing burns compute), but proxy models are cheap, and the prize (less wasted training) is large.
9. Their framing: “The challenge shifted from maximizing precision under scarcity to controlling repetition and diversity under long-horizon training.” This is one of the most quietly important sentences in the report.
10. I will take any excuse to talk about theory of mind in agents, and I’m not sorry.
11. The Bartz v. Anthropic author class action settled for $1.5 billion, and The New York Times v. OpenAI remains in active litigation through 2026. US fair use for AI training is a real defense, but far from settled.
12. A contested read, worth flagging. The Draghi report argues the EU regulatory stack hampers innovation; others counter that capital access and market fragmentation, not the AI Act, are the binding constraint. The EU’s May 2026 Digital Omnibus has already simplified and delayed parts of the Act, so the regime is moving, not frozen.

KlarText: Engineering Agents That Know When to Stop

Nicolo' Brandizzi — Tue, 26 May 2026 00:00:00 GMT

TL;DR; At Fraunhofer IAIS I built KlarText, a multi-agent system that translates dense German administrative text into Leichte Sprache (Easy Language). Any modern LLM can paraphrase, so the translation is not the interesting part. The engineering problem is making a loop of autonomous agents that terminates instead of spiralling, judges itself with code rather than vibes, and records every decision so a human can audit what changed. This post is about those three constraints, plus a section on how I actually evaluated whether it works.

📍 Originally published at nicolobrandizzi.com.

The problem nobody can opt out of

In Germany, Leichte Sprache (Easy Language) is not a nice-to-have. Federal authorities are legally required to provide it under §11 of the Behindertengleichstellungsgesetz, with accessibility duties layered on by BITV 2.0 and the Onlinezugangsgesetz.¹ ² ³ And they are failing: the 2025 federal monitoring report found that only 27.6% of public-sector websites offered any simple-language content, and 0% met full BITV 2.0 compliance.⁴ (There is no fine for getting this wrong, which is part of why the numbers stay this bad. The pull here is ethical and reputational, not a penalty.)

The reason it stays unsolved is cost. Human translation runs roughly €70 to €200 per Normseite (1,800 characters), because it includes a review step by people with cognitive disabilities, the Prüfgruppe.⁵ A typical site is dozens of those pages, and the content keeps changing. So the obvious move is “throw an LLM at it.” Except the German federal accessibility body has pushed back on exactly that: generic AI tools, they argue, work at the language level but do not handle structure and content, offer no compliance traceability, and cannot replace human validation.⁶

That critique is the actual spec. It is not “translate well.” It is “translate well, and prove it, and do not pretend the human is gone.” That reframing is what turned this from a prompt into a systems-engineering problem.

Why one prompt was never going to work

The naive version is a single megaprompt: “You are a Leichte-Sprache expert. Follow these 28 rules. Keep all the facts. Output clean text.” It fails in a predictable way, because you are asking one model, in one pass, to optimise two orthogonal objectives:

Style compliance. Short sentences, no compound words, active voice, no subjunctive, no foreign words, and a couple of dozen more rules.
Factual retention. The simplified text must still say the same things as the original.

These trade off against each other. The easiest way to satisfy every style rule is to throw information away, so a translation can be stylistically perfect and factually empty. The easiest way to retain every fact is to keep the complex sentence structure. Collapse both axes into one number (“is this a good translation? 0.78”) and you can never debug it, because when the score drops you have no idea which axis broke.

So I split the work across four agents, each with exactly one job.

The four-agent loop

| Agent | Cares about | How it decides |
| --- | --- | --- |
| Translator | Following the system prompt | One LLM call. An optional best-of-N mode generates several candidates and selects among them. |
| Supervisor | Style compliance | An LLM agent, but one anchored to ~30 deterministic rule scores rather than its own taste. It reads the rule report, applies the thresholds, and gates the loop. |
| Questioner | What was in the original | Generates comprehension questions (by default 5 multiple-choice and 2 open-ended, both configurable) from the original only. It never sees the translation. |
| Answerer | Whether the translation still answers them | Answers those questions from the translation only, scored by embedding similarity against the keyed answers. |

The two gates are independent and catch different failures. The Supervisor catches “this is not Leichte Sprache.” The Questioner and Answerer together catch “this is beautiful Leichte Sprache that quietly dropped the deadline, the amount, and who to call.” That second gate is the clever bit: the Answerer is graded against an artefact it never saw. It is not asked “is this a good translation,” it is asked a factual question and simply tries to answer it from the output. If the answer is gone, the score falls. That is a measurement, not an opinion, and it is the antidote to LLM-as-judge, where a model grades its own work and converges on confident, agreeable nonsense.
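The retention gate can be sketched as a plain similarity computation. This is not the KlarText source. `toy_embed` is a bag-of-words stand-in for whatever embedding model is really used, and the names `cosine` and `retention_score` are hypothetical. What matters is the shape: keyed answers come from the original, given answers come from the translation, and the score is computed, not judged.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: lowercase bag-of-words counts (an assumption)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retention_score(keyed: list[str], answered: list[str],
                    embed: Callable[[str], Vector] = toy_embed) -> float:
    """Mean similarity between keyed answers and the Answerer's answers."""
    if len(keyed) != len(answered):
        raise ValueError("one answer is required per keyed answer")
    if not keyed:
        return 0.0
    sims = [cosine(embed(k), embed(a)) for k, a in zip(keyed, answered)]
    return sum(sims) / len(sims)


keyed_answers = ["bis zum 31. März", "50 Euro"]   # derived from the original
faithful = ["bis zum 31. März", "50 Euro"]        # the translation kept both facts
lossy = ["keine Angabe", "50 Euro"]               # the deadline was dropped

faithful_score = retention_score(keyed_answers, faithful)
lossy_score = retention_score(keyed_answers, lossy)
```

With a real embedding model the numbers would be softer than this toy version, but the mechanism is the same: a dropped fact shows up as a lower score on the specific question that asked for it.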

Both gates feed retry loops, and the whole point is that those loops are nested and bounded.

A rejection is never a blind retry. When the Supervisor rejects a draft, the candidate goes back to the Translator together with the Supervisor’s rule-by-rule report, so the next attempt is aimed at the exact rules that failed rather than rerolling the dice. When the whole loop passes (or gives up), the output is not just the simplified text: it is the text plus the rule-score breakdown and the readability score, which is the transparency the whole design is built around.

Hard part #1: a loop that actually stops

The moment you let agents retry each other, you have invented a way to spend unbounded money and time. “Refine until good” is a lovely sentence and a production incident waiting to happen: the Supervisor rejects, the Translator tries again, the Supervisor rejects again, forever, because some rules are inherently noisy and no rewrite will ever hit a perfect score.

The discipline that makes this safe is deliberately boring:

Two thresholds, not one. Each rule produces a compliance score between 0 and 1. The two thresholds decide what happens when a rule scores low. Below the retry threshold (default 0.95), the rule counts as a genuine failure and the whole draft is sent back for a rewrite. In the band just above it, below the warn threshold (default 0.98), the rule is merely flagged in the report but allowed through. Above that, it passes silently. Both thresholds are user-settable, like every knob here. The warn band exists because some rules are inherently noisy: without it, the system would keep rewriting to chase a hundredth of a point that means nothing.
Hard caps on both loops. The style loop runs at most 3 times, the retention loop at most 3 times, so the worst case is nine model rounds. Then the system stops, returns the best candidate it found, and attaches the failing scores so a human knows exactly where it gave up. Exhaustion is a documented outcome, not a crash, and it is also what bounds the cost per document: the loop physically cannot run away.
Generate the questions once per outer loop, not per attempt. This one is subtle. If you re-roll the comprehension questions on every retry, a weak translation can eventually luck into an easy question set and “pass.” Freezing the questions for the whole outer loop removes that escape hatch.
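The three disciplines above can be sketched as a single control loop. This is an illustration under assumptions, not the KlarText code: the callables (`translate`, `score_rules`, `make_questions`, `retention`), the `min_retention` cut-off of 0.8, and the "highest worst-rule score" definition of the best candidate are all hypothetical. The thresholds and the caps of 3 are the defaults stated above.

```python
from typing import Callable

RETRY_THRESHOLD = 0.95      # below this, a rule forces a rewrite
WARN_THRESHOLD = 0.98       # below this (but above retry), flag only
MAX_STYLE_ROUNDS = 3        # inner loop cap
MAX_RETENTION_ROUNDS = 3    # outer loop cap


def classify(score: float) -> str:
    """Map one rule score to retry / warn / pass using the two thresholds."""
    if score < RETRY_THRESHOLD:
        return "retry"
    if score < WARN_THRESHOLD:
        return "warn"
    return "pass"


def run_loop(source: str,
             translate: Callable[[str, dict[str, float]], str],
             score_rules: Callable[[str], dict[str, float]],
             make_questions: Callable[[str], list[str]],
             retention: Callable[[list[str], str], float],
             min_retention: float = 0.8) -> dict:
    questions = make_questions(source)  # frozen: generated once, never re-rolled
    best = {"text": "", "worst": -1.0, "failing": {}}
    feedback: dict[str, float] = {}
    rounds = 0
    for _ in range(MAX_RETENTION_ROUNDS):        # outer loop: the facts gate
        draft = ""
        for _ in range(MAX_STYLE_ROUNDS):        # inner loop: the style gate
            rounds += 1
            draft = translate(source, feedback)  # retry carries the failed rules
            scores = score_rules(draft)
            feedback = {r: s for r, s in scores.items() if classify(s) == "retry"}
            worst = min(scores.values(), default=1.0)
            if worst > best["worst"]:
                best = {"text": draft, "worst": worst, "failing": dict(feedback)}
            if not feedback:
                break
        if not feedback and retention(questions, draft) >= min_retention:
            return {"status": "done", "text": draft, "rounds": rounds, "failing": {}}
    # Exhaustion is a normal outcome: return the best draft plus what failed.
    return {"status": "exhausted", "text": best["text"],
            "rounds": rounds, "failing": best["failing"]}
```

Because both `for` loops have fixed ranges, the translator can be called at most 3 × 3 = 9 times, whatever the scores do. One simplification to note: in this sketch a retention failure sends no specific feedback to the translator, which a real system would probably want to add.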

None of this is glamorous. All of it is the difference between a demo and something you would let near a public budget.

Hard part #2: a judge made of code, not vibes

The accessibility body’s middle complaint, no compliance traceability, is the one that is hardest to fake and most valuable to solve. Every competitor I looked at returns simplified text. None of them tell you which rules they satisfied and by how much. KlarText does. The Supervisor is an LLM agent, but it does not eyeball the draft and emit a gut feeling; its verdict is anchored to a deterministic rule report, and its prompt explicitly tells it to defer to those numbers rather than to its own taste. The rules themselves are roughly 30 plain functions running over a spaCy parse of the German text, grouped into words, sentences, texts, and numbers. Each returns a score and the evidence behind it. A few, just to give the flavour:

Verbs over nouns. German officialese nominalises everything (“die Beantragung der Genehmigung”). The rule scores the ratio of verbs to nouns; more verbs reads as more human.
Split complex sentences. It scores the fraction of sentences with no subordinate clause, the nested dass / weil / obwohl constructions you have to read twice.
No compound words. German welds nouns into monsters like Bundesausbildungsförderungsgesetz, a real barrier for the target reader. A dedicated detector flags them and scores the fraction of words that stay simple.

The rest of the catalogue is in the same spirit: short sentences, no subjunctive, positive phrasing, no idioms, plain dates and numbers. The point is that “no compound words” is a function over a string, not a feeling. When the score moves, you can point at the exact words that moved it, which is what makes the result defensible to a regulator and mappable onto DIN SPEC 33429, the first consolidated German standard for Easy Language.⁷
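To show what "a function over a string" means in practice, here is a deliberately crude sketch. It is not one of the KlarText rules, which run over a spaCy parse. It uses word length as a stand-in for compounding, which is an assumption that a real compound detector would not make. The name `no_long_words`, the `RuleResult` type, and the 15-character cut-off are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class RuleResult:
    score: float          # fraction of words that pass, between 0 and 1
    evidence: list[str]   # the exact words that lowered the score


def no_long_words(text: str, max_len: int = 15) -> RuleResult:
    """Crude proxy for a compound-word rule: flag very long words.

    Assumption: length stands in for compounding. A real detector would
    inspect the parse rather than count characters.
    """
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    if not words:
        return RuleResult(score=1.0, evidence=[])
    flagged = [w for w in words if len(w) > max_len]
    return RuleResult(score=1 - len(flagged) / len(words), evidence=flagged)


result = no_long_words("Das Bundesausbildungsförderungsgesetz hilft Studenten.")
```

The return value carries both halves of the contract described above: a number for the loop to gate on, and the offending words for a human to point at.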

The part I am most proud of: making the scores comparable

There is a trap hiding in “just average all the rule scores.” The rules live on completely different scales. Some sit near the top for almost any text (Roman numerals are rare everywhere). Some barely move at all (the verb-to-noun ratio is about the same whether a text is simple or not). Some swing across the whole range. Average them raw and the loud rules drown out the quiet ones, and a real improvement on a quiet rule looks like nothing.

The fix is to stop grading each rule against an abstract scale and start grading it against how it normally behaves on ordinary German. I ran roughly 17,600 everyday German documents through the rule engine once, to learn what a typical score for each rule looks like. After that, a new score is not reported as a bare “0.4,” it is reported as “unusually high” or “about average” for that rule. Scores also get discounted when there is little text to judge, because a verdict drawn from two sentences should not weigh as much as one drawn from two pages.

The result is a report where “+0.2 here” means roughly the same thing as “+0.2 there.” That is what lets thirty very different rules be combined into one honest verdict, and lets a reviewer read it without a statistics degree.
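One way to picture the normalisation is as a percentile rank against the baseline corpus, pulled toward "about average" when there is little text. The text above does not say which statistic KlarText uses, so treat everything here as an assumption: the percentile rank, the shrinkage constant `k`, the 0.1 / 0.9 label cut-offs, and the tiny invented baselines.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score: float, baseline: list[float]) -> float:
    """Share of baseline scores below `score`, counting ties as half."""
    ordered = sorted(baseline)
    below = bisect_left(ordered, score)
    at_or_below = bisect_right(ordered, score)
    return (below + at_or_below) / (2 * len(ordered))


def shrink_toward_typical(rank: float, n_sentences: int, k: int = 10) -> float:
    """Pull the rank toward 0.5 when the verdict rests on little text."""
    weight = n_sentences / (n_sentences + k)
    return 0.5 + weight * (rank - 0.5)


def label(rank: float) -> str:
    """Turn a rank into the plain-language verdict shown to a reviewer."""
    if rank >= 0.9:
        return "unusually high"
    if rank <= 0.1:
        return "unusually low"
    return "about average"


# Invented baselines: a rule that barely moves and a rule that swings widely.
quiet_rule = [0.40, 0.41, 0.42, 0.43, 0.44]
wide_rule = [0.10, 0.30, 0.50, 0.70, 0.90]

quiet_rank = percentile_rank(0.44, quiet_rule)
wide_rank = percentile_rank(0.44, wide_rule)
```

The same raw score of 0.44 is near the top of what the quiet rule ever produces and unremarkable for the wide one, which is exactly why averaging raw scores misleads.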

Hard part #3: traceability, or what the agents actually did

When four agents pass messages and retry each other, “it produced a weird output” is an un-debuggable bug report unless the system records itself. Every run writes a timestamped, uniquely-named folder with four files:

trace.md: the full agent-to-agent interaction,
prompt.md: every prompt actually sent to a model,
system_prompt.md: the system prompt each agent started with,
translation.txt: the final translated text.

Underneath, each job moves through an explicit sequence of phases:

drafting → evaluating → questioning → answering → done (or failed)

carrying the iteration counter, both score sets, and an event log as it goes. The UI subscribes to that and renders a live trace panel, so you watch the loop think in real time, and the comprehension questions and answers are surfaced to the user as a window into the “facts not dropped” gate.
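A minimal sketch of the run folder and the phase sequence follows. The four file names and the phase names come from the description above. The folder naming scheme (UTC timestamp plus a short random suffix), the `write_run` helper, and the placeholder file contents are assumptions.

```python
import tempfile
import uuid
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path


class Phase(Enum):
    DRAFTING = "drafting"
    EVALUATING = "evaluating"
    QUESTIONING = "questioning"
    ANSWERING = "answering"
    DONE = "done"
    FAILED = "failed"


def write_run(root: Path, trace: str, prompts: str,
              system_prompts: str, translation: str) -> Path:
    """Create a timestamped, uniquely named run folder holding the four files."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = root / f"{stamp}-{uuid.uuid4().hex[:8]}"  # naming is an assumption
    run_dir.mkdir(parents=True)
    files = {
        "trace.md": trace,
        "prompt.md": prompts,
        "system_prompt.md": system_prompts,
        "translation.txt": translation,
    }
    for name, body in files.items():
        (run_dir / name).write_text(body, encoding="utf-8")
    return run_dir


# A successful job walks the phases in order; the event log becomes trace.md.
events = [Phase.DRAFTING, Phase.EVALUATING, Phase.QUESTIONING,
          Phase.ANSWERING, Phase.DONE]
trace_body = "\n".join(f"- {phase.value}" for phase in events)
run_dir = write_run(Path(tempfile.mkdtemp()), trace_body,
                    "(prompts sent)", "(system prompts)", "Einfacher Text.")
```

Writing Markdown and plain text, not an opaque binary log, is what makes the folder readable by a reviewer and, later, by another agent.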

This is the part people skip and regret. An agent system without a trace is a black box that occasionally lies. With the trace, every rejection is attributable: this rule, this score, this retry, this prompt.

P.S. I fed the agents their own diary 🔁

Here is the fun part. Because the trail is structured .md rather than a wall of free text, it is also a dataset. Point an audit agent at a stack of trace.md and prompt.md files and it will hand back grounded, specific notes: which rule keeps tripping the loop, which prompt drifts on retry, where the nine-iteration budget actually goes. The system that writes the logs becomes the raw material for improving the system that wrote them. Slightly uncanny, genuinely useful, and the only reason it works is that the logs were structured from day one.

Why I cared about the interface

An autonomous agent loop is invisible by default. It runs, it spends tokens, and it hands back an answer with no account of how it got there. For a research demo that is fine. For something a public agency might actually adopt it is fatal, because the people who have to sign off, the reviewers and the compliance officers and the manager holding the budget, do not trust what they cannot see. A great backend with no surface is not a product. The interface is where an agent system stops being a black box and becomes something a non-engineer can reason about, question, and ultimately trust.

So KlarText is not a script, it is an app. You paste or upload a document, pick a model, and you immediately get the boring-but-useful facts up front: word count, characters, an estimated token cost, reading time.

On launch: pick a model, drop in a document, and see what you are about to process before you spend a token on it.

Then you hit start and watch it think. The loop does not run behind a spinner, it narrates itself: which agent is active, which iteration it is on, and a best-so-far readability score that ticks upward after every pass.

One run, three moments: the Questioner writing comprehension questions, the Translator revising on feedback (iteration 1), and the Supervisor checking compliance (iteration 3 of 3). The best-so-far readability score climbs underneath as it goes.

When it finishes you get the simplified text and, beside it, three headline numbers that are really the three gates from this whole post made visible: a readability score, a comprehension score (the QA-retention gate, “can the questions still be answered”), and a rule-compliance score. Everything, including the full run logs, is downloadable.

The result: the simplified text up top, then readability, comprehension, and rule-compliance scores, each with the change against the original.

And if a headline number is not enough, you open it up. Every rule category expands into its individual rules, each with its score and its delta, so a reviewer can go from “is this compliant?” straight to “which rule, by how much, and compared to what.”

Drilling into the detailed evaluation: every rule, its score, and how it moved. This is the "compliance traceability" the regulator asked for, made literal.

For a compliance tool this is not decoration. The visible, inspectable rule scores are the one thing the competitors do not show, so the interface is where that advantage actually lands.

Does it actually work? Evaluation

A transparency story is worthless if the output is bad, so I evaluated it against Klexikon, a corpus built from the German children’s encyclopedia that pairs standard articles with human-simplified versions.⁸ But the whole evaluation rests on one assumption worth saying out loud: that the step from a Klexikon article to its simplified version is the direction we want KlarText to move in. That is only partly true. Klexikon is written for children, and easy language for children is not the same thing as Leichte Sprache for adults with cognitive disabilities. A question in a kids’ article is a good thing because it invites the reader to think, whereas Leichte-Sprache guidance treats questions much more cautiously. So whenever the system “loses” to the human reference, it is genuinely ambiguous whether the system is wrong or the reference simply belongs to a slightly different register. I learned to read every number below with that asterisk attached.

With that caveat in hand, here is how the machine output stacked up against the human references, rule by rule:

The system beat the humans on the structural rules (splitting complex sentences, sentence fragments, positive phrasing, avoiding the subjunctive) and lost on word-level choices: preferring verbs over nouns, and simple-word choice (the latter only after I fixed a bug in the rule itself). On the rest it was roughly tied. That is a system that is good at surface structure and still weaker than a human on word judgement, which is about where you would expect a rule-guided LLM to land.

I also tried three German readability formulas as an overall sanity check: the Flesch reading-ease in its German adaptation,⁹ the Wiener Sachtextformel,¹⁰ and the LIX index.¹¹ Only the Flesch variant moved in the same direction as human simplification, so I kept it and dropped the other two. Pruning a metric because it does not track reality beats keeping it because it sounds authoritative.
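For readers who have not met these formulas, two of them fit in a few lines. This is a minimal sketch, assuming the commonly cited forms (Amstad: 180 minus average sentence length minus 58.5 times average syllables per word; LIX: words per sentence plus the percentage of words longer than six letters). The function names and the counts are hypothetical, not KlarText code, and the functions take pre-computed counts because syllable counting for German is a problem of its own.

```python
def flesch_amstad(words: int, sentences: int, syllables: int) -> float:
    """Flesch reading ease in Amstad's German adaptation (higher = easier)."""
    asl = words / sentences   # average sentence length in words
    asw = syllables / words   # average syllables per word
    return 180.0 - asl - 58.5 * asw


def lix(words: int, sentences: int, long_words: int) -> float:
    """LIX index (lower = easier); long_words = words with more than 6 letters."""
    return words / sentences + 100.0 * long_words / words


# Hypothetical counts for a standard text and a simplified rewrite of it.
standard = flesch_amstad(words=100, sentences=5, syllables=200)
simplified = flesch_amstad(words=100, sentences=10, syllables=160)
print(standard, simplified)  # the simplified text should score higher
```

Shorter sentences and shorter words both push the Amstad score up, which is the direction a simplification should move it.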

Two more caveats I will not hide. The baseline statistics came from a partial Common Crawl pass (the job died around the halfway mark) and should be redone with fuller coverage before anyone leans on the absolute numbers. And, crucially, the system has not been through a Prüfgruppe, the validation by people with cognitive disabilities that is the real legal bar for Leichte Sprache. KlarText shrinks the volume a human has to review. It does not remove the human, and it was never meant to.

The lesson

The interesting parts of KlarText have almost nothing to do with how well an LLM writes German. Models are good at that now and getting better for free. The engineering, the part that is actually mine, is the scaffolding of constraints: a loop that provably halts, two orthogonal gates instead of one opaque judge, deterministic and normalised scoring you could hand to a regulator, and a trace for every decision.

Orchestrating agents well is mostly an exercise in saying no. No infinite refinement. No self-grading. No incomparable numbers. No silent loss of facts. Get those right and the agents look smart. Skip them and you have built an expensive way to be confidently wrong, which, in a domain whose entire point is accessibility for people who depend on it, is the one outcome you cannot ship.

Thanks for reading. If you are wrestling with bounding or auditing your own agent loops, I would genuinely love to compare notes.

Notes

Behindertengleichstellungsgesetz (BGG). §11 covers Leichte Sprache; §12 accessible IT; §13 the federal monitoring body (BFIT-Bund).
BITV 2.0, the Barrierefreie-Informationstechnik-Verordnung, especially §4 on Leichte Sprache. It references EN 301 549 V3.2.1 (WCAG 2.1).
Onlinezugangsgesetz (OZG), the law on improving online access to administrative services.
BFIT-Bund, Second Report on Periodic Monitoring of Accessibility Requirements (March 2025). Table 37: 27.6% of websites provided simple language in 2024; 0% achieved full BITV 2.0 compliance overall.
Simple-language translation is billed per Normseite (1,800 characters), roughly €70 to €200 depending on the provider and whether a review group (Prüfgruppe) is included. Compare Netzwerk Leichte Sprache, Klar und Deutlich, and Büro Leserlich.
BFIT-Bund, Position Paper on AI Translation Tools for simple language.
DIN SPEC 33429, “Empfehlungen für Deutsche Leichte Sprache” (March 2025), DOI 10.31030/3594547. The first consolidated German standard for Easy Language.
Klexikon dataset, built from the German children’s encyclopedia, pairing standard articles with simplified versions.
Flesch Reading Ease, in the German adaptation by Toni Amstad (1978), Wie verständlich sind unsere Zeitungen? Background: Flesch–Kincaid readability tests.
Wiener Sachtextformel, Richard Bamberger and Erich Vanecek (1984). Background: Lesbarkeitsindex (Wiener Sachtextformel).
LIX (Lesbarkeitsindex), Carl-Hugo Björnsson (1968). Background: Lix readability test.

I Got 99 Observations and Your Bike Ain't Safe

Nicolo' Brandizzi — Wed, 04 Mar 2026 00:00:00 GMT

TL;DR; I spent seven months writing down four digits every time I went to the gym (which, as the data will reveal, was not as often as I’d like). Turns out your 10,000-combination lock is really a 100-combination lock, one digit is protected by an accident I’m calling Thumb Physics, and I built way too many interactive charts to prove it.

📍 Originally published at nicolobrandizzi.com.

The Scene of the Crime

I live in an apartment building in Bonn. In the basement there’s a shared bike storage room, locked with one of those 4-digit rotary combination locks you’ve probably seen a hundred times: four little wheels, each showing a digit from 0 to 9, arranged horizontally. About 20 households share this lock, which means at least 20 people know the code. They spin the wheels to open it, park their bike, and (hopefully) scramble the digits before leaving.

10,000 possible combinations. Should be safe, right?¹

Last year my building had a series of break-ins². Someone got into the common areas, the private garages, even stole a couple of electric bikes from locked parking spots. But the bike room? Untouched. At first I was relieved. Then I started wondering: what if the thieves came back? They broke in once with alarming ease, and if they were patient enough to swing by the bike room every now and then, jot down what the lock shows, and keep a little spreadsheet going… could they eventually figure out the code and help themselves to everyone’s bikes?

Because every time someone opens that lock and scrambles it, the digits they leave behind aren’t random. Not even close. They’re the product of a quick thumb swipe across a physical mechanism, and if there’s one thing we know about humans, it’s that we are spectacularly bad at being random³. We think we’re scrambling, but what we’re actually doing is leaving a very readable fingerprint.

So I did what any reasonable person would do. I started writing down the numbers. 🕵️

Every time I went to the bike room, morning or evening, I recorded the four digits visible on the lock before entering the code. I kept this up from August 2025 through March 2026, seven months of diligently staring at a small metal object in a basement, 99 observations in total⁴.

📊 Interactive chart: Collection
This chart is interactive on the original post — explore it live here.

A confession.

I've been a fan of Distill.pub for its entire lifetime. If you've never seen it: it was an online journal for machine learning that treated presentation as a first-class citizen. Interactive visualizations, marginal notes, beautiful typography. When they shut down in 2021, I was sad.

But recently I discovered that the Distill template, the web components, the styles, the whole framework, is still open source. Anyone can use it.

So this post is also an experiment: my first attempt at bringing Distill-style visualizations to my website. If you notice an unreasonable number of interactive charts in what follows, now you know why.

I regret nothing.

The Naive Attack

A thief walks up to your lock. They’ve been watching. What do they see?

The Wheels

Let’s start simple. Each wheel can show any digit from 0 to 9, and across 99 observations, some digits show up a lot more than others. The question is: how much more?

For each of the four positions I counted how often each digit appeared, and the result is below, shown as four polar bar charts (one per wheel). You can think of each chart as a top-down view of the physical wheel: digits 0 through 9 arranged clockwise around the circle, with bar length showing how often that digit was observed. The faint inner ring marks 10%, which is what you’d expect if scrambling were truly random⁵.

📊 Interactive chart: Wheels
This chart is interactive on the original post — explore it live here.

Take a moment to look at them.

Position 1 has a clear winner, one digit towering above the rest. Position 4 is similar, with a strong favourite. Position 3 is more contested, two digits fighting for the top spot. And position 2? Position 2 is a mess. There’s a vague winner, but nothing that screams confidence.

Now, if you were a thief and could only look at these charts, the obvious strategy would be to pick the tallest bar at each position and call it your guess. Statisticians call this “the mode” (the most frequent digit), but you can also call it “the obvious one”. That’s the naive frequency attack: just go with whatever shows up the most.
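The naive frequency attack is short enough to write out. A minimal sketch: the observations below are made up for illustration (they are not rows from the real dataset), and `mode_guess` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical observations: each string is the four digits seen on the lock.
observations = ["4147", "4447", "3417", "4142", "4457", "9447"]


def mode_guess(obs: list[str]) -> str:
    """Pick the most frequent digit independently at each of the four positions."""
    guess = []
    for pos in range(4):
        counts = Counter(o[pos] for o in obs)
        digit, _ = counts.most_common(1)[0]
        guess.append(digit)
    return "".join(guess)


print(mode_guess(observations))
```

Each wheel is treated on its own here. The guess ignores any correlation between neighbouring wheels, which is exactly what makes it naive.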

Now Guess

So let’s do exactly that. Pick the mode at each position. You get: 4-4-4-7.

📊 Interactive chart: Guess
This chart is interactive on the original post — explore it live here.

Three out of four. Not bad for staring at histograms.

Positions 1, 3, and 4 are correct. Position 2 is wrong: the mode says 4, but the truth is 7. In practice this means a thief using this strategy has one digit left to figure out. Lock in the three confident positions, try all 10 options for the remaining one, and you’re done in under a minute.

But hold on. Are those three “correct” guesses equally trustworthy? Position 1’s mode dominated its chart, while position 3 barely won what was essentially a two-horse race. Should we really trust them the same way?

Spoiler: no. And that’s where things get interesting.

How Confident Should You Be?

Not all three correct guesses are equally reliable. Position 1’s mode appeared 28 times out of 99, with the runner-up at just 14. Position 3’s mode got 22, the runner-up 17. That’s a gap of 5 observations, which is to say: a couple of different neighbours showing up on a couple of different days, and the ranking could easily flip.

So how do we figure out which guesses to actually trust?

Bootstrapping!

Here’s a thought experiment. Imagine you have a bag with 99 marbles, and each marble has a digit written on it, matching one of your observations for a given position. You shake the bag, pull out a marble, write down the digit, and put it back⁶. Then you do it again. 99 times total.

Now you look at your 99 drawn digits and ask: what’s the mode? Is it the same digit that won in the original data?

Maybe. Maybe not. Some marbles got picked twice, others got skipped entirely. The histogram looks slightly different every time. If the original winner had a big lead (like position 1, where 4 leads), it will almost certainly still win. But if it was a close race (like position 3), a few unlucky draws and suddenly the runner-up takes over.

That’s called bootstrapping, and if the name sounds ridiculous, that’s because it kind of is. The idea is that you can pull yourself up by your own data, generating confidence from the same observations you already collected. I ran this process 100,000 times (my laptop was not happy about it) and counted: in how many of those simulated experiments does the original mode still come out on top? That percentage is your bootstrap stability, basically a measure of how much you should trust the guess.
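Here is the marble bag as a minimal sketch, not the code behind the charts. The top two counts in each list come from the text (28 vs 14 for position 1, 22 vs 17 for position 3). The rest of each distribution is invented so the totals reach 99, and 2,000 resamples stand in for the 100,000 used in the post. Counting ties as losses is my assumption.

```python
import random
from collections import Counter


def bootstrap_stability(digits: list[int], n_resamples: int = 2000, seed: int = 0) -> float:
    """Fraction of resamples (with replacement) where the original mode is still the sole mode."""
    rng = random.Random(seed)
    original_mode = Counter(digits).most_common(1)[0][0]
    wins = 0
    for _ in range(n_resamples):
        resample = rng.choices(digits, k=len(digits))  # draw with replacement
        counts = Counter(resample)
        top = max(counts.values())
        leaders = [d for d, c in counts.items() if c == top]
        if leaders == [original_mode]:  # ties count as losses (an assumption)
            wins += 1
    return wins / n_resamples


def expand(counts: list[int]) -> list[int]:
    """Turn per-digit counts (index = digit) into a flat list of observations."""
    return [digit for digit, c in enumerate(counts) for _ in range(c)]


# Index = digit. Only the counts for digits 4 and 1 are taken from the text.
strong = bootstrap_stability(expand([8, 14, 8, 7, 28, 7, 7, 7, 7, 6]))
close = bootstrap_stability(expand([8, 17, 8, 8, 22, 8, 7, 7, 7, 7]))
print(strong, close)
```

With a 14-observation lead the mode should survive almost every resample, while the 5-observation lead loses a noticeable share of them.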

Try It Yourself

Don’t take my word for it.

📊 Interactive chart: Resampler
This chart is interactive on the original post — explore it live here.

Click “Resample” a few times and watch what happens. Position 1 is absurd: digit 4 wins 97.3% of resamples, with a gap ratio⁷ of 78.8x over the runner-up. It doesn’t even flinch, no matter how many times you click. Position 4 is solid too: digit 7 takes it 80.9% of the time, gap ratio 5.0x.

Now look at position 3. Digit 4 wins 73.8% of resamples, but digit 1 is right there at 18.9%, breathing down its neck. The gap ratio is just 3.9x. This guess is probably right, but if you asked me to bet my bike on it, I’d want a few hundred more observations first.

And position 2? This is where it gets fun. On the surface, it looks stable: digit 4 wins 64% of resamples with a gap ratio of 3.4x. Seems perfectly fine, right? But the true digit at position 2 is 7. Poor, poor 7. It wins just 0.6% of resamples. The bootstrap is very confident here. Confidently wrong.

We’ll come back to that.

For now, the takeaway: positions 1 and 4 are cracked. Position 3 is likely cracked but shaky. And position 2 is telling us a very convincing lie.

So where does this leave our thief?

The Attacker’s Playbook

Now, I don’t want to give anyone ideas (okay, maybe I do, a little), but if you’ve been following along, you might have already realized that our hypothetical thief doesn’t need to be particularly smart. They just need to be patient and slightly methodical.

You’ve been watching the lock. You’ve done the frequency analysis. You know positions 1 and 4 are almost certainly 4 and 7. Position 3 looks solid: digit 4 wins 73.8% of resamples with a gap ratio of 3.9x. Position 2 is your weakest link: digit 4 wins 64.0% with a gap of just 3.4x.

A naive thief just tries 4-4-4-7 and either gets lucky or doesn’t. But a smarter thief reasons differently: lock in the two high-confidence digits (positions 1 and 4) and brute-force the remaining two. Positions 2 and 3 each have 10 possible digits. That’s 10 × 10 = 100 combinations to try.

Your 10,000-combination lock just became a 100-combination lock.

And an even smarter thief (we’re really levelling up here) doesn’t try them randomly. They rank the 100 codes by likelihood and work down the list, most promising first. Attempt #1: 4-4-4-7, the all-modes guess. Attempt #2: swap in position 3’s runner-up: 4-4-1-7. And so on. Each attempt takes maybe 3 to 5 seconds on a rotary lock. If the true code happens to be near the top of the ranked list, you’re done in under a minute. Even in the worst case, 100 attempts is 5 to 8 minutes.

The Search Space

Here’s what the thief’s ranked list looks like.

To see why this works, let’s look at the full space of 10,000 possible codes, ranked by how likely they are under our frequency model⁸. The green bars are codes inside the attack zone (everything matching 4-?-?-7), the gray bars are everything else. You’ll notice that some gray codes rank higher than many green ones, and that’s because a code like 4-4-4-0 can still score well if its individual digits are all frequent, even though it falls outside the attack zone.
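The ranking can be sketched in a few lines. The likelihood of a code is the product of its per-position observed frequencies. The count table is hypothetical: only a handful of entries (such as 28 and 14 at position 1) match numbers quoted in the text, and the rest are filler chosen so each row sums to 99. Expect ranks from this toy table to differ from the ones in the post.

```python
from itertools import product

# COUNTS[pos][digit]: hypothetical observation counts, each row sums to 99.
COUNTS = [
    [8, 14, 8, 7, 28, 7, 7, 7, 7, 6],    # position 1
    [10, 11, 9, 9, 19, 9, 8, 9, 8, 7],   # position 2
    [8, 17, 8, 8, 22, 8, 7, 7, 7, 7],    # position 3
    [12, 9, 9, 8, 8, 8, 8, 23, 7, 7],    # position 4
]


def rank_codes(counts: list[list[int]]) -> list[tuple[int, ...]]:
    """Return all 10,000 codes, most likely first, under the independence model."""
    totals = [sum(row) for row in counts]
    scored = []
    for code in product(range(10), repeat=4):
        likelihood = 1.0
        for pos, digit in enumerate(code):
            likelihood *= counts[pos][digit] / totals[pos]
        scored.append((likelihood, code))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [code for _, code in scored]


def in_attack_zone(code: tuple[int, ...]) -> bool:
    """True for codes matching 4-?-?-7."""
    return code[0] == 4 and code[3] == 7


ranked = rank_codes(COUNTS)
true_code = (4, 7, 4, 7)
print("top guess:", ranked[0])
print("rank of true code:", ranked.index(true_code) + 1)
print("attack-zone codes in the top 100:", sum(in_attack_zone(c) for c in ranked[:100]))
```

A code outside the zone, such as 4-4-4-0, can outrank a zone code like 4-9-9-7 because three of its digits are individually frequent.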

📊 Interactive chart: Searchspace
This chart is interactive on the original post — explore it live here.

The true code, 4-7-4-7, ranks #19 out of 10,000 by likelihood. But that rank isn’t stable: across bootstrap resamples it bounces around quite a bit⁹. What matters is that about 73% of the time, the true code falls within the top 100. The attack zone works.

So here’s the recipe for our thief. Observe the lock about 100 times. Pick the mode at positions 1 and 4. Rank the remaining 100 combinations by frequency. Start at the top. On a good day, you’re in within a couple of minutes.

The Stubborn Digit

This is the part that kept me up at night. And I mean that almost literally, because once you see the pattern, it’s the kind of thing that makes you stare at your ceiling wondering if statistics is broken.

Position 2’s true digit is 7. In our data, it appears 9.1% of the time. That’s below the 10% you’d expect from perfectly random scrambling. The mode, digit 4, shows up at roughly 20%. So not only does the true digit lose, it loses to what should have been random noise. It’s like entering a race and finishing behind someone who wasn’t even trying.

And here’s the kicker: more data makes this worse, not better.

To understand why, I did what any self-respecting data person does when reality refuses to cooperate: I simulated a version of reality where I could control the sample size and watch what happens. Take the observed frequency distribution at position 2 and use it as ground truth. Draw N synthetic observations from it, look at the frequencies of digit 4 and digit 7. Repeat 20,000 times. Do this for N = 10, 25, 50, 100, all the way up to 5,000.
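A minimal sketch of that simulation, under stated assumptions. The weights 19/99 for digit 4 and 9/99 for digit 7 are inferred from the percentages quoted above, and every other digit is lumped into "other". I use 500 repeats and three sample sizes rather than the full 20,000 and the full range, so it runs in a moment.

```python
import random

# Inferred from the text: digit 7 at 9.1% (9/99), digit 4 about 10.1 points higher (19/99).
WEIGHTS = {"4": 19, "7": 9, "other": 71}


def simulate(n: int, repeats: int = 500, seed: int = 0):
    """Return (mean freq of 4, mean freq of 7, rough 95% interval of the gap) at sample size n."""
    rng = random.Random(seed)
    labels = list(WEIGHTS)
    weights = list(WEIGHTS.values())
    freq4, freq7, gaps = [], [], []
    for _ in range(repeats):
        draw = rng.choices(labels, weights=weights, k=n)
        f4 = draw.count("4") / n
        f7 = draw.count("7") / n
        freq4.append(f4)
        freq7.append(f7)
        gaps.append(f4 - f7)
    gaps.sort()
    lo = gaps[int(0.025 * repeats)]        # approximate 2.5th percentile
    hi = gaps[int(0.975 * repeats) - 1]    # approximate 97.5th percentile
    return sum(freq4) / repeats, sum(freq7) / repeats, (lo, hi)


for n in (10, 100, 1000):
    mean4, mean7, (lo, hi) = simulate(n)
    print(n, round(mean4 - mean7, 3), (round(lo, 3), round(hi, 3)))
```

The mean gap should hover around the same value at every N while the interval around it narrows. That is the divergence in miniature.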

It’s essentially the same bootstrapping idea from before, just with a dial for sample size. And the results are, honestly, a bit devastating.

The Divergence

The chart below shows two lines: the observed frequency of digit 4 (the wrong mode, in red) and digit 7 (the true code, in green). The shaded bands are the 95% confidence intervals at each sample size. Drag the slider to change N.

📊 Interactive chart: Divergence
This chart is interactive on the original post — explore it live here.

Here’s what you’ll notice. The gap between the two lines never changes. It’s always about 10.1 percentage points, whether you’re looking at N=10 or N=5,000. What changes is the uncertainty around each line. At small N, the bands are wide and overlapping, and you genuinely can’t tell the two digits apart. Maybe digit 7 really is more common and you just got unlucky. There’s hope.

Now drag the slider to the right. Watch the bands shrink. By N=100 (roughly what I collected), they barely touch. By N=500, they’re fully separated. No overlap. No ambiguity. Digit 4 is observed more often than digit 7, and it’s not even close.

The gap was always there. More data doesn’t close it. It just removes any hope that it might be noise.

A thief looking at this data would have absolutely no reason to doubt the conclusion. Every statistical test in the book would confirm it: digit 4 appears twice as often as digit 7 at position 2. Pick 4. Move on. And they’d be wrong.

This is not a sample-size problem. It’s a signal problem. The scrambling process at position 2 actively buries the true digit under a false favourite. Three out of four positions are recoverable with enough patience. But the remaining one, position 2, is protected, not by design, but by accident.

Which brings us to the obvious question: why? What is it about position 2 that makes the data lie?

The Physics of a Thumb

I’m fairly sure Thumb Physics is not an established branch of science. If it is, I’d like credit for arriving at it independently. If it isn’t, I’m claiming it. Put it on my tombstone.

Anyway. Let’s step back from the statistics for a moment.

The lock is horizontal. Four wheels in a row. When someone scrambles it, they don’t carefully and independently rotate each wheel like some kind of safecracker in a movie. They swipe a thumb across all four wheels in one quick motion, usually left to right, and call it a day.

This simple physical fact explains a lot. But not everything.

Inner vs Outer

The outer wheels (positions 1 and 4) sit at the edges of the thumb’s arc. They get clipped, nudged, sometimes skipped entirely. The inner wheels (positions 2 and 3) are right in the middle of the swipe and receive the full force of the rotation.

The data backs this up. Outer positions stay unchanged (showing the true digit) roughly 26% of the time, while inner positions only about 10 to 16%¹⁰. The difference is statistically significant. Your thumb simply doesn’t reach the edges as well as the middle, which makes intuitive sense if you’ve ever tried to swipe four tiny wheels with one finger.
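The inner/outer comparison is a 2×2 table, and the notes say it was tested with Fisher’s exact test. Below is a standard-library sketch of a two-sided version. The table is partly guesswork: the unchanged counts for positions 1, 2 and 3 (28, 9, 22) follow from numbers quoted earlier, but the 23 for position 4 is my assumption, picked so the outer share lands near 26%. Treat the p-value as illustrative rather than a reproduction of the reported one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = prob(x)
        if p <= observed * (1 + 1e-9):  # tables at least as unlikely as the observed one
            p_value += p
    return p_value


# Rows: outer wheels (positions 1 and 4), inner wheels (positions 2 and 3).
# Columns: true digit still showing, true digit scrambled away.
outer_unchanged = 28 + 23   # the 23 for position 4 is an assumption
inner_unchanged = 9 + 22
p = fisher_exact_two_sided(outer_unchanged, 198 - outer_unchanged,
                           inner_unchanged, 198 - inner_unchanged)
print(p)
```

The test asks how surprising a split this lopsided would be if inner and outer wheels were scrambled equally well.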

This explains why positions 1 and 4 are easy to crack: they don’t get scrambled enough, so the true digit still shows through in the frequency data. But it doesn’t explain position 2’s specific problem. Why does digit 4 dominate there when the true digit is 7?

The Mystery of Position 2

I had a beautiful theory for this: the “neighbour effect.” Position 2 sits between two 4s, so maybe the thumb leaks the neighbours’ digit into it. Elegant! Then I checked position 3, which sits between two 7s. Its mode is 4, not 7. Theory dead on arrival.

Here’s what we actually know. Digit 4 is the most common digit at three out of four positions. Across the entire lock, it appears at 19.4%. It’s not just popular at position 2, it’s popular everywhere. It seems to be generically “sticky” on this lock, showing up far more than its fair share regardless of what the true digit is.

Why? I don’t have a satisfying answer, and believe me, I tried. It could be something about the physics of this particular lock, or maybe something about rotary locks in general, that we simply can’t disentangle with one lock and 99 observations. If you’re a locksmith or a biomechanics researcher reading this, I would love to hear your theories.

The data shows the pattern clearly. The mechanism remains a mystery.

99 observations later, the lock and I have a complicated relationship.

So, Can You Steal My Bike?

Probably. Yes.

Positions 1 and 4 are trivially recoverable, with the true digits dominating the frequency data from as few as 25 observations. Position 3 is recoverable but shaky, the kind of result that would benefit from a few hundred more data points before you’d feel truly confident. And position 2 is protected by what I can only describe as a statistical accident: the true digit appears below random chance, and the more data you collect, the more confidently the analysis points you in the wrong direction.

Your 10,000-combination lock provides the security of about 100 combinations. A patient observer with roughly 100 data points (seven months of gym trips, in my case, with some embarrassing gaps in between) can rank those 100 by likelihood and brute-force through them in a few minutes. About 73% of the time, the true code is in there.

Or you could just cut the lock. Probably faster.

The dataset is available at the end of this post if you want to replicate the analysis or try to do better than naive frequency. I’m still collecting data¹¹. And no, I haven’t changed the code. For science.

⬇ Download dataset (LockDigits.csv)

Notes

We also had a break-in in our apartment that year. It was rough. On the bright side, it eventually led to us getting our two cats, Alba and Nera. Grazie, ladri.
Figurska, Stańczyk & Kulesza (2008), “Humans cannot consciously generate random numbers sequences.” Published in Medical Hypotheses. Best paper title in the history of science.
Not 100. I stopped at 99 because I wanted you to feel the same mild frustration I felt staring at position 2 (more on that later). Also, I use the bike mostly to go to the gym, so the data collection was… seasonal. There are some heroic stretches of daily observations and some long, suspiciously lazy gaps where I clearly didn’t leave the house for weeks.
If every digit were equally likely after scrambling, each would appear 10% of the time. That ring is your null hypothesis, staring back at you.
This “put it back” part is important. It means the same observation can be picked more than once in a single draw, and some observations won’t be picked at all. This is what statisticians call sampling with replacement. It’s the key ingredient that lets you simulate variability from a single dataset.
The gap ratio is simply how many times more often the winner comes out on top than the runner-up across all resamples. A gap of 54x means digit 4 won 54 times more often than the second-place digit. In the visualization, you can see it as the bracket between the top two bars in each panel.
The likelihood of a code is the product of per-position observed frequencies. For code 4-4-4-7, you multiply the observed frequency of digit 4 at position 1 × digit 4 at position 2 × digit 4 at position 3 × digit 7 at position 4.
Median: 39. Mean: 90. 95% CI: [3, 503]. Sometimes the true code is in the top 10, sometimes it’s in the hundreds. About 58% of the time it’s in the top 50.
Fisher’s exact test gives p=0.018 for the inner/outer asymmetry. Significant at the 5% level.
If you live in my building: hi 👋. Please keep scrambling normally. It’s for research.

The SOTA Trap: Why We Are Working for Big Tech for Free?

Nicolo' Brandizzi — Tue, 09 Dec 2025 00:00:00 GMT

TL;DR; Big Tech papers are increasingly citing themselves and hiding their data recipes, effectively launching a Denial of Service attack on academic labs. By forcing researchers to waste months reverse-engineering “advertisements” disguised as science, they paralyze the competition while extracting free QA testing in return.

📍 Originally published at nicolobrandizzi.com.

Introduction

Last week, I was sitting in a lab meeting, and the vibe was… tense.

We were discussing a new paper dropped by a major tech company (you know the ones). The abstract was beautiful. The results were SOTA. The “intuition” section was poetic. But when it came to the implementation details? Vague.

“We used standard hyperparameters,” the paper claimed.
(Narrator: They did not, in fact, use standard hyperparameters.)

As my colleagues debated how to reverse-engineer the learning rate schedule and how many thousands of GPU hours we’d need to burn just to verify their baseline, I had a realization:

We are not doing science. We are doing free Quality Assurance (QA) for Big Tech.

We are taking their “Teaser Papers” (which are essentially advertisements for their proprietary models) and spending months of our lives trying to make them work. We debug their frameworks (PyTorch/JAX), we optimize their architectures, and we validate their claims. And what do we get in return? A citation?

It felt like a trap. We are chasing their tail, incentivized to replicate their “Scale” results, while they hold all the keys (the data and the compute).

And once you see it, you can’t unsee it. And the data backs this up.

The Disappearing Act: Where is Europe?

If you want to see who is winning this “Scale Game”, you don’t need to look further than the acceptance statistics for NeurIPS 2025 ¹.

I was shocked to read the preliminary numbers.

China has surged to capture roughly 36% of accepted papers.
The US stands its ground, matching that volume with another 36%.
Tsinghua University is now neck-and-neck with Google for the top spot, while US Big Tech (Google, Meta, Microsoft) fortifies the American flank.

And Europe? We are being squeezed out, dropping to just 15% of the total share (a stinging 2% reduction from last year).

Now, you might point out that this is partly due to “Home Bias” (the tendency for researchers to cite and review work from their own country more heavily). And yes, recent studies confirm that Chinese researchers cite each other at a rate of nearly 57%² (against 37.1% for the US), creating a massive, self-reinforcing echo chamber.

But blaming “Home Bias” misses the bigger, scarier point.

The problem is that the “game” of AI research has shifted entirely to Capital-Intensive Science. To get a paper into NeurIPS in 2025, you often need to run experiments that cost more than an entire university department’s annual budget.

The “Compute” Fallacy

People often look at this and say, “Oh, Europe just needs more GPUs! That’s why we have the AI Factories!”

But I’m starting to think it’s not only about hardware. Yes, we need the compute, but the US Big Tech companies aren’t just winning because they have H100s. They are winning because they have created a closed loop of “Open-Washing”.

They publish papers that look open but are actually insular. A new study on “Industry Insularity” found that Big Tech papers increasingly cite only other industry papers³. They are building a walled garden of knowledge that references itself to validate itself.

Europe is trying to play “fair” in a game where the other two players are either:

Flooding the zone with volume (China).
Privatizing the recipe while publicizing the result (US Big Tech).

We are stuck in the middle, wasting our limited resources trying to replicate results that might not even be reproducible in principle.

The Mechanism: “Teaser Papers” and Open-Washing

So, how exactly are they doing this?

If you look closely at the “Open Source” contributions from Big Tech, you’ll notice a pattern. They don’t release science; they release software.

They give us the weights (Llama, Mistral) and the frameworks (PyTorch, TensorFlow, JAX). But they systematically withhold the most critical component of modern AI: the Data Recipe.

This has given rise to what I call the “Teaser Paper”. A Teaser Paper is a document that looks like a research paper, smells like a research paper, and is formatted in LaTeX like a research paper. But it is actually a marketing brochure.

It usually goes like this:

The Architecture: Fully described (because it’s usually just a standard Transformer with a minor tweak).
The Results: Cherry-picked SOTA benchmarks.
The Data: “We trained on a dataset of 5 trillion tokens filtered for quality.” (End of section).

This is a textbook execution of a business strategy famously described by Joel Spolsky twenty years ago: “Commoditize Your Complement”⁴.

The idea is simple: If you sell Proprietary Models (the product), you want the Infrastructure (the complement) to be cheap and ubiquitous.

By releasing PyTorch and Llama, Big Tech drives the cost of building AI down to zero.
But by keeping the Data and Training Engineering secret, they ensure that the value stays locked inside their servers.

They want us to be expert mechanics for their engines, but they don’t want us to know how to build the engine factory.

The Denial of Service Attack on Science

This brings me to the part that really keeps me up at night.

Hiding secrets is only half the story. The more insidious issue is that these “Teaser Papers” actively sabotage the rest of the research community.

Think about it. When DeepMind or OpenAI publishes a vague paper claiming a massive breakthrough, what happens in labs across Europe (and the world)? Hundreds of PhD students and researchers drop what they are doing. They spend the next 3-6 months trying to replicate that result.

They burn millions of GPU hours trying to guess the “standard hyperparameters” that were omitted. They waste weeks scraping data to match the “quality” described in one vague sentence.

This is effectively a Distributed Denial of Service (DDoS) attack on academic research.

Instead of working on new ideas, new architectures, or new paradigms, the brightest minds in our universities are tied up acting as unpaid Quality Assurance testers for Big Tech products.

If we succeed in replicating it? Great, we just proved their product works.
If we fail? We assume we did something wrong, not that the paper was misleading.

We are trapped in a “Reproducibility Loop”, running on a hamster wheel designed by marketing departments in Silicon Valley. And while we are busy fixing their bugs, we aren’t building the next thing that could make them obsolete.

Conclusion: Stop Chasing, Start Leading

So, what do we do? Do we just give up and let the gap between “Open AI” (the marketing term) and “Open Science” (the practice) widen until we become irrelevant?

No. But we need to stop taking the bait.

We need to collectively agree that a paper without code and data is not a scientific contribution; it is a press release. We should stop treating these “Teaser Papers” as the gold standard to be chased. When a lab spends six months replicating a result vaguely described by a corporate team, they are essentially donating public resources to a private shareholder meeting.

A New Mandate for Academic AI

For us in Europe (and academia globally), the path forward can’t be to try to out-scale Meta or Google. We will lose that game every time. We don’t have their margins, and we don’t have their user data.

Instead, we need to return to Mechanism, not just Metrics.

Don’t build bigger; build smarter. Focus on efficiency, interpretability, and the theoretical underpinnings that Big Tech ignores in their race to add another trillion parameters.
Democratize the stack. Support initiatives that actually release the recipe, not just the meal.
Value Reproducibility. A paper that explains why something works on a single GPU is infinitely more valuable to the scientific community than a paper that claims 99% accuracy on a cluster we will never have access to.

We have the talent. We have the intuition. But we need to stop acting like unpaid interns for Silicon Valley.

It’s time to stop working for free, and start doing science again.

Notes

Credit to AI World. I highly recommend their interactive dashboard—toggle between 2024 and 2025 to watch the European contribution vanish in real-time.
“Paper Tiger? Chinese Science and Home Bias in Citations” (NBER 2024). A fascinating read if you want to see the raw numbers on citation cartels vs. structural bias.
“Big Tech-Funded AI Papers Have Higher Citation Impact, Greater Insularity, and Larger Recency Bias” (arXiv, Dec 2024). Basically, Big Tech likes to talk to itself.
“Strategy Letter V: Commoditize Your Complement” (Joel Spolsky, 2002). It’s old, but it explains 90% of what is happening in AI today.

Teaching Machines to Think: Reinforcement Learning and Reasoning in LLMs

Nicolo' Brandizzi — Wed, 15 Oct 2025 00:00:00 GMT

TLDR; We’ve taught machines to talk, but not to think. This post walks through how reinforcement learning (RL) helps large language models (LLMs) reason, using lasagnas, grocery shopping, and a few mild existential crises along the way.

Before We Begin

It’s been a while since I wrote a blog post. I’ve been wanting to for months, but nothing ever felt like the right topic.

If you judged me only by my posts, you’d probably think I’m some kind of policy or data-science communicator. You wouldn’t be entirely wrong. But my real field (my PhD topic) is AI, specifically language modeling and reinforcement learning¹.

A while back, I was asked to give a presentation at a Lamarr meeting on how reinforcement learning is used in large language models to induce reasoning. It took me weeks of literature digging, but I ended up happy with it: a dense, technical presentation about how reasoning actually gets trained into these systems.

I could just put that talk into writing, but that’s not really what I enjoy doing. What I enjoy is science communication, making difficult things understandable. As a scientist, I feel responsible not only for advancing research but for removing the illusion that it’s magic. Somewhere out there, in the infinite space of possible sentences, there’s a combination of words that can make anyone understand anything. That’s the one I’m always looking for.

So today, I’m trying again, this time to explain how reinforcement learning teaches language models to reason, with plenty of analogies².

The images you’ll see along the way were generated with AI, because if we’re going to talk about machine reasoning, it feels only fair to let machines illustrate it too.

If it wasn’t already obvious from the TL;DR, here’s the plan: we’ll start with a quick primer on RL, move to how it’s used to train reasoning in LLMs (yes, we’ll talk about GRPO), and finish with what I think are the most promising future directions.

If, after reading, something still feels unclear, tell me. I mean it. I can’t get better without feedback (spoiler).

Alright. Let’s start.

Why Bother Teaching Machines to Think?

Did you read the title? Why should you care? I mean, ChatGPT writes my emails, sometimes even my code. Isn’t that good enough?

Well, yes and no.

Modern language models can write fluently, summarize papers, and even solve logic puzzles. But they often do all this without really understanding what they’re doing. They sound intelligent, yet when a problem requires several steps (planning, checking, revising) they can lose track of the goal, contradict themselves, or just guess.

This raises a deeper question: Can we teach language models to think, rather than just predict the next word?

📍 Originally published at nicolobrandizzi.com.

Reasoning About… Grocery Shopping

To answer that, we first need to unpack what reasoning actually means.

According to the Cambridge Dictionary, reasoning is “the process of thinking about something in order to make a decision”. Let’s take it apart:

The process implies something that happens over time, repeated steps that build on each other.
Thinking about something means imagining, anticipating, or mentally simulating an outcome.
In order to make a decision reminds us that reasoning has a purpose. It’s not thinking for the sake of thinking; it’s about getting somewhere.

Now, think about grocery shopping.

You walk in with a few vague goals (dinner tonight, breakfast tomorrow) but no fixed plan. You look around and start making choices:

State: you’re standing in front of the vegetables.
Action: you reach for the tomatoes.
Reward: you remember you already have some at home… bad move. You put them back.

Each small decision updates your plan. Maybe you realize pasta would go well with the leftover sauce. You mentally simulate: “If I buy pasta, I’ll need cheese… do I still have some?” That’s reasoning. Evaluating possible futures and picking the one that best fits your goals: a good dinner, minimal waste, staying on budget.

Now imagine doing the same thing inside a machine. A reasoning AI must also plan, evaluate, and adjust its steps. Not just predict what comes next, but what’s useful next.

Why Reinforcement Learning?

Notice something familiar in that grocery story? The trio (state, action, reward) isn’t there by accident. It’s the backbone of reinforcement learning (RL).

So why RL? Because reasoning isn’t static; it’s interactive. You take an action, observe what happens, and adjust. That feedback loop is exactly what RL captures.

Of course, RL isn’t the only way to approach reasoning, but it’s one of the most promising. So, for the rest of this post, we’ll assume RL can be the steering wheel that guides reasoning forward.

RL 101: Learning by Doing

If you already know how reinforcement learning works, feel free to skip ahead. But if you don’t, don’t worry. It’s simpler than it sounds.

You go through it every day. When you’re hungry, you open the fridge and grab an apple. You eat it. You feel better. That’s RL… in a way.

Being hungry was your state. You didn’t like it, so you changed it with an action (eating the apple). The outcome, not being hungry anymore, is your reward³.

It might sound trivial, but that’s the essence of how RL works: it’s a framework for learning from interaction. At its core, there’s always the same triple (state, action, reward). In this case: (hungry, eating, feeling better).

But wait! Whose state are we talking about? And who decides the reward?

Good questions. We’re missing two more ingredients:

The agent: that’s you. In our example, you’re the one acting: hungry, deciding, eating. In general, the agent is whatever interacts with the world.
The environment: that’s everything else, the world responding to your actions. When you touch fire and it burns, that’s the environment teaching you a lesson.

The environment provides the reward; the agent learns from it. And this back-and-forth is the foundation of RL.
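To make the loop concrete, here is a minimal Python sketch of the apple example. Everything in it is an assumption made for illustration: the action names, the reward numbers, and the tiny `environment` function are invented, and the agent simply keeps a running average of the reward each action has earned.

```python
import random

# Hypothetical rewards the environment hands out for each action.
REWARDS = {"eat_apple": 1.0, "ignore_hunger": -1.0, "touch_fire": -5.0}

def environment(state, action):
    """The world responds to an action with a reward and a new state."""
    reward = REWARDS[action]
    next_state = "fed" if action == "eat_apple" else state
    return reward, next_state

def run_agent(episodes=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = {action: 0.0 for action in REWARDS}  # the agent's estimate per action
    count = {action: 0 for action in REWARDS}
    for _ in range(episodes):
        state = "hungry"
        if rng.random() < epsilon:
            action = rng.choice(list(REWARDS))      # occasionally try something else
        else:
            action = max(value, key=value.get)      # otherwise do what worked so far
        reward, _next_state = environment(state, action)
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]  # running average
    return value

values = run_agent()
best_action = max(values, key=values.get)
print(best_action, values)
```

The agent starts out knowing nothing and ends up preferring the apple purely because the environment rewarded it.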

Reinforcement vs Supervised Learning: Who Drives Better?

If you’ve spent any time around AI, you’ve probably heard the term supervised learning. So how is reinforcement learning different? Roughly in the same way that studying for a driving test differs from actually driving a car.

Imagine you want to become a world-class driver. You spend years poring over manuals; the theory of torque, road signs, braking distances, every page of the highway code. When exam day comes, you ace every question. But the moment you sit behind the wheel, you stall at the first turn.

What happened? You know everything about driving, but you’ve never experienced the feedback loop of steering, correcting, and feeling how the car responds. Your knowledge is second-hand, distilled through other people’s experiences. That’s supervised learning: learning from examples that someone else has already labeled.

Reinforcement learning, on the other hand, is learning by driving. You take the wheel, make mistakes, get real-time feedback (sometimes a reward, sometimes a penalty), and adjust your behavior. Over time, your reflexes sharpen, your timing improves, and you stop thinking about every move: you’ve learned through experience, not description.

Both methods are valuable. You can’t pass a driving exam without studying the rules, but you also can’t drive safely without practice. Theory gives you structure; experience gives you judgment. The same holds for AI: supervised learning provides the foundation (the grammar, facts, and associations) while reinforcement learning teaches the model how to act, decide, and self-correct.

In large language models, however, nearly all the effort goes into the studying part. Hundreds of billions of words are fed into models through supervised learning, while reinforcement learning, the actual driving practice, represents only a tiny final phase of training. That’s why most models can speak so well, yet still struggle to think through the curves.

How RL Teaches Machines to Reason

Either you skipped here or actually read the introduction. In any case, you’re in for a ride (pun intended).

In this section, we’ll go through the state of the art in RL for reasoning in large language models. So, strap in.

Alright, But How Do They Actually Reason?

We’ve seen what reasoning means, and we’ve talked about LLMs. But how do these two actually work together?

Step 1: Pretend to Think

When we say a model “reasons”, we usually imagine it writing down its thoughts step by step, like

“First, I’ll do this… then that.”

That’s what we call token reasoning: reasoning in the open, expressed through text. It’s transparent and easy to supervise because we can literally read what the model is thinking and check whether each step makes sense. But that view can be misleading. As researchers from Anthropic have shown, models often don’t think this way internally. They produce these explanations because we trained them to sound like they’re reasoning, not necessarily because they are reasoning.

The real thinking happens under the surface, in what we call the latent space. You can imagine it as the model’s inner world, a space where ideas mix and evolve before turning into words.

It’s similar to what happens in your own head before you speak. You might compare prices of potatoes or picture tonight’s lasagna without saying a single word. That silent, intuitive, pre-verbal process is the latent space.

Both forms of reasoning coexist. The first makes reasoning visible to us. The second makes reasoning possible for the model.

Thinking the GES Way

Let me introduce you to what I like to call GES.

GES stands for Generate, Evaluate, and Select, and it forms the basic foundation of reasoning in LLMs. Let’s break it down.

Generate: From the current step (or state, if you prefer), we want to generate possible future thoughts. Think of this like grocery shopping. You already have tomato sauce for your pasta. Should you get onions? Carrots? Broccoli? You start generating options: what’s tastier, what’s closer, what fits your plan?
Evaluate: Now you assess each option. How? That depends, but let’s stick with the shopping analogy. You realize onions would make your breath stink and you have a date later, so you give that option a low score. Carrots sound better.
Select: You now have a few good candidates, each with a score. Time to choose. Maybe you go with carrots. But you might also choose to explore instead of exploit. Try the broccoli just because you’ve never made pasta with it before. Exploration is part of reasoning too.
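The three steps above can be sketched as a small Python loop. This is only a toy under stated assumptions: the candidate vegetables and their scores are hard-coded stand-ins for what a model would generate and score, and `explore_prob` is a hypothetical knob for how often we pick something other than the top option.

```python
import random

def generate(state):
    """Generate: propose candidate next steps (hypothetical options)."""
    return ["onions", "carrots", "broccoli"]

def evaluate(state, option):
    """Evaluate: score a candidate. Hand-written scores stand in for a real scorer."""
    scores = {"onions": 0.2, "carrots": 0.8, "broccoli": 0.6}
    return scores[option]

def select(scored, explore_prob, rng):
    """Select: usually exploit the top score, sometimes explore another option."""
    if rng.random() < explore_prob:
        return rng.choice([option for option, _ in scored])
    return max(scored, key=lambda pair: pair[1])[0]

def ges_step(state, explore_prob=0.1, seed=0):
    rng = random.Random(seed)
    candidates = generate(state)
    scored = [(option, evaluate(state, option)) for option in candidates]
    return select(scored, explore_prob, rng)

choice = ges_step("have tomato sauce", explore_prob=0.0)
print(choice)
```

With exploration switched off, the loop always goes for the carrots; turning `explore_prob` up is what occasionally puts broccoli in the basket.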

Choosing the Best Move

I promised we’d get to evaluation, so here we are. How do you choose the best action out of so many options?

Well, you have a few choices. And since I’m getting hungry, let’s talk food again.

Cooking Lasagna

It’s Saturday morning. Tonight you’re hosting dinner. You promised lasagna. The ingredients are all on the counter; the one thing missing is confidence. But that never stopped you before, so you begin.

Here’s where the timeline splits. Three versions of you are now making lasagna, each with a different kind of feedback.

1. I Need a Chef

In this universe, Chef Ramsay materializes by your side⁴. He’s made so many lasagnas he could put an Italian grandma to shame. At every step, he gives feedback: loud when you’re wrong, less loud when you’re right. He never touches anything, but he keeps you on track.

With his help, you get through the whole process. The lasagna looks solid. The guests are on their way.

2. I Am the Chef

This version of you wasn’t blessed by teleporting Ramsay. But you’ve watched plenty of cooking shows and tried a few lasagnas before.

At each step, you check what you’ve done and correct yourself. The sauce looks too pale? Add more tomatoes. The water tastes flat? Add salt. You adjust as you go, trusting your own sense of taste and memory.

After a while, it looks right. Now it’s up to the guests to decide.

3. I Am All Chefs at Once

Here, you get creative. Why make one big lasagna when you can make 50 small ones and experiment?

You vary the salt, tomato, and ragu in each, bake them all, and taste them one by one. Some are disasters. One tastes amazing. You check your notes, replicate that version full-scale. Now the lasagna is ready and you hear the bell ring.

We’ve seen three ways of making lasagna, but what does that have to do with reasoning in LLMs?

Here’s the parallel:

Expert opinion: also called a learned heuristic. You use another neural network trained to recognize the right action at each step (your Chef Ramsay).
Self-evaluation: similar to frameworks like ReAct. The model reflects and corrects itself using its own knowledge.
Simulation: the model runs quick mental “what-if”s using a simulated environment (the heat, ingredients, taste). These are approximations, not reality, but they help anticipate which direction is promising.

When Reasoning Meets Compute

By now, we’re familiar with all the ingredients: RL, LLMs, reasoning, and how they mix together. So what now?

When I was compiling my literature review, I desperately tried to draw a clear line that could separate all the different RL approaches to reasoning. I never found one. Most methods overlap, reuse the same tricks, or differ only in detail.

That left me with just one dimension to classify them by: computational complexity.

Through that lens, RL methods for reasoning fall into two broad camps:

1. Search-Based Methods

Also known as the “let’s throw more compute at it” approach.

Here, at every step of the reasoning path, the model generates many possible answers and evaluates them all. Imagine a tree growing inside a house, trying to find a way through a small hole in the roof. At each branch, the tree splits again and again, reaching in every direction, until one lucky branch finally makes it out.

That’s what search-based methods do: grow as many branches as possible, hoping one finds the light. Quantity over efficiency.

These approaches are best represented by the AlphaZero family of methods and are conceptually close to actor–critic methods.

Powerful? Absolutely. But also incredibly expensive. When the tree is an LLM, every branch is a multi-billion-parameter computation. You can afford that for small puzzles, not for models already burning megawatts.
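A bit of back-of-the-envelope arithmetic shows why. The numbers below are hypothetical (4 candidate thoughts per step, 6 reasoning steps), and the sketch assumes every branch is expanded, which real search methods try to avoid by pruning.

```python
def tree_nodes(branching, depth):
    """Nodes expanded below the root: b + b^2 + ... + b^depth."""
    return sum(branching ** level for level in range(1, depth + 1))

# Each node is one model call in this toy accounting.
search_calls = tree_nodes(branching=4, depth=6)
single_path_calls = tree_nodes(branching=1, depth=6)
print(search_calls, single_path_calls)
```

Even in this small setting the full tree needs thousands of model calls where a single reasoning path needs six.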

2. Reward Supervision

The other approach focuses on quality over quantity. Instead of exploring every possible path, we try to guide the model toward the right one using feedback.

Back to the tree analogy: instead of growing blindly in every direction, this one follows the sunlight shining through the roof hole. It’s not as exhaustive, but it’s smarter: each new step is informed by where the light seems stronger.

Algorithms like PPO and GRPO belong here. These methods are iterative and incremental: the model acts, gets feedback, adjusts, and acts again.

Search-based methods are often too costly to scale to LLMs, which already push the limits of available compute. So, in this post, we’ll follow the more sustainable path, reward supervision, and see what’s inside.

Reward Supervision

As we said, reward supervision is about guiding the reasoning process through feedback. The key question is: how often should we give that feedback?

Feedback at Every Step

Let’s go back to our lasagna.

In the Gordon Ramsay version, the chef keeps yelling advice at every step. That’s process supervision: feedback all along the way, not just at the end.

You can picture it in two ways. Either you have a value network, your personal Gordon judging each move you make, or preference tuning, where you’ve read a thousand recipe books and learned what “good” looks like from examples.

The idea is simple: correct the process as it happens. The problem is obvious: someone has to know when each step is right or wrong, and that’s not always clear. Is the sauce too thick? Is the oven too hot? You can’t always pause mid-cooking to ask the oracle.

Process supervision is precise but expensive. It works when you know what “good” means at every step, less so when you don’t.

Feedback at the End

We never actually learned what your guests thought of the lasagna. Well, lucky you, they liked it!

That’s outcome supervision: you only give feedback at the end. This is where PPO and GRPO live.

But all that glitters is not gold.

Let’s say that while cooking, you developed a strange habit: every time you added an ingredient, you stopped to clean the spoon. By the end, the lasagna turns out great. Your guests are impressed. And now, in your mind, washing the spoon twenty times is part of what made it work.

That’s the issue with outcome supervision. You don’t really know which actions helped and which just slowed you down. You can end up reinforcing the wrong behaviors simply because they happened before success. The model, too, might learn to clean the spoon instead of fixing the sauce.
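Here is the spoon problem as a small Python sketch. The step names and the judge’s scores are invented for illustration; the point is only to contrast how credit is spread under the two kinds of feedback.

```python
steps = ["make ragu", "wash spoon", "layer pasta", "wash spoon", "bake"]

def outcome_credit(steps, final_reward):
    """Outcome supervision: one reward at the end, shared by every step before it."""
    return [(step, final_reward) for step in steps]

def process_credit(steps, step_scores):
    """Process supervision: each step gets its own score from a judge."""
    return [(step, step_scores[step]) for step in steps]

guests_loved_it = 1.0
# Hypothetical per-step judgments from a Ramsay-style judge.
judge = {"make ragu": 1.0, "wash spoon": 0.0, "layer pasta": 1.0, "bake": 1.0}

outcome = outcome_credit(steps, guests_loved_it)
process = process_credit(steps, judge)
print(outcome)
print(process)
```

Under outcome credit, washing the spoon gets exactly as much praise as making the ragu. Only the step-by-step judge can tell them apart.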

Are you lost?

Don’t be!

So far, we’ve covered the essentials:

How reasoning in LLMs can be seen as a loop of state, action, reward.
How feedback shapes behavior: sometimes at every step, sometimes only at the end.
Why too much compute (search-based methods) doesn’t always mean better reasoning.

The big picture? We’re teaching machines not just to act, but to learn from their own actions.

Ready? Let’s see how that actually works, starting with PPO.

The Two Flavors of Feedback

Now that we understand outcome supervision, it’s time to look at the two most famous acronyms in this space: PPO and GRPO.

PPO

PPO stands for Proximal Policy Optimization. Let’s break it down.

Policy: I never mentioned this explicitly back in RL 101, but when agents act, they do so through a policy. You can think of it as a set of beliefs about the world, a personal rulebook that helps you decide what to do next.
Optimization: Easy one. This just means we’re trying to improve that rulebook over time, finding the best set of actions to reach our goal. Think of it as refining your lasagna recipe: a little less salt, a little more sauce, each version slightly better than the last.
Proximal: This is the tricky part. Proximal to what? In short: you want your updates to be close to what already worked before. Don’t throw away your recipe entirely. Small, careful adjustments keep the model stable. You don’t suddenly start making tiramisu in the middle of cooking lasagna.
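For the curious, the “proximal” idea can be sketched in a few lines of Python. This follows the clipped objective from the PPO paper, reduced to a single action. The function name and the example probabilities are made up for illustration, `clip_eps=0.2` is just a commonly used value, and `advantage` stands for how much better than expected the action turned out.

```python
import math

def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Clipped surrogate for one action: min(ratio * A, clip(ratio) * A)."""
    # How much more (or less) likely the new policy makes this action.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# The action's probability moves from 0.5 to 0.55: a small, allowed step.
small_step = ppo_clipped_objective(math.log(0.55), math.log(0.5), advantage=1.0)
# The probability jumps from 0.5 to 0.9: the gain is capped at 1 + clip_eps.
huge_step = ppo_clipped_objective(math.log(0.9), math.log(0.5), advantage=1.0)
print(small_step, huge_step)
```

Once the ratio leaves the clipping range, the objective stops growing, so there is nothing to gain from rewriting the whole recipe in one update.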

So here we are again, chasing that idea of the “best” action, the same logic behind GES. We still need to evaluate each step and decide which actions to reinforce. How do we do that?

Advantage

The answer is advantage (it’s there in the title for a reason).

Advantage is the “extra goodness” of a chosen action compared to what you’d normally expect. It helps the agent remember (through policy updates) which decisions turned out better (or worse) than usual.

How do we calculate it, you ask? That’s where things get interesting.

Let’s stay with our faithful lasagna example. You’re experimenting with oven temperatures: 10 °C, 150 °C, and 500 °C. You bake three trays. One’s raw, one’s perfect, one’s charcoal. Clearly, 150 °C wins. Easy, right? That’s naïve RL: take the direct reward of each action and call it a day.

But what happens next time, when your options are 110 °C, 160 °C, and 170 °C? Your previous results don’t help much… you need a way to generalize what “better” means.

1. Subtract a baseline

Suppose, on average, your lasagnas score about 6 out of 10 on the satisfaction scale. If today’s batch earns an 8, its advantage is +2. If it’s a disappointing 4, the advantage is –2. You’re no longer comparing each lasagna in isolation, you’re comparing it to your usual outcome. This keeps learning focused on what’s better than normal, filtering out random good or bad luck (like the time you accidentally bought premium mozzarella).
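In code, the baseline trick is a single subtraction. A minimal sketch, using the scores from the example above (the function name is made up):

```python
def advantages_with_baseline(rewards, baseline):
    """Advantage = how much better (or worse) than usual each outcome was."""
    return [reward - baseline for reward in rewards]

scores = [8, 4, 6]  # today's lasagnas, scored out of 10
advantages = advantages_with_baseline(scores, baseline=6)
print(advantages)
```

A perfectly average lasagna gets an advantage of zero, so it neither reinforces nor discourages whatever you did to make it.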

2. Use a value function

But averages don’t last forever. Maybe you got a new oven, or you’ve learned to layer the béchamel properly. Your baseline should evolve too. That’s where the value function comes in: a model that estimates how promising your current setup already is before you start baking. It looks at the state of the kitchen (ingredients, oven type, your stress level) and predicts the likely outcome. It’s an adaptive baseline rather than a static one.
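A toy version of that adaptive baseline, as a sketch only: in practice the value function is a neural network, while here it is a hypothetical lookup table with one running estimate per kitchen state. The class name, the learning rate, and the starting expectation of 6 are all assumptions made for illustration.

```python
class KitchenValue:
    """A tiny tabular value function: one running estimate per kitchen state."""

    def __init__(self, learning_rate=0.5, initial=6.0):
        self.learning_rate = learning_rate
        self.initial = initial      # expectation for states we have never seen
        self.estimates = {}

    def predict(self, state):
        return self.estimates.get(state, self.initial)

    def update(self, state, reward):
        """Return the advantage for this bake, then move the estimate toward the reward."""
        advantage = reward - self.predict(state)
        self.estimates[state] = self.predict(state) + self.learning_rate * advantage
        return advantage

value = KitchenValue()
first = value.update("new oven", 8.0)   # expected 6.0, got 8.0
second = value.update("new oven", 8.0)  # the expectation has already moved up
print(first, second, value.predict("new oven"))
```

The same score of 8 is worth less the second time, because the baseline has learned that the new oven is simply better.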

3. Generalized Advantage Estimation (GAE)

Finally, some results appear instantly (you can see the sauce bubbling just right) while others only reveal themselves later, when the dish cools and you take that first bite. Generalized Advantage Estimation (GAE) combines both: it smooths the signal between short-term and long-term feedback, reducing randomness without losing detail. It’s like judging not only how good each layer looks now, but also how the whole lasagna tastes once it’s done.
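For readers who want the mechanics, here is a compact Python sketch of GAE for one finished episode. It uses the standard recursion from the GAE paper; the reward and value numbers are made up, and the value after the last step is assumed to be zero because the episode has ended.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finished episode.

    rewards[t] is the reward after step t; values[t] is the value estimate before step t.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step surprise
        running = delta + gamma * lam * running              # blend in later surprises
        advantages[t] = running
    return advantages

rewards = [0.0, 0.0, 1.0]  # the only feedback arrives at the first bite
values = [0.5, 0.5, 0.5]   # hypothetical value estimates for each step
one_step = gae(rewards, values, gamma=1.0, lam=0.0)
full_return = gae(rewards, values, gamma=1.0, lam=1.0)
print(one_step, full_return)
```

With `lam=0.0` only the last step gets credit for the good bite; with `lam=1.0` every step shares it. Values in between are the "blend" the text describes.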

In short:

| Method | What it means | In our kitchen |
| --- | --- | --- |
| Naïve RL | Use the raw reward | Taste each lasagna and pick the best one |
| Subtract baseline | Compare to average | “Better than my usual 6/10?” |
| Value function | Predict expected outcome | “This recipe looks promising already” |
| GAE | Blend short- and long-term rewards | “Smells great and tastes great after cooling” |

Each step makes learning less noisy and more stable.

Back to PPO

So, going back to PPO, the actual process looks something like this:

You start from a given state (just beginning your lasagna).
You use your knowledge of lasagna making (your policy) to take an action. Say, preheating the oven.
Gordon Ramsay tells you that preheating was indeed a good choice. You then estimate the advantage of that action compared to not preheating, using GAE.

That’s essentially what’s happening in the diagram below (taken from the GRPO paper).

Now, let’s see how that changes with GRPO.

GRPO

You might notice that PPO and GRPO sound suspiciously similar. That’s because GRPO stands for Group Relative Policy Optimization. We already know what policy optimization means, so what’s this group-relative part about?

Well, up to this point we’ve kept talking about imagining multiple possible future actions, but PPO only ever tries one at a time. That’s what GRPO does differently.

Here’s how it works. GRPO tries out several possible actions at once and assigns a reward to each. It then compares those rewards and updates the policy based on how each action performed relative to the rest of the group.

Sounds familiar? It should. It’s just like our “I am all chefs at once” example, where you cooked 50 mini lasagnas and kept the best one. Same spirit, different kitchen.
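The “group relative” part fits in a few lines of Python. This is a minimal sketch of the normalization step only, not of the full algorithm: each reward is compared to the mean of its own group and divided by the group’s spread, so the group itself plays the role of the baseline instead of a separate value network. The scores are invented, and the use of the population standard deviation and the small `eps` are assumptions of this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group: (reward - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical taste scores for four mini lasagnas from the same starting point.
scores = [2.0, 4.0, 6.0, 8.0]
advantages = group_relative_advantages(scores)
print(advantages)
```

Lasagnas above the group average get a positive advantage, those below get a negative one, and if every lasagna tastes the same nobody learns anything.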

And honestly, that’s most of what there is to it.

Recap

Before we move on to the final part, let’s take a quick breath and see where we are. Main takeaways so far:

The goal is always to predict the value of the next action.
You can explore many possible actions (search-based) or improve gradually with feedback (reward supervision).
Feedback can come during the process or at the end.
In policy-based RL, we use advantage to estimate how good an action was.
Since advantage can fluctuate wildly, researchers invented clever ways to stabilize it.
The latest and shiniest of these methods (for now) is GRPO.

So, are we done? Not quite.

The elephant in the room

Can you hear that? You tried to ignore it, but the rumble behind you is getting too loud.

There’s an elephant in the room, and its name is Spurious Rewards. This paper came out in June 2025 and made quite a bit of noise (like the elephant).

Remember the limitation I mentioned earlier about outcome supervision? How a model can reach the right answer even with the wrong reasoning steps? The authors decided to push that to the extreme, and, surprisingly, it worked.

They trained models with GRPO, but instead of giving meaningful rewards, they just rolled dice. Did you preheat the oven or summon Cthulhu with a candle? Doesn’t matter.

If the roll came out positive, congrats, your lasagna just improved.

The shocking part? The Qwen model family still got better.

To understand why, we need to leave our lasagna behind for a moment.

“Qwen” refers to a family of models from a single lab, Alibaba (not DeepSeek, the lab that introduced GRPO). Like siblings raised in the same city, Qwen models share a similar “upbringing”: the same data, tools, and training habits. One of those habits is that, when solving math problems, they tend to reason in code.

Ask, “What’s 4 + 10?” and Qwen quietly reaches for its inner calculator: it writes a few lines of Python in its head, without actually running them, and returns 14.

So, what’s the deal? Why focus on Qwen? Because when the authors of the paper applied the same random-reward trick to other model families (like LLaMA from Meta), it failed completely. In some cases, it made them worse.

That’s the key point. The same random reinforcement that made Qwen stronger made other models weaker.

What does that tell us?

The authors suggested that GRPO wasn’t teaching the model to reason better, but to use what it already knew more effectively. Think of it like cooking: if you already know your ingredients by heart, even random encouragement helps you find shortcuts, but you’re not learning any new recipes.

What We Can Infer

Two lessons here. First, when we teach a model to reason in token space (basically writing its thoughts step by step) we risk ending up with a system that’s just summarizing what it already knows instead of truly reasoning. Second, model family matters. If an algorithm works wonders for one lineage, don’t assume it’ll do the same for another.

So if GRPO didn’t really make Qwen think better, maybe reasoning itself needs a new ingredient: something that happens between steps, not just at the end.

Where We Go From Here

Now that we’re all familiar with the basics, it’s time for me to tell you about the pattern I see.

Recurrence and Reasoning

If you think about it, reasoning isn’t a single leap, it’s a chain. One step follows another, and each new thought depends on the previous one making sense. You can’t just jump from “tomato” to “quantum field theory” without building the bridge in between. That continuity, that ability to connect steps through time, is what we call recurrence.

Recurrent systems don’t just react; they remember⁵. They carry context forward, refine their internal state, and adjust their trajectory with every iteration. Humans do this effortlessly: we loop through ideas until they make sense. Language models, on the other hand, mostly don’t. They predict one token, then move on, forgetting the internal dynamics that led there. That’s fine for fluent text, but disastrous for reasoning, where every decision depends on what came before.

So if we want models that actually think through a problem instead of spitting out the next probable word, we need recurrence.

Spatial and Temporal Recurrence

There are two ways to get there:

Temporal recurrence means carrying information through time. That’s the classic approach, similar to old-school RNNs or humans keeping a mental thread while solving a problem.
Spatial recurrence means revisiting the same internal representation multiple times before moving on. It’s like rewriting the same paragraph until it finally says what you meant all along.
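The two flavors can be contrasted with a toy Python sketch. Both functions are deliberately simplistic and entirely hypothetical: numbers stand in for internal representations, and the `target` in the second function is something a real model would not have access to. It is there only so the refinement has something to converge toward.

```python
def temporal_recurrence(inputs, state=0.0, decay=0.5):
    """Carry one hidden state forward through a sequence, RNN-style."""
    for x in inputs:
        state = decay * state + x  # the old state fades, the new input is added
    return state

def spatial_recurrence(representation, target, passes, step=0.5):
    """Refine the same representation several times before moving on."""
    for _ in range(passes):
        representation = representation + step * (target - representation)
    return representation

carried = temporal_recurrence([1.0, 1.0, 1.0])
rough = spatial_recurrence(0.0, target=1.0, passes=1)
polished = spatial_recurrence(0.0, target=1.0, passes=4)
print(carried, rough, polished)
```

In the first function the loop runs over new inputs; in the second it runs over the same thought, which gets a little closer to what it should be on every pass.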

Here’s the interesting part: spatial recurrence is⁶ diffusion. Diffusion models start from noise and refine it step by step until it turns into something meaningful. Each pass brings the system closer to a coherent, well-formed output.

That’s reasoning too: taking a messy thought and polishing it until it makes sense.

And the best part? You can get this diffusion-like recurrence without changing the architecture. The same transformer can learn to “think twice” (or ten times) simply by changing how it’s trained. We’ve already seen this in other domains: reinforcement learning with diffusion has been explored extensively for images. What’s missing is applying the same principle to reasoning.

If we treat reasoning as another modality, something that can be iteratively denoised into clarity, we can reuse all the tools we already have. Instead of denoising pixels, we’d be denoising thoughts.

Latent reasoning

Until now, everything we’ve talked about happens in token space, the world of words. That’s the visible part of reasoning: the model writing things down step by step, like showing its homework. But that’s not where real thinking happens. When you reason, you don’t narrate every step out loud, most of it stays in your head. That inner, silent process is what we call latent reasoning.

Inside a model, that “head” is the latent space: a world built from data, not experience. And that difference matters. Because when a model is trained, it doesn’t live in that world; it just builds it from observation. It’s like if someone tried to design your bedroom just by looking at millions of bedrooms online. They’d get the general idea right (a bed, a lamp, a closet) but the details would feel… off.

Maybe the closet is slightly too far, the pillow textures don’t match, and the socks are under the desk for no reason. Everything is there, but the geometry is uncanny.

That’s what a pretrained model’s latent space looks like: statistically correct, structurally weird. It looks right, but it’s not necessarily usable.

Now imagine someone gives you a task: every morning, you must find your socks. You wake up, stumble around, and realize it’s a mess. So, the next day, you move the socks closer to the bed. Then you fix the lighting. Then you put the chair where it’s actually useful. Over time, you start changing the layout; not because you were told to, but because it helps you solve your daily task faster. You’re searching through your room and reshaping it in the process.

That’s what the Latent Program Network paper did. They used search to actively shape the latent space: giving the model repeated tasks like “find the socks” and letting the structure reorganize itself so that important things become easier to reach. This approach earned a paper award at the ARC Prize 2024 competition (ARC being the Abstraction and Reasoning Corpus), showing that structured search can make latent spaces more logical and navigable.

But here’s the next step. What if instead of being asked to look for socks every morning, you actually lived in the room? Not just searching for things, but moving, adapting, learning from the environment continuously. That’s where reinforcement learning comes in. RL wouldn’t just tidy the room for one task, it would let the model inhabit it, adjusting the space through experience until it feels natural to think there.

Wrapping Up

So, what did we learn? Let’s ask Le Chat to make a poem for us:

You taught them words, but not the way,
to plan the meal, or weigh, or stay.
To shop for thoughts, to taste, to try,
to ask themselves the reason why.

Like lasagna layers, step by step,
they learn to build, to pause, to prep.
Not just to say what’s next in line,
but turn the heat, adjust, refine.

PPO would cheer, "That’s almost right!"
GRPO would sample through the night.
Yet both still chase the final grade,
not how the path itself was laid.

For reasoning’s not just the end,
but every twist, each fork, each bend.
To teach a mind to truly know,
is to let it pause, and let it grow.

That’s cute. Thanks for sticking around until the end.

Want to Revisit Something?

Before We Begin — The story behind this post
Why Bother Teaching Machines to Think? — Why reasoning matters
RL 101 — A crash course in learning by doing
How RL Teaches Machines to Reason — The lasagna saga
From PPO to GRPO — Two ways to learn from mistakes
The Elephant in the Room — When chaos works
Future Research — Where we go from here

Sources and Further Reading

When I started compiling this review, new papers were coming out every week. At some point, I had to draw a line. So I stopped around August 20, 2025, with one late exception: a recent paper that did such a solid job summarizing the field that it deserved a spot.

Zhang, Guibin, et al. “The landscape of agentic reinforcement learning for llms: A survey.” arXiv preprint arXiv:2509.02547 (2025).

Below is a selection of the works that shaped this post.

Dibyanayan Bandyopadhyay, Soham Bhattacharjee, and Asif Ekbal. Thinking machines: A survey of llm based reasoning strategies. arXiv preprint arXiv:2503.10814, 2025.
Dengsheng Chen, Jie Hu, Xiaoming Wei, and Enhua Wu. Denoising with a joint-embedding predictive architecture. arXiv preprint arXiv:2410.03755, 2024.
Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025.
Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, et al. Part i: Tricks or traps? a deep dive into rl for llm reasoning. arXiv preprint arXiv:2508.08221, 2025.
Matthew V Macfarlane and Clément Bonnet. Searching latent program spaces. arXiv preprint arXiv:2411.08706, 2024.
Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, and Thomas Back. Reasoning with large language models, a survey. arXiv preprint arXiv:2407.11511, 2024.
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947, 2025.
Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv preprint arXiv:2505.18454, 2025.
Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797, 2023.
Guanghao Zhou, Panjia Qiu, Cen Chen, Jie Wang, Zheming Yang, Jian Xu, and Minghui Qiu. Reinforced mllm: A survey on rl-based reasoning in multimodal large language models. arXiv preprint arXiv:2504.21277, 2025.
Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, and Jason Eshraghian. A survey on latent reasoning. arXiv preprint arXiv:2507.06203, 2025.

Notes

If you’re interested, read my thesis: Conversational agents in human-machine interaction: reinforcement learning and theory of mind in language modeling.
Didn’t Plato say that the only way to make people understand is through analogies?
While “reward” usually sounds positive, in RL it simply means feedback, good or bad, that helps the agent learn what to do next.
Interestingly, the chance of this happening isn’t zero, just 1 in 10^(10^36).
In psychology and cognitive science, this distinction echoes Daniel Kahneman’s System 1 and System 2 thinking. System 1 is fast, reactive, and automatic: great for pattern recognition but poor at deliberate reasoning. System 2, by contrast, is slow, recursive, and reflective: it keeps track of previous steps and consciously revises them. Recurrence in models plays a similar role: it’s what turns reactive prediction (System 1) into sustained reasoning (System 2).
It’s not exactly the same thing, but close enough.

The Digital Poisoners: How Russian Propaganda Networks Are Corrupting AI Training Data

Nicolo' Brandizzi — Wed, 23 Jul 2025 00:00:00 GMT

TLDR; Russian propaganda networks are actively trying to corrupt AI training data. My investigation into “Pravda” (Russian for “truth”) related domains found millions of propaganda documents in public web archives, and even in supposedly “clean” datasets. This post details my findings and outlines steps for dataset creators, AI developers, and policymakers to combat this new form of information warfare.

📍 Originally published at nicolobrandizzi.com.

The Investigation Begins

A few weeks ago, I stumbled upon this article from The Bulletin of the Atomic Scientists. The headline was alarming: Russian networks were flooding the internet with propaganda specifically designed to corrupt AI chatbots. As someone who works on data collection and training of LLMs¹, this immediately caught my attention.

The article described a sophisticated campaign using hundreds of fake news websites, all variations of “Pravda” (Russian for “truth”), pumping out disinformation at an industrial scale. But I wondered: was this propaganda actually making it into the datasets we use to train AI models?

I had to find out. Fortunately, I had Common Crawl² dumps lying around. Time to dig in.

Following the Trail

My investigation started simply: searching for any URL containing “pravda” in the Common Crawl dumps from 2013 to 2024. But I quickly realized I needed a more comprehensive approach. Using domain lists compiled by DFRLab and this excellent Recorded Future report, I assembled a list of 306 domains associated with the Pravda disinformation network.

Armed with this list, I processed 29.1 billion documents from Common Crawl, using 1,482 CPU hours to filter through the data. The domains revealed a clear evolution:

What you’re seeing is a roughly 500-fold increase in propaganda³ content since 2013, with the steepest growth occurring after Russia’s 2022 invasion of Ukraine. In absolute numbers:

2013: 3,783 documents
2020: 879,992 documents
2024: 1,966,501 documents

That’s nearly 2 million documents in 2024 alone, totaling 75GB of poisoned text across all years. To put this in context, 75GB of text is roughly equivalent to 37,500 novels or the entire English Wikipedia… thrice.
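For readers who want to try the filtering step themselves, here is a minimal sketch of matching URLs against a domain list. It is an illustration, not the pipeline behind the numbers above: the two domains are placeholders (the real list held 306 entries), the function names are made up for this example, and a real run would stream Common Crawl records instead of a small Python list.

```python
from urllib.parse import urlsplit

# Placeholder entries (assumption); the real list came from DFRLab and Recorded Future.
BLOCKLIST = {"pravda-example.com", "news-pravda.example"}


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def is_listed(url: str, blocklist: set = BLOCKLIST) -> bool:
    """True if the URL's host is a listed domain or a subdomain of one."""
    parts = host_of(url).split(".")
    # Check the host and each parent domain: a.b.c -> a.b.c, b.c, c
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))


def count_by_year(records):
    """Count listed documents per crawl year from (year, url) pairs."""
    counts = {}
    for year, url in records:
        if is_listed(url):
            counts[year] = counts.get(year, 0) + 1
    return counts


sample = [
    (2013, "https://pravda-example.com/x"),
    (2024, "http://de.news-pravda.example/y"),
    (2024, "https://example.org/"),
]
print(count_by_year(sample))
```

Matching on whole domain labels instead of a plain substring search keeps unrelated hosts that merely contain the same characters from being counted.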

While widespread, the dissemination of Pravda network documents was highly targeted by language. The breakdown of the most common languages reveals the strategy⁴:

| Language | Percentage |
| --- | --- |
| Ukrainian | 34.11% |
| Slovak | 33.57% |
| Portuguese | 9.37% |
| German | 6.80% |
| Czech | 5.36% |

The high percentages in Ukrainian and Slovak suggest an effort to influence narratives directly within or near the conflict zone, and in countries with strong historical or geographical ties. The presence of Portuguese, German, and Czech content indicates a broader strategy to shape perceptions across Europe.

Unmasking the Propaganda

To understand what narratives were being pushed, I needed to analyze the actual content. Using Google’s Gemini embedding model⁵, I compared the documents against propaganda seed terms across seven categories:

Pro-Kremlin narratives (“special military operation”, “liberation of Donbas”)
Anti-Western rhetoric (“NATO aggression”, “Western decadence”)
Disinformation tropes (“staged massacre”, “false flag operation”)
Anti-Ukraine propaganda (“Nazi regime in Kyiv”, “corrupt elites in Kyiv”)
Ideological messaging (“Christian Orthodox roots”, “spiritual decay of Europe”)
Historical revisionism (“Crimea has always been Russian”, “NATO betrayed agreements”)
Conspiracy theories (“deep state”, “New World Order”)

The semantic analysis revealed something: this isn’t random spam. The content clustered into 17 distinct narrative groups, with clear thematic coherence. The propaganda is carefully crafted to reinforce specific worldviews and talking points.

The word cloud above illustrates the most dominant cluster, highlighting the highly politicized and conflict-oriented vocabulary being pushed. Common terms include “ukraine”, “war”, “military”, “nato”, “europe”, “west”, and “nazi.”
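The seed-term comparison described in this section can be sketched in a few lines. This is a toy version under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model (it is not Gemini and captures no semantics beyond shared words), and only two of the seven categories are included.

```python
import math
from collections import Counter

# Seed terms quoted from two of the seven categories listed above.
SEEDS = {
    "pro_kremlin": ["special military operation", "liberation of Donbas"],
    "anti_western": ["NATO aggression", "Western decadence"],
}


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_category(doc: str, embed=toy_embed):
    """Return (category, score) for the seed category most similar to the document."""
    doc_vec = embed(doc)
    scores = {
        cat: max(cosine(doc_vec, embed(term)) for term in terms)
        for cat, terms in SEEDS.items()
    }
    top = max(scores, key=scores.get)
    return top, scores[top]


print(best_category("the special military operation continues"))
```

With a real embedding model, `embed` would return dense vectors and the same cosine comparison would apply; a similarity threshold (not specified in this post) would then decide which documents count as matches.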

Beyond Common Crawl: High-Quality Datasets

Common Crawl is intentionally unfiltered: it’s meant to capture the web as it is, warts and all. But what about the carefully curated datasets that AI companies use for training? I decided to check Nemotron⁶, a high-quality dataset created by NVIDIA.

The results were mixed:

In Nemotron’s low-quality partition: 2 documents per million, 6,920 documents contaminated
In Nemotron’s high-quality partition: 30 documents per million, 21,998 documents contaminated

It’s particularly interesting that the “high-quality” partition of Nemotron showed more than ten times the contamination rate of the “low-quality” one. This finding suggests that even sophisticated filtering methods may be struggling to identify and remove subtle propaganda. Even in supposedly “clean” datasets, the propaganda persists. While the percentages are lower, we’re still talking about tens of thousands of documents that made it through quality filters.
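To make the comparison concrete, the quoted figures can be turned into a ratio and into implied partition sizes. The partition sizes are back-calculated here from rounded rates, so treat them as rough estimates, not published numbers.

```python
def implied_total(contaminated: int, rate_per_million: float) -> float:
    """Back out the partition size implied by a count and a per-million rate."""
    return contaminated / rate_per_million * 1_000_000


low_rate, high_rate = 2, 30  # documents per million, as quoted above

rate_ratio = high_rate / low_rate           # how much denser the high-quality partition is
low_total = implied_total(6_920, low_rate)    # implied size of the low-quality partition
high_total = implied_total(21_998, high_rate)  # implied size of the high-quality partition

print(f"rate ratio: {rate_ratio:.0f}x")
print(f"implied low-quality docs:  {low_total:,.0f}")
print(f"implied high-quality docs: {high_total:,.0f}")
```

The ratio of the two rounded rates is 15, which is consistent with the “more than ten times” statement above.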

Why This Matters

You might think: “So what? It’s less than 1% of the data. Can this really affect an AI model?”

The answer is yes, and here’s why. Training a large language model isn’t like teaching a human – it’s more like creating an incredibly sophisticated pattern-matching system. The model learns statistical associations between words, concepts, and narratives. When propaganda content appears thousands of times with consistent messaging, it creates strong statistical associations that the model learns to reproduce.

Consider this: if even 0.1% of a model’s training data consistently associates “Ukraine” with “Nazi regime” or “NATO” with “aggression”, the model learns these associations. When users later ask about these topics, the model might subtly reflect these biases in its responses.

It’s not about the model explicitly saying “Ukraine is run by Nazis” – it’s about subtle biases in word choice, framing, and the kinds of information the model considers relevant. These biases can shape how millions of users understand current events.

A New Battlefield

What we’re witnessing is information warfare adapted for the AI age. Instead of targeting human minds directly through social media or news outlets, this campaign targets the training data of AI systems that will shape how future generations access and understand information.

The Bulletin article aptly calls this “LLM grooming” – and the term is accurate. Just as grooming involves the gradual manipulation of a victim’s perception of reality, these propaganda networks are slowly, methodically shaping how AI models understand the world. It’s a patient, long-term strategy. The Pravda network has been operating since at least 2013, slowly seeding the internet with content. They’re not trying to convince you or me – they’re trying to influence the AI systems that our children will rely on for homework help, news summaries, and understanding the world.

What Can Be Done?

This investigation reveals both the scale of the problem and potential solutions:

For AI Developers

Audit your data sources: Don’t assume curated datasets are clean. As I showed, even high-quality datasets contain propaganda.
Implement bias testing: Propaganda content clusters together. Identify and review suspicious clusters before including them in training data.
Don’t ignore non-English content: The focus on Ukrainian and Slovak content shows that attackers target specific languages. Every language needs proper content moderation.
Transparency: Document data sources and filtering methods. Users have a right to know what went into training their AI.

For Policymakers

Treat data poisoning as a security threat: This is not just about content moderation; it’s about protecting the integrity of AI systems that will shape public discourse.
Support research: We need better tools for detecting coordinated disinformation campaigns in training data.
Mandatory documentation of training data: We need to know if our AI systems have inherited biases from poisoned data. The EU AI Act already requires high-risk AI systems to document their training data sources and quality measures – this should become standard practice globally.

The Investigation Continues

This analysis represents just the tip of the iceberg. We’ve identified one network, but how many others exist? How much propaganda has already been incorporated into the AI models we use today?

Let’s be clear about what we’re dealing with here: this is not accidental misinformation or biased reporting. This is an intentional, coordinated campaign designed to poison public discourse. We already live in increasingly polarized societies, where people struggle to find common ground and shared truth. These propaganda networks are deliberately exploiting our divisions, amplifying conspiracy theories, and fracturing our ability to have productive conversations about real challenges. When these poisoned narratives get embedded into AI systems – systems that millions will rely on for information – the potential for societal harm multiplies exponentially.

What started as curiosity about a news article revealed a sophisticated operation that’s been poisoning the well of AI training data. The Pravda network shows us that in the age of AI, controlling information means controlling not just what people read today, but how machines will understand and explain the world tomorrow.

Notes

I worked in the OpenGPT-X project where we trained the multilingual LLM named Teuken.
Common Crawl is a massive repository of web crawl data that’s freely available to researchers. It contains petabytes of data collected since 2008 and is the foundation for training many large language models, including GPT-3, LLaMA, and others. Essentially, if you’re training a large language model, you’re probably using Common Crawl data.
While my initial filtering identified millions of documents from these domains, it’s important to recognize that not all content associated with “Pravda” domains is necessarily propaganda. These sites sometimes host legitimate news, cultural articles, or re-post content that isn’t overtly propagandistic. Through further analysis using semantic embedding, I determined that approximately 2.68% of the Pravda-related documents identified could be classified as hardcore propaganda.
You might notice English is missing from this list. That’s not because the Pravda network ignores English. I simply couldn’t analyze English-language content due to computational constraints ¯\_(ツ)_/¯
An embedding model converts text into numerical representations that capture semantic meaning. This allows us to mathematically compare how similar different texts are to known propaganda narratives.
Nemotron is a curated dataset where low-quality content has been filtered out using various quality metrics. It’s the kind of “clean” dataset that companies might use when they want higher quality training data than raw Common Crawl provides.

AI Gigafactories: Europe’s €20B Race and the Site Selection Challenge

Nicolo' Brandizzi — Wed, 02 Jul 2025 00:00:00 GMT

TLDR; The EU’s call for AI Gigafactories received 76 proposals, creating a massive site-selection challenge. I explored this problem and built a tool, TerraSelect, to help. This post explains the difference between AI Factories and Gigafactories, and introduces a tool designed to make the first step of this “gold rush” a little more scientific.

Introduction

A few weeks ago, while scrolling through my LinkedIn feed, I stumbled upon a piece of news that felt like a starting pistol for a race: the European Commission announced an “overwhelming response” to its AI Gigafactories initiative¹. A staggering 76 expressions of interest from 16 member states, proposing 60 different sites.

My first thought was, “Wow, this is the kind of momentum Europe needs.” My second thought was… “How on earth do you choose?”

Picking a winner here is a hell of a job. It means untangling a massive web of logistics, environmental rules, and economic trade-offs to find just a handful of places that can truly anchor Europe’s AI ambitions. My goal here is to offer a data-driven perspective that could help policymakers in this crucial first step. It’s a classic site-selection problem, but at a continental scale. This challenge is exactly what sparked my curiosity and led me down a rabbit hole of mapping, data analysis, and eventually, building a tool to make sense of it all.

📍 Originally published at nicolobrandizzi.com.

First, What’s a “Giga”factory?

In a previous post, I took a deep dive into the concept of AI Factories. I described them as centralized hubs—physical places combining supercomputers, datasets, and expertise to serve startups, researchers, and industries.

So, what makes a factory “Giga”?

In short: scale and ambition. While an AI Factory is designed to train general-purpose AI models, an AI Gigafactory is a hyperscale facility purpose-built to develop, train, and deploy the next generation of AI—models with hundreds of trillions of parameters. Think of a flashlight versus a lighthouse. One lights your path; the other guides entire fleets safely to shore.

With 76 entities eager to build or host such a complex (and only three to five sites planned), the EU has a problem on its hands. Choosing the right locations is critical; a misstep could mean wasted resources and a missed opportunity to compete on the global stage.

TerraSelect: A Tool for Smart Exclusion

This is where my weekend project comes in. Faced with this monumental task of site selection, I built a tool I’m calling TerraSelect. Its purpose is NOT to pinpoint the single “best” location in Europe. Instead, it’s designed to do something arguably more important at this early stage: intelligently exclude the thousands of locations that are obviously wrong.

Before I show you what it can do, let’s talk about what it can’t do.

The Limitations

No tool is perfect, and mine is no exception. I think it’s crucial to be transparent about its limitations:

Outdated Data: Some datasets, while the best publicly available, aren’t perfectly up-to-date. For example, the power plant data still shows nuclear facilities in Germany, even though the last reactors were shut down in April 2023. A real-world analysis would require the most current, often proprietary, data.
The Missing Metric: Excess Electricity: The most significant missing piece is data on excess grid capacity. An AI Gigafactory needs a massive, stable power supply. Knowing where the grid has a surplus of energy is perhaps the single most important factor, but this data is incredibly complex and not publicly available at a granular level.

The Potential

So, with those caveats, what is the tool actually good for?

Its real power lies in making the haystack smaller. The EU covers a vast territory. While there are many potentially suitable places for a Gigafactory, there are far, far more that are fundamentally unsuitable. The tool’s primary function is to act as a coarse filter, clearing away the noise so that experts can focus on a smaller, more promising set of candidates.

It excels at answering basic, but critical, exclusionary questions:

Is this location inside a Natura 2000 protected area? If so, it’s out.
Is it on a steep mountain slope? Out.
Is it in the middle of a dense city, or miles away from any transmission line or major road? Probably out.

By layering these and other constraints, TerraSelect can instantly visualize and disqualify vast swaths of land. It transforms the problem from “Where in Europe should we build?” to “Which of these candidates are actually worth a closer look?” This dramatically simplifies the initial search and allows human experts to apply their deeper knowledge to a much more manageable dataset.
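The layering idea can be sketched as a chain of simple exclusion rules. This is a hypothetical illustration, not TerraSelect’s actual code: the `Cell` fields, the thresholds, and the function names are all assumptions chosen to mirror the questions above.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One candidate grid cell; every field is a hypothetical stand-in for a data layer."""
    name: str
    in_natura2000: bool
    slope_deg: float
    km_to_transmission_line: float
    km_to_major_road: float
    is_dense_urban: bool


# Illustrative thresholds (assumptions), not the values used by TerraSelect.
MAX_SLOPE_DEG = 10.0
MAX_KM_TO_LINE = 20.0
MAX_KM_TO_ROAD = 10.0


def exclusion_reasons(cell: Cell) -> list:
    """Return every reason a cell is excluded; an empty list means it survives."""
    reasons = []
    if cell.in_natura2000:
        reasons.append("inside Natura 2000 area")
    if cell.slope_deg > MAX_SLOPE_DEG:
        reasons.append("too steep")
    if cell.is_dense_urban:
        reasons.append("dense urban area")
    if cell.km_to_transmission_line > MAX_KM_TO_LINE:
        reasons.append("far from transmission line")
    if cell.km_to_major_road > MAX_KM_TO_ROAD:
        reasons.append("far from major road")
    return reasons


def shortlist(cells):
    """Keep only the cells that no rule excludes."""
    return [c for c in cells if not exclusion_reasons(c)]


cells = [
    Cell("flat-field", False, 2.0, 5.0, 3.0, False),
    Cell("protected", True, 1.0, 5.0, 3.0, False),
    Cell("alpine", False, 25.0, 40.0, 30.0, False),
]
print([c.name for c in shortlist(cells)])
```

Returning the reasons instead of a bare yes/no keeps the filter auditable: an expert can see why a site was dropped and challenge the threshold behind it.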

Does it work?

This is the million-euro question. How do you actually know if a tool like this is on the right track? To test it, I used the known locations of the existing AI Factories as a “gold standard.” I fed all my other data layers—like proximity to universities, power plants, and different land types—into a prediction model. The idea was simple: can the model “learn” what makes these locations special and then find similar spots on the map?²

The initial results are in, and they’re… interesting.

The model is good at finding the known AI Factory locations (the yellow stars), what we call good “recall”. The problem is, it’s a bit too enthusiastic. For every correct location it identifies, it also flags hundreds of other spots that aren’t right (low “precision”). To find all 12 real AI Factories, the model suggests a list of over 1500 potential sites.
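For readers less familiar with the two terms, here is the arithmetic behind them, using the figures just quoted. Since the list is “over 1500” sites, 1500 is used as a lower bound, which makes the precision shown here an upper bound.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of flagged sites that are real AI Factories."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of real AI Factories that the model flagged."""
    return true_pos / (true_pos + false_neg)


known_sites = 12      # real AI Factories, all of them found
flagged_sites = 1500  # lower bound on the size of the suggested list

p = precision(known_sites, flagged_sites - known_sites)
r = recall(known_sites, 0)

print(f"precision <= {p:.1%}, recall = {r:.0%}")
```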

This isn’t a failure, though. It’s a starting point. It tells us that while the current data factors are useful, they need refining. The model’s coefficients (the importance it assigns to each layer) give us clues. For instance, some factors are almost always a good sign, while others are consistently negative:

| Key Factor | Consistency of Positive Influence | Average Positive Influence Score | Interpretation |
| --- | --- | --- | --- |
| 🎓 Universities | 100% | Strong | Proximity to talent and research is a non-negotiable driver. |
| 💶 Land Price | 100% | Moderate | Lower agricultural land prices make a location more attractive. |
| ⛰️ Slopes | 99% | Strong | Flat terrain is overwhelmingly preferred. |
| 🏭 Fossil Power Plants | 68% | Weak | A slight positive, likely acting as a proxy for any industrial grid connection. |
| 〰️ Transmission Lines | 54% | Weak | A 50/50 factor; not as critical as being near a transmission line directly. |
| ☀️ Renewable Power | 2% | Weak | Almost always a negative predictor in the top models. |
| 🏞️ Natura 2000 Sites | 1% | Weak | Consistently a negative factor, indicating avoidance of protected areas. |

The data tells us something. The ideal Gigafactory location, according to the model, is a flat, affordable area near a city with a strong university. But what’s really interesting is what’s not a strong predictor. Proximity to renewable power, on its own, seems to be a negative signal in the top models—perhaps because these sites are often in remote locations.

This is exactly the kind of insight TerraSelect is built to explore.

See It for Yourself

I believe that making complex decisions requires accessible tools. While TerraSelect is still a work in progress, it demonstrates a methodology for tackling large-scale site-selection problems.

Update: the interactive demo I was hosting at terraselect.online is no longer online — keeping a continental-scale geospatial app live on personal infrastructure turned out to be more expensive than I could justify. The good news: the full code is now open source, so you can run the explorer locally (or on your own server) with one Docker command.

👉 Code: github.com/nicofirst1/terraselect

The repo includes the Streamlit explorer, the CLI for generating score grids, every loader, and documentation of the upstream data sources. Issues, PRs, and forks are welcome — especially if you have a public EU dataset that would sharpen the suitability signal.

Thanks for reading, and I’d love to hear your thoughts!

Notes

The official announcement came from the EuroHPC Joint Undertaking, highlighting the high level of interest from industry leaders, investors, and Member States.
For the technically inclined: I used a logistic regression model with L1 regularization (Lasso) to evaluate thousands of predictor combinations. I then analyzed the coefficients of the top 100 models (ranked by AUC-ROC) to find which layers most consistently had a positive influence.

Introducing PolAI Graph: Visualizing the EU’s AI Landscape

Nicolo' Brandizzi — Thu, 09 Jan 2025 00:00:00 GMT

In the labyrinth of policies, projects, and organizations shaping the EU’s AI landscape, staying informed can feel overwhelming. To help navigate this complexity, I developed PolAI Graph (short for Policy AI Graph), an interactive visualization tool designed to map and explore connections between EU AI-related entities. Whether you’re a researcher, policymaker, or AI enthusiast, PolAI Graph provides a clear view of how initiatives, policies, and stakeholders interlink, making it easier to understand the bigger picture. If you prefer to see it in action before diving in, check out the interactive graph here.

📍 Originally published at nicolobrandizzi.com.

The Inspiration Behind PolAI Graph

It all started when I looked into the Coordinated Plan on Artificial Intelligence, part of the EU’s broader digital strategy. As I traced links between various projects, I realized these initiatives weren’t just stand-alone efforts. They represented a vast ecosystem of interrelated goals, funding mechanisms, and collaborative networks.

However, making sense of these connections using static documents and reports felt like navigating a maze blindfolded. I wanted something more intuitive—a way to see the connections, click through them, and build a mental map of the EU AI landscape. That’s how PolAI Graph was born.

Features of PolAI Graph

Interactive Visualization: At its core, PolAI Graph is a graph-based visualization tool. Each node represents an AI-related entity—be it a policy, fund, or organization—while links denote relationships, such as funding sources or collaborative partnerships.
Color-Coded Categories: Nodes are categorized into groups like Policies, Projects, Organizations, and Funds, each with a distinct color. This makes it easy to distinguish the roles of various entities at a glance.
Rich Node Information: Clicking on a node opens a sidebar displaying detailed information:
- Full name
- Abbreviation (if applicable)
- Description
- Links to official pages
- Key attributes like budget, start/end dates, and coordinators
Filters for Targeted Insights: Users can filter the graph by category, highlighting only the nodes they’re interested in. For example, you can choose to display only “Funds” (by clicking on the legend) to explore the financial backbone of EU AI initiatives.
Dynamic Updates: The graph is continuously updated to reflect new connections or projects, ensuring it remains a living document of the EU AI landscape.

How It Works

PolAI Graph leverages a D3.js-powered visualization engine, processing JSON data that outlines nodes (entities) and links (connections). This JSON file is openly available here, allowing anyone to explore, verify, or even expand the dataset to include additional connections and entities. By keeping the data open and accessible, PolAI Graph invites collaboration and transparency, ensuring the tool evolves alongside the AI landscape.
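To give a feel for the nodes-and-links data shape, here is a small sketch that loads a graph and filters it by category, similar in spirit to clicking “Funds” in the legend. The field names (`id`, `category`, `source`, `target`, `relation`) and the sample entries are assumptions for illustration; check the actual JSON file for the real schema.

```python
import json

# Hypothetical sample in a nodes/links shape; names and fields are placeholders.
SAMPLE = """
{
  "nodes": [
    {"id": "policy-a", "name": "Example Policy", "category": "Policies"},
    {"id": "project-b", "name": "Example Project", "category": "Projects"},
    {"id": "fund-c", "name": "Example Fund", "category": "Funds"}
  ],
  "links": [
    {"source": "fund-c", "target": "project-b", "relation": "funds"},
    {"source": "policy-a", "target": "project-b", "relation": "governs"}
  ]
}
"""


def filter_by_category(graph: dict, category: str) -> dict:
    """Keep nodes of one category and only the links that connect two kept nodes."""
    keep = {n["id"] for n in graph["nodes"] if n["category"] == category}
    return {
        "nodes": [n for n in graph["nodes"] if n["id"] in keep],
        "links": [
            link for link in graph["links"]
            if link["source"] in keep and link["target"] in keep
        ],
    }


graph = json.loads(SAMPLE)
funds_only = filter_by_category(graph, "Funds")
print([n["name"] for n in funds_only["nodes"]])
```

Dropping links whose endpoints were filtered out avoids dangling edges, which a force-directed layout would otherwise fail to resolve.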

Why PolAI Graph Matters

The EU’s approach to AI is as ambitious as it is complex. Understanding how initiatives like GenAI4EU or Testing and Experimentation Facilities (TEFs) fit into the larger ecosystem is crucial for policymakers, researchers, and businesses. PolAI Graph makes this understanding accessible by turning dense information into an interactive experience.

What’s Next?

PolAI Graph is a work in progress. The next steps include:

Adding search functionality to locate specific entities instantly.
Incorporating time sliders to track how the EU AI landscape evolves over the years.
Expanding node attributes to include visual indicators for funding amounts or project durations.

I invite you to explore the PolAI Graph and share your feedback. Together, we can refine this tool into a resource that not only maps the EU AI ecosystem but also empowers informed decision-making.

AI Factories: How Europe Plans to Centralize AI Resources

Nicolo' Brandizzi — Tue, 17 Dec 2024 00:00:00 GMT

TLDR; AI Factories are Europe’s bet to centralize AI resources and boost innovation. Acting as one-stop shops, they offer supercomputing power, datasets, and expertise to startups, SMEs, and researchers. The initiative is funded up to 50% by the EU and split into three tracks: appointing, upgrading, or building AI-optimized supercomputers. Factories integrate with existing EU initiatives like EDIHs, EDICs, and TEFs. Challenges remain in coordination and fostering healthy competition, but success could transform Europe’s AI ecosystem.

Introduction

Here we go again! Another deep dive into the ever-evolving EU AI landscape.

This time, I set out to understand AI Factories, a new initiative that aims to scale Europe’s AI capabilities and innovation. And what better way to figure it out than by writing a blog post?

Why now? Well, the EU recently announced the selection of seven brand new AI Factories across Europe, marking a significant step forward in building state-of-the-art infrastructure to support AI innovation. It seemed like the perfect time to investigate what these factories are, what they’re supposed to do, and, of course, what this all means for Europe’s AI ambitions.

We briefly touched on AI Factories in the first post of this series. Now that we have real-world progress to discuss, let’s focus on what AI Factories actually are, what they’re supposed to achieve, and why this matters for Europe’s AI future.

For the curious reader, my sources include:

The official Call for Expression of Interest.
The 2021 regulation (2021/1173) that established the European High Performance Computing Joint Undertaking (EuroHPC JU).
The 2024 amendment (2024/1732) that introduced the AI Factories as a priority.
And, of course, the newly announced list of accepted proposals.

So, let’s dive in! 🚀

📍 Originally published at nicolobrandizzi.com.

Current Status: Europe’s Seven AI Factories

If you’ve been following AI developments over the past 4–5 years, you’ll know we’re living in a pivotal moment. AI has brought both hope – with advances in medicine, science, and productivity – and concern, particularly around growing inequality. Economically, though, the impact is undeniable. By 2030, PwC estimates AI will contribute $15.7 trillion to the global economy and boost GDP by 26% in some regions.

Europe, however, has struggled to keep up. Despite its strengths in regulatory leadership and ethical AI, it lags in terms of AI companies and funding. To put it into perspective:

Europe accounts for only 14.5% of global AI companies compared to 36.5% in the US and 26.8% in China.
AI spending in the EU stands at €45 billion annually, far behind the €300 billion in the US and €91 billion in China.
Programs like Horizon Europe have been crucial, but their average €3 million contributions haven’t been enough to ignite large-scale AI industry adoption (check out my full analysis here).

To address this, the European Commission announced the first seven AI Factories on December 10, 2024, with €1.5 billion in investments. These Factories are located in:

Finland: LUMI AI Factory in Kajaani.
Germany: HammerHAI in Stuttgart.
Greece: Pharos in Athens.
Italy: IT4LIA in Bologna.
Luxembourg: L-AI Factory in Bissen.
Spain: BSC AI Factory in Barcelona.
Sweden: MIMER in Linköping.

The goal? To create hubs of innovation where startups, SMEs, researchers, and industry can collaborate, access critical infrastructure, and develop AI solutions.

This is Europe’s answer to scaling up its AI ecosystem. Instead of relying on scattered small-scale funding, the AI Factories aim to centralize resources and close the gap with global leaders like the US and China.

What’s an AI Factory?

Let’s start with the easy part: What’s an AI Factory (AIF for short)?

At first, I’ll admit, I had a completely different idea of what an AI Factory would look like. I imagined something more project-oriented – like a company such as OpenAI or Anthropic – spitting out generative AI models that are bigger, shinier, and fancier than anyone else’s¹. Turns out, I was wrong.

The “Factory” Analogy

What I was missing was the “factory” part of AI Factory. Think of it like this (thanks to ChatGPT, which I’m shamelessly stealing this from):

Workers: Researchers, companies, and developers.
Machinery: Supercomputers and high-performance infrastructure.
Raw Materials: Datasets, algorithms, and software tools.
Production Process: AI training, fine-tuning, and experimentation workflows.
Final Product: AI applications, solutions, and models ready to be deployed.

Easy, right? Well… ish.

It’s a Physical Place

What really escaped me at the start was the sheer physicality of an AI Factory. This isn’t just some software program in the cloud or a collection of models sitting on a server. It’s a real, tangible place – a building you can walk into, with walls you can touch and doors you can open.

And with that physicality comes real-world problems – things you never think about when you’re running an AI model on your laptop. For example, a factory needs (long list incoming):

Land: Either build a brand-new facility (think 2300 square meters of high-tech centers), or retrofit an existing structure. That’s offices, server rooms, restrooms, meeting spaces – the whole package.
Infrastructure: Cooling systems (water and air), enough electricity to power your supercomputers, internet connectivity (at least 100 Gbit/s to connect to other HPC systems), roads for employees, and backup diesel generators for when the power grid fails.
Personnel:
- Security teams to keep systems safe (and shoot down anyone who tries training on copyrighted data... kidding, obviously).
- Engineers to install and maintain hardware.
- System admins to keep everything running smoothly.
- Cleaning staff (because, you know, people work there).
Technical Equipment:
- Supercomputers and GPUs (the sweet, sweet GPUs).
- Storage systems for enormous datasets.
- High-resolution monitors, servers, and all the peripherals you need.
- And as Italians say, “chi più ne ha più ne metta” – “the more, the better.”

And that’s just the physical setup. We haven’t even started talking about what an AI Factory actually does!

What is it supposed to do?

Good question! To answer that, let’s take a step back and look around. 👀

What do we see? Oh, look, it’s an LLM. It can process text and images? It speaks and understands speech?? OMG, it’s basically agentic now!

AI’s capabilities have exploded in the past few years, and that growth has sparked a need for something bigger – something centralized, powerful, and accessible.

Enter the AI Factory.

The Purpose of an AI Factory

According to the call for proposals, an AI Factory “shall create a one-stop shop for the users, including startups, small and medium-sized enterprises, and scientific users, to facilitate access to its support services.”

Let’s break that down.

This isn’t just about putting a big shiny supercomputer somewhere. An AI Factory is where AI innovation meets its users and turns ideas into solutions. It connects resources, tools, and expertise under one roof, enabling users to:

Access foundational models and specialized datasets.
Train and fine-tune AI systems on high-performance computing (HPC) infrastructure.
Collaborate with AI and HPC experts to solve technical challenges.

What makes these factories unique is their emphasis on integration. They aren’t isolated projects; they are part of a larger ecosystem. To succeed, AI Factories need to plug into existing EU initiatives like the European Digital Innovation Hubs (EDIHs), European Digital Infrastructure Consortiums (EDICs), the Common European Data Spaces (CEDS) and the Testing and Experimentation Facilities (TEFs)². These connections ensure that AI Factories don’t just exist – they thrive, align with national strategies, and drive innovation across the continent.

Now, I know this still sounds abstract. So let’s bring it to life with a story.

Lil’Bits: A Story of AI Innovation

You are the proud CEO of Lil’Bits, a cozy local restaurant that serves extremely tiny portions (you call it “artisanal minimalism,” others call it “eating air with a spoon.”). One day, you hear about an EU Innovation Conference happening nearby³ and decide to check it out. Why not? Maybe AI can help you make those portions even smaller.

The conference is buzzing. Talks about AI’s potential are everywhere: efficiency this, accuracy that. By the end of the day, your brain feels like mush, but you’re intrigued. Over coffee, you strike up a conversation with someone from an EDIH – a regional hub for AI adoption – who’s thrilled to hear about Lil’Bits. He claims AI can revolutionize your business. “We can make your portions so small they’re invisible!” he says, only half-joking.

He then introduces you to a colleague from the local AI Factory, who assures you they have everything you need: HPC resources, AI expertise, and access to consortium partnerships. You nod politely, still skeptical, until the magic words hit: “AI-reduced culinary portions”. Your eyes light up.

A Trip to the AI Factory

Fast forward to a sunny spring morning. You find yourself at the gates of the AI Factory campus. It’s bustling with activity – developers in hoodies, researchers in lab coats, and businesspeople who look like they just stepped off a plane. You think to yourself: this is where the future happens.

Your EDIH contact greets you and leads you through the facility. It’s a maze of rooms, each with its own purpose:

Server rooms hum with the sound of supercomputers crunching data.
Co-working spaces bring together developers, industry experts, and academics in lively discussions⁴.
Student areas feel like modern dorms, with young talent working on exciting projects⁵.

Finally, you reach a sleek conference room, where a team of experts is waiting. You pitch your vision: “I want to make food portions even smaller – smaller than anyone thought possible.”

They love it. A woman from business model development explains how you could turn Lil’Bits into a global chain⁶. Another expert from the Agri-food EDIC⁷ nods enthusiastically: “Micro-portions are revolutionary!” She mentions the ShrinkiFood Innovation Hub and how they specialize in AI solutions for food minimization.

You pause. “But do you even have the data to train a model for shrinking food?”

The expert grins. “Funny you ask. The AI Factory has access to a Common European Data Space⁸ for food innovation, where researchers and companies contribute high-quality datasets⁹. Your dream of micro-sized dishes? The data you need is right here, sitting in our server room.”

The Inevitable Question

It all sounds too good to be true. You’re ecstatic, but doubts creep in. “What about regulations? I’ve heard about the AI Act, and I’m pretty sure invisible food might break some rules…”

The room goes quiet. Finally, a representative from the TEF for Agrifood Innovation speaks up. He assures you they have facilities to test and validate AI systems under simulated, real-world conditions¹⁰. “We’ll make sure your AI model meets all ethical and trustworthiness standards,” he adds confidently¹¹.

The Future is Small

You leave the meeting overwhelmed but inspired. With the help of the AI Factory, you can finally make your grandfather’s dream come true: serving food portions so small they’re practically theoretical. You shake hands with the team and walk out with a smile, imagining millions of satisfied customers worldwide eating a meal they swear they saw, if only for a moment¹².

Overview and Takeaways

Hopefully, the Lil’Bits story was entertaining (and absurd) enough to keep you reading, but more importantly, it illustrated a key point: an AI Factory thrives on connections.

The purpose is to create a space where startups, researchers, SMEs, public institutions, and industry players can collaborate and innovate. These factories are designed to integrate into existing EU initiatives – like EDIHs, EDICs, CEDS, and TEFs – to ensure no one has to tackle AI challenges alone.

While the story focused on SMEs (because who doesn’t want AI-optimized culinary portions?), the AI Factory ecosystem serves a much broader range of users:

Researchers and Academics can leverage HPC resources, datasets, and training to drive AI research.
Public Sector Organizations benefit from AI adoption expertise offered through EDIHs.
Startups and SMEs gain access to business accelerators (like the EIC), data, and infrastructure to scale their AI solutions.
Large Industry Players can collaborate with experts and access advanced resources for AI development and training.

The key takeaway? AI Factories bring these groups together under one roof, providing the tools, knowledge, and physical space needed to turn ideas into real-world solutions.

And let’s not forget: skills and education are at the heart of this ecosystem. AI Factories will play a critical role in training the next generation of AI developers and researchers, ensuring that Europe remains competitive in the global AI race.

Types of Factory Tracks

Now that we’ve covered what an AI Factory does and why it matters, let’s look at how they’re actually implemented. Because, as impressive as these hubs of innovation sound, not all AI Factories are built from scratch. The EU proposal splits AI Factories into three tracks, depending on how “ready-to-go” your infrastructure already is.

You can think of it like renovating a house:

Track 1: Use what you’ve got – a quick clean-up and some new furniture.
Track 2: Give the place a serious upgrade – new floors, walls, and appliances.
Track 3: Build from the ground up – lay the foundation, choose a plot of land, and go full architect mode.

Here’s a quick breakdown:

| Track | Description | EU Contribution |
| --- | --- | --- |
| (1) AI Factories Track | Use existing EuroHPC systems. | Partial funding for costs. |
| (2) Upgraded Supercomputer Track | Upgrade existing EuroHPC infrastructure. | Covers upgrade costs. |
| (3) New AI Supercomputer Track | Build new AI-optimized infrastructure. | Highest funding allocation. |

“AI Factories” Track

This track is the fastest and least construction-heavy option. If you already have a EuroHPC supercomputer humming away in your facility, you can apply to have it appointed as an AI Factory.

The requirements are relatively straightforward: you need to prove that your system has enough computing resources to train general-purpose AI models.

The call doesn’t spell out strict benchmarks, but we can make some educated guesses:

Computing Power: Your system will likely be judged on existing benchmarks like HPL-MxP or MLPerf Training, which measure HPC performance.
Time-to-Solution: Another measure is the time it takes to train a specific large language model, for example, training a 10-billion parameter model within 45 days.
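To get a feel for what a time-to-solution target like this implies, here is a back-of-the-envelope sketch. It assumes the common "6 × parameters × tokens" rule of thumb for training FLOPs and a hypothetical budget of 20 training tokens per parameter. Neither figure comes from the call, and the function name is mine.

```python
def required_sustained_flops(params: float, tokens: float, days: float) -> float:
    """Sustained FLOP/s needed to finish training within `days`.

    Relies on the rough approximation: training FLOPs ~= 6 * params * tokens.
    """
    total_flops = 6.0 * params * tokens
    seconds = days * 24 * 60 * 60
    return total_flops / seconds


# Assumptions (not from the call): 20 training tokens per parameter.
params = 10e9           # 10-billion-parameter model
tokens = 20 * params    # hypothetical token budget
flops_per_s = required_sustained_flops(params, tokens, days=45)
print(f"{flops_per_s / 1e15:.2f} PFLOP/s sustained")
```

Under these assumptions the system would need roughly 3 PFLOP/s of sustained (not peak) throughput for the whole 45 days. Real hardware utilization is well below peak, so the machine itself would have to be considerably larger.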

If you meet these requirements, congratulations – you’ve got yourself an AI Factory. Just make sure you cover all the activities we discussed earlier (data, models, expertise) to keep the “factory” running smoothly.

Upgraded AI Optimised Supercomputer Track

If you already have a EuroHPC supercomputer, but it’s not quite AI Factory-ready, this is your track. Here, you don’t just appoint a system – you **upgrade it** with enhanced AI capabilities.

This track adds a little more work to the mix. In addition to meeting all the usual AI Factory requirements, you need to:

Specify the Upgrades: Show exactly what improvements are needed – for example, new GPUs, additional storage, or faster networking.
Prove Eligibility: Outline how these upgrades will make the system capable of hosting AI Factory activities.

AI Optimised Supercomputer Track

Finally we get to the big one. This track involves acquiring a brand-new AI-optimized supercomputer and setting up the infrastructure to host the factory. It’s also the most complex, for obvious reasons:

You need to find the perfect location – balancing factors like cooling efficiency, energy availability, and connectivity¹³.
If you’re working with a consortium (which you probably are), get ready for a tug-of-war between partners, since everyone will want the factory in their country. It brings prestige, jobs, and economic benefits – but also hefty responsibilities.

Budgets: What Does the EU Cover?

Now that we’ve covered the types of AI Factory tracks and their activities, let’s take a closer look at how the EU plans to fund them.

First things first, the AI Factory initiative draws on significant funding from the multiannual financial framework (2021-2027), which prioritizes the development and deployment of digital technologies, including AI.

Here’s how funding is structured:

Horizon Europe: Focuses on R&D&I projects and scaling up AI startups.
Digital Europe Programme: Expected to be the primary framework supporting AI Factories by funding technology deployment and training.
EuroHPC: As part of the AI Innovation Package, the amendment to the EuroHPC regulation has made AI Factories a new priority in terms of investments.

Funding will also come from member states and private companies to complement EU contributions.

In summary, the EU covers up to 50% of acquisition, operation, and upgrade costs for AI Factories, while member states and private partners fund the remainder.

This also translates into different allocations for the tracks:

| Track | EU Contribution | Hosting Entity Responsibility |
| --- | --- | --- |
| Existing EuroHPC Supercomputing Systems as AI Factories Track | Up to 50% of operational costs | Remainder of operational costs |
| Upgraded AI Optimised Supercomputer Track | Up to 50% of acquisition costs for upgrades (depreciated lifetime); up to 35% of additional operating costs | Remainder of acquisition and operating costs |
| New AI-Optimized Supercomputer Track | Up to 50% of acquisition costs for new supercomputers; up to 50% of operating costs | Remainder of acquisition and operating costs |
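To make the split concrete, here is a small sketch that applies the maximum rates from the table above to a budget. The track keys, the function name, and the example amounts are all hypothetical. The rates are ceilings ("up to"), so the actual EU share may be lower.

```python
# Maximum EU co-funding rates per track, in percent, as listed in the table.
# The "existing" track only lists operational costs, so acquisition is 0 here.
EU_MAX_RATE_PCT = {
    "existing": {"acquisition": 0, "operating": 50},
    "upgraded": {"acquisition": 50, "operating": 35},
    "new": {"acquisition": 50, "operating": 50},
}


def split_costs(track: str, acquisition_eur: int, operating_eur: int) -> dict:
    """Return the maximum EU contribution and the minimum hosting-entity share."""
    rates = EU_MAX_RATE_PCT[track]
    eu_max = (rates["acquisition"] * acquisition_eur
              + rates["operating"] * operating_eur) / 100
    total = acquisition_eur + operating_eur
    return {"eu_max": eu_max, "hosting_min": total - eu_max}


# Hypothetical budget: EUR 40M of upgrades plus EUR 10M of extra operating costs.
example = split_costs("upgraded", acquisition_eur=40_000_000, operating_eur=10_000_000)
print(example)
```

In this made-up case the hosting entity and its partners would still carry more than half of the bill, which is why the member-state and private contributions matter so much.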

National Strategy and Networking

To maximize their impact, AI Factories must align closely with the national AI strategies of the hosting country or consortium of countries. Applicants are expected to outline:

National AI Strategy Linkage: Describe how the AI Factory integrates into and advances the objectives outlined in the hosting country’s national AI strategy.
National Data Policies: Provide an overview of the current national data policies, including access to large datasets and knowledge corpora, with a focus on open, FAIR (Findable, Accessible, Interoperable, and Reusable) data¹⁴.
National Access Policy for AI Community: Applicants should develop user-friendly access policies for allocating national computing time on EuroHPC supercomputers.

The task is no small feat. It requires knowing the **national AI landscape** inside out and finding ways to integrate the factory’s goals with the broader ecosystem. Done right, the AI Factory becomes the nucleus for AI innovation – building momentum, fostering collaboration, and benefiting everything around it.

AI Factory coordination

So, let’s say you’ve set up your AI Factory. You’ve got the hardware, the datasets, and a steady stream of users. But there’s still one big question: how do these factories coordinate with each other?

The call for proposals and its amendment outline several strategies for collaboration:

Knowledge Sharing: Factories are expected to share best practices, lessons learned, and technical expertise, ensuring everyone benefits from what works (and what doesn’t).
Specialization: Each AI Factory should carve out a specific domain or expertise – this avoids duplicating efforts and maximizes efficiency.
Asset Reutilization: Sharing tools, datasets, or even specific resources across factories to make the most of existing investments.
Support and Training: Factories will collaborate on programs to train AI professionals, organize workshops, and upskill users.
Staff Exchange: Facilitating staff exchanges to promote knowledge transfer, collaboration, and cross-pollination of ideas.
Homogeneous End-to-End User Experience: For projects spanning multiple factories, the experience should be seamless and consistent for users, no matter where they access resources.

These are all great strategies on paper. They encourage cooperation, reduce redundancy, and create a networked AI ecosystem where everyone shares the benefits. But here’s where I get a little skeptical.

The EuroHPC Joint Undertaking (JU) is tasked with overseeing these coordination efforts – facilitating networks, developing standards, and ensuring alignment between factories. Don’t get me wrong, this is a well-thought-out approach. But the EuroHPC JU already has a ton of responsibilities on its plate, from managing supercomputing projects to supporting HPC innovation across Europe. Adding “coordinate the AI Factories” on top of that feels like it might stretch their capacity thin.

This is just my take, of course, but I can’t help wondering: is this enough? Later on, I’ll suggest a few alternatives.

Conclusion

Here we are, at the end of another blog post. This one has been especially close to my heart because it touches on something fundamental: centralizing AI resources while fostering innovation across Europe. The AI Factories initiative is ambitious, and if done right, it could be a game-changer for Europe’s AI ecosystem.

For those who skipped straight to the bottom (no hard feelings), here’s a quick summary.

Summary

The AI Factories are Europe’s bold bet on scaling up AI infrastructure and innovation:

They act as one-stop shops for startups, SMEs, researchers, and public-sector users to access supercomputing resources, datasets, and expertise.
They are designed to be interconnected with existing EU initiatives, such as European Digital Innovation Hubs (EDIHs), European Digital Infrastructure Consortiums (EDICs), Common European Data Spaces (CEDS), and Testing and Experimentation Facilities (TEFs), ensuring alignment and collaboration across the continent.
Factories are implemented through three tracks: appointing existing EuroHPC systems, upgrading current infrastructure, or building new AI-optimized supercomputers.
Funding comes from Horizon Europe, the Digital Europe Programme, and the EuroHPC Joint Undertaking, with up to 50% of costs covered by the EU.
AI Factories must align with national AI strategies and integrate into broader EU initiatives like EDIHs, EDICs, and TEFs to maximize their impact.
Coordination between factories remains a challenge, as the EuroHPC JU is tasked with managing this, despite its already heavy workload.

While the initiative has tremendous potential, a few questions remain:

Who ensures effective coordination?
Can competition between factories drive even more innovation?
How can secure and trustworthy data/model access be implemented?

Open Questions and Suggestions

Coordination Between Factories

One of the main concerns I have is the coordination challenge. The EuroHPC JU is currently tasked with supervising and connecting AI Factories, but this feels like adding too much weight to an already full plate.

So, what could be done differently?

Factory-Level Coordination Groups:
Each AI Factory could be required to establish a dedicated coordination group responsible for networking with other factories. Their progress could be evaluated through specific KPIs (Key Performance Indicators), ensuring that collaboration and knowledge sharing don’t fall through the cracks.
While this decentralized approach gives factories more autonomy, it’s not explicit in the current setup. Making it a formal requirement could make a big difference.
A Centralized Coordination Entity (CERN4AI):
Alternatively, I remain a firm believer in creating a superseding entity – call it CERN for AI, if you like. The AI Office is currently the closest thing to this idea, but it’s stretched thin with responsibilities like developing standards and aligning with the AI Act.
A dedicated entity focused solely on coordinating AI Factories could provide the oversight and direction needed to ensure these hubs work together effectively. This could be the missing piece to make Europe’s AI efforts truly cohesive.

What About Competition?

It’s well-documented that competition fuels innovation¹⁵. But looking at the current setup, we don’t see much competition between AI Factories.

Here’s an idea: introduce healthy competition through challenges.

Factories (or groups of factories) could compete on specific AI-related projects with clear, measurable KPIs.
To ensure fairness, introduce one or two secret benchmarks to evaluate outcomes.
Periodic evaluations could filter out underperforming projects while rewarding the best ones.

The rewards? Successful factories could receive additional funding, favorable conditions from EuroHPC JU, or even priority access to resources. This would drive innovation, foster collaboration, and create incentives to push boundaries.

Competition doesn’t have to mean isolation. In fact, it could encourage factories to form alliances, share expertise, and strive for excellence. After all, who doesn’t love a bit of friendly rivalry?

Ensuring Connectivity and Data Access

The proposal and amendment require that AI Factories be connected via two networks:

EuroHPC Hyper-Connectivity Network: This ensures high-speed data transfer and efficient communication between AI Factories across Europe.
GEANT Network: As part of the pan-European research and education network, this connection allows AI Factories to integrate closely with academic and research institutions.

While these connectivity measures are clear, the documents remain vague on how data access should be implemented. Factories are expected to retrieve data from Common European Data Spaces or secure, trusted repositories. On one hand, this flexibility avoids imposing a one-size-fits-all solution. On the other, it raises questions about standardization and interoperability.

This is where frameworks like Gaia-X could play a significant role. Gaia-X promotes federated access to shared resources, prioritizing privacy, security, and trustworthiness – all fundamental requirements for AI Factories.

I believe implementing a standardized yet flexible approach inspired by Gaia-X would greatly benefit AI Factories.

While it’s still too early to say if this will happen, I’m optimistic. We’ve seen the rise of federated access systems across the EU, and AI Factories are the perfect opportunity to adopt such standards. By doing so, Europe can ensure its AI infrastructure is not only powerful but also secure, interoperable, and future-proof.

Final Thoughts

The AI Factories initiative is an ambitious move to centralize Europe’s AI efforts, address resource fragmentation, and provide real opportunities for innovation. It’s a bold bet – and while the framework is promising, there are still challenges to overcome.

From improving coordination to exploring competition as a driver of innovation, there’s room for improvement. But if Europe gets this right, AI Factories could become the cornerstone of its AI strategy, transforming ideas into solutions that benefit everyone.

Let’s keep an eye on how this unfolds – I, for one, will be watching closely.

Notes

Side note: I’m planning another post (read: rant) about why we need secret eval benchmarks. Currently, no one’s stopping companies from training on benchmarks themselves, which undermines scientific progress. Also, profit margins and science are way too aligned lately. Cutting corners for flashy papers? It’s happening. Let me know if you’d like me to dig into this!
I covered all of these entities in my blogpost on the Coordinated Plan for AI.
European Digital Innovation Hubs (EDIHs) are regional hubs specifically designed to address the needs of local businesses. Each EDIH focuses on its designated region, offering tailored support for digital transformation. For example, an SME in a rural area might lack the resources to experiment with AI – but their local EDIH can provide access to technology testing, training, and even connections to European-level initiatives. The local focus ensures that innovation isn’t just centralized in big cities but reaches every corner of the EU.
AI Factories are designed to foster collaboration. This includes co-working spaces where industry professionals, researchers, and developers work side-by-side. These shared spaces are meant to break silos and encourage knowledge exchange, ensuring that innovation happens where it’s needed most.
The AI Factory is also a hub for students and young talent. Facilities often include dedicated workspaces, labs, and even dormitories where students can work on AI projects, develop new ideas, or participate in research collaborations. This setup is part of the EU’s long-term vision to build AI skills and foster the next generation of AI developers and researchers.
The European Innovation Council (EIC) plays a role in supporting AI startups and SMEs. One of its key offerings is business model development and acceleration services. Through programs like the EIC Accelerator, businesses receive support to scale their operations, attract investment, and access new markets. In the AI Factory context, EIC services could help a niche AI solution evolve into a scalable business model – whether it’s shrinking food portions (à la Lil’Bits) or developing cutting-edge AI tools for industry.
The Agri-food EDIC is a planned European Digital Infrastructure Consortium focusing on agriculture and food innovation. By collaborating with AI Factories, the Agri-food EDIC can ensure that agriculture-specific AI solutions get the compute power and expertise they need to thrive.
The Common European Data Spaces (CEDS) are collections of high-quality, purpose-driven datasets that serve specific sectors or applications. These datasets are designed to be Findable, Accessible, Interoperable, and Reusable (FAIR principles), which makes them perfect raw material for training AI models. According to the AI Factory call, high-speed access to these spaces is mandatory, and hosting the data on-site is strongly encouraged. Why? Because it makes the factory a centralized hub for both data storage and AI development. Bonus points if access to the data is provided in a trustworthy and secure way – think Gaia-X federation and similar EU-led data initiatives.
Data sharing in an AI Factory is both a natural consequence of bringing together so many different actors – startups, SMEs, researchers, and industry players – and a clear priority for the EU. When you put all these people in one place with access to powerful infrastructure, sharing data becomes both practical and incentivized. Moreover, the Data Act, which I covered in detail in my blogpost on the EU Data Strategy, actively pushes for secure and standardized data sharing between businesses (B2B), governments (G2B), and public sectors.
AI Factories must have strong ties to Testing and Experimentation Facilities (TEFs), which play a crucial role in validating AI systems. TEFs are specialized environments (both physical and virtual) where AI solutions can be tested under realistic conditions before hitting the market. For example, if your AI system promises to optimize food production or, you know, shrink portions to microscopic sizes, the Agrifood TEF can simulate real-world agricultural and industrial settings to check if it actually works. This setup allows the AI Factory to focus on building and refining models while a dedicated entity (the TEF) handles testing and evaluation – a beautiful division of labor.
At the moment, there’s no single TEF dedicated to trustworthiness and ethical assessments of AI systems – which is a bit of a gap, considering the rapid pace of AI development and how quickly regulations like the AI Act are evolving. Trust and ethics are often tacked on as smaller components of broader TEF programs. Personally, I think there’s room for an ad hoc TEF specializing in ethical AI. Such a facility could focus exclusively on evaluating AI systems against fairness, transparency, and safety benchmarks, helping businesses navigate the tricky intersection of innovation and regulation.
Fast forward 30 years: you’re standing on stage, holding the prestigious EY Entrepreneur of the Year award. The crowd cheers. Journalists swarm around you to get a glimpse of your groundbreaking achievement: the plankton-sized pizza. On the cover of Time magazine, a dish so small it requires a magnifying glass to see. The caption reads: “Smaller than an Atom: The Plankton Pizza Revolution”
Supercomputer site selection involves balancing factors like energy cost and availability, climate for cooling efficiency, and connectivity to high-speed networks. Risk of natural disasters and regulatory incentives also play key roles, alongside proximity to talent pools and end-users. Sustainability is critical, focusing on renewable energy and green cooling, as is scalability for future growth.
If no policy exists, the applicants must present a clear plan for making large datasets accessible to the AI Factory ecosystem. This includes mechanisms for accessing proprietary data with potential fee schemes for AI training, fine-tuning, and inference.
This applies when “the market is contestable, in the sense that innovators can successfully escape competition, and whether the innovation is appropriable, meaning that successful innovators can capture the benefit from innovation, at least temporarily.”

The Big Questions on Europe’s AI Data Policy

Nicolo' Brandizzi — Mon, 28 Oct 2024 00:00:00 GMT

TLDR; In the EU, anonymize personal data unless necessary to reduce bias, and ensure compliance with data minimization principles. For copyrighted material, you can train models unless owners explicitly opt out using appropriate machine-readable tags. Outputs generated by AI systems are public domain unless human creativity is involved, which allows copyright claims. The rules remain broad, creating ambiguity across member states.

Intro

After writing my previous post on the EU Data strategy, I was left with a sense of dissatisfaction. Working in data myself, I failed to see clear and defined rules on how we’re supposed to collect/process data.

For this reason, I took some very real problems the sector is facing (in training LLMs, though they extend to other modalities too) and turned them into questions I aimed to answer:

How can you handle Personally Identifiable Information (PII) in your data?
- Should you anonymize data?
- What is the best way to access data?
What about licensed and copyrighted material?
- Can you train your model on copyrighted material?
- What about intellectual property?

Documents and Conventions

Before starting, I should clarify which documents are which (mostly for myself, I need this). Since some documents are known by an official name (e.g., Regulation (EU) 2016/679) and some by a common name (the GDPR, for that same regulation), here’s a list:

Regulation (EU) 2016/679: General Data Protection Regulation, aka GDPR
Regulation (EU) 2018/1725: New Data Protection Framework, aka NDPF1725
Directive (EU) 2016/680: Law Enforcement Directive, aka LED680
Directive (EU) 2019/790: Directive on Copyright in the Digital Single Market, aka DC790
Directive 96/9/EC: Database Directive, aka DD96
Directive 2001/29/EC: Copyright Harmonization, aka CH29

For citation, I’ll use a simplified notation: Article X.Y.Z, where X is the article number, Y is the paragraph, and Z is the point¹. For the introductory sections, I’ll use Recital X.Y.Z. This might be obvious for those with a legal background, but for my fellow tech people - this one’s for you! <3

With this said, los geht’s!

Personally Identifiable Information

Let’s start by looking at how the AI Act handles personal data. Searching for “personal data” in the Act reveals several key points:

Recital 10: GDPR’s fundamental rights to data protection still apply. No exceptions for AI training.
Recital 67²: You must be transparent about how and why you originally collected the data.
Recital 69³: Emphasizes data minimization and protection principles. You should anonymize and encrypt data where possible, or better yet, use federated approaches where algorithms go to the data instead of data moving around.
Recital 140: Discusses data sandboxes⁴ and how personal data collected for other purposes can be used in AI systems only under specific conditions⁵.
Article 10: Focuses on data governance, with a range of requirements if you’re working with personal data⁶.
Article 59: Details rules for using personal data in data sandboxes. Essentially, in these sandboxes, you may process “lawfully” gathered personal data, but only if working in sectors like healthcare, environment, transport, or public sector efficiency. Additionally, data must be deleted after use.

To summarize, anonymizing or pseudonymizing personal data is essential. In rare cases where full anonymization isn’t feasible, you must demonstrate that processing the personal data is necessary to mitigate biases.

Data Features

Recital 69 introduces the principle of “data minimization and protection”, but what does that actually mean? According to Recital 67 of the AI Act, the data used for AI also needs to be of “high quality”, a somewhat vague term. In essence, data used in AI should meet certain standards and adhere to specific principles.

Principle of Data Minimization

The European Data Protection Supervisor defines the principle of data minimisation as follows: “data controllers should collect only the personal data they really need, and should keep it only for as long as they need it.” Working in data myself, the question of “how much data is enough” often pops up. In the context of training large language models, it sometimes feels as though there’s no such thing as “enough” data, suggesting that data minimization could be overlooked if LLMs demand more data from various sources. Furthermore, Article 10.5 of the AI Act allows the collection of personal data specifically to reduce biases, potentially creating a loophole where vast amounts of personal data could be collected without anonymization, all under the aim of mitigating bias in LLM training⁷.

High-Quality Data

Recently, my team and I published an article discussing our data processing methods for one of our projects. In it, we touch on a field-wide issue: there’s no standard definition for “high-quality” data, and different techniques are emerging to assess various properties of text. So, I was surprised to see “high-quality data” mentioned in the AI Act. Reading through, I noticed the definitions are still pretty broad and seem to revolve around a few main concepts:

Biases (Recital 67): We’ve talked about this before. The AI Act also acknowledges the risk of feedback loops where biased data leads to biased models, which in turn produce even more biased data, creating a self-sustaining loop.
Relevance and Representativeness (Article 10.3): High-quality data should be relevant to the AI system’s intended purpose and sufficiently representative of the real-world scenarios the system is expected to operate in. You could argue that if you’re trying to build artificial general intelligence, all data is relevant.
Error-Free and Complete (Article 10.3): I have no idea what this means from a technical perspective. What counts as an error in a text? A typo? A syntactic issue? A conspiracy theory (non-factual content)? Thankfully, the Act adds “to the best extent possible” before this requirement (Article 10.2.h).
Appropriate Statistical Properties (Article 10.3): Another vague requirement. However, if you can define certain statistical properties as “quality signals” (like factual correctness, average word length, or toxic content), you could start measuring these to give concrete values for each of these signals.
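To make the “quality signals” idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the signal names, the naive `\w+` tokenization, and the tiny `FLAGGED_TERMS` blocklist are hypothetical and are not prescribed by the AI Act or by any particular data pipeline.

```python
import re

# Hypothetical blocklist, purely illustrative; a real pipeline would rely on
# a curated lexicon or a trained classifier instead.
FLAGGED_TERMS = {"idiot", "stupid"}


def quality_signals(text: str) -> dict:
    """Compute a few simple, measurable signals for one document."""
    words = re.findall(r"\w+", text.lower())  # naive word tokenization
    total = len(words)
    if total == 0:
        return {
            "word_count": 0,
            "avg_word_length": 0.0,
            "unique_word_ratio": 0.0,
            "flagged_term_ratio": 0.0,
        }
    return {
        "word_count": total,
        # Average number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / total,
        # Share of distinct words: a rough proxy for repetitiveness.
        "unique_word_ratio": len(set(words)) / total,
        # Share of words that appear in the illustrative blocklist.
        "flagged_term_ratio": sum(w in FLAGGED_TERMS for w in words) / total,
    }
```

Signals like these don’t define “high quality” on their own, but they give you numbers you can report and threshold, which is arguably closer to what “appropriate statistical properties” is asking for.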


Back to PII

This section was about PII—did you forget? I definitely did and had to reread from the start.

So, did we manage to find an answer on handling PII in your data? Surprisingly, yes.

The golden rule: always anonymize your data! Removing PII from text is a growing field of research⁸, and there are some established methods out there. But if you really want to play at pro level, consider never even touching the data directly. Instead, go for a federated approach (wink wink to Gaia-X).
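As a rough idea of what the simplest form of PII removal looks like, here is a small Python sketch based on regular expressions. It is an assumption-laden toy: the two patterns and the `[EMAIL]` / `[PHONE]` placeholders are hypothetical choices, they only catch obvious e-mail addresses and phone-like digit sequences, and they will miss names, addresses, and plenty of other identifiers. Real de-identification systems usually combine rules like these with named-entity recognition.

```python
import re

# Illustrative patterns only; neither is exhaustive nor standards-compliant.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace obvious e-mail addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_pii("Write to anna@example.org or call +49 30 1234567."))
```

Note that this is redaction rather than true anonymization: the remaining text may still allow someone to be re-identified from context.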

Finally, if none of this appeals and you’re doing nothing to address PII, just say you need it all to mitigate bias! Sure, recent research might call this out as ineffective, but you can try to spin it using scaling laws.

Copyright and Intellectual Property

Now let’s dig into copyright and intellectual property (IP).

What do we actually care about here? For AI systems, I’d say it boils down to two main questions:

If I train a system on copyrighted material (say, with non-commercial clauses), can I use the system for commercial purposes?
And what about intellectual property? If my model spits out something close to the input data, can I get sued? If the output is modified, when does it stop being IP from the original source?

Train on Copyrighted Material

So, let’s check the AI Act. First up, if you’re conducting research, you’re in the clear and you can train on pretty much anything (Recital 25). But if you’re working on something for commercial purposes, things get trickier.

Copyright vs. Terms of Service vs. License

Before we start discussing training on copyrighted data, let’s clarify what copyright really means and how it relates to other terms. If you’re already familiar with these concepts, feel free to skip ahead.

In simple terms, copyright protects the ownership of the original work. For example, if you scroll down to the bottom of this post, you’ll see “© Copyright 2024, All right Reserved By - Nicolo’ Brandizzi”, which means I own this post. Now, say you want to use this blog post to create a podcast episode. Can you read parts of my blog post for your podcast? That’s where the license comes in. A license regulates how copyrighted work can be used by others. If I license this post for “non-commercial use only,” you wouldn’t be able to monetize it on your podcast. Finally, since this post is hosted on my website, which is also hosted on GitHub Pages, we’re both subject to GitHub’s Terms of Service (ToS). GitHub’s ToS states that users aren’t allowed to distribute my content outside of GitHub (e.g., mirroring my post on other sites). So, the ToS basically defines the rules we all have to follow to use a platform or service.

Custom Licenses

Here’s the thing: there’s no fixed set of licenses, or ways to interact with copyrighted content. You’ve probably heard of licenses like Creative Commons, the MIT License, and the GNU General Public License. These are well-established, but you can always create your own!

For instance, I could create a unique license for this blog post where I allow AI training only on paragraphs that start with a consonant. Unusual? Definitely. But still legally binding.

Why don’t we see more custom licenses? Probably due to several factors: legal expertise is usually required, and custom licenses are harder to interpret and enforce, which can lead to legal headaches.

Opting Out of AI Training

If you’re training a model for commercial use, Recital 105 makes it clear: you need to check if the data owner has explicitly said, “Hey, don’t use this for AI”. If there’s no specific mention or opt-out, you’re technically in the clear to use the data.

So, how do data owners actually opt out of having their content fed into an AI? Well, the Act doesn’t lay it out in exact terms but points⁹ us to Article 4.3 of DC790, which says that rights holders need to express their wishes in an “[…] appropriate manner, such as machine-readable means in the case of content made publicly available online.”

In plain terms, if the opt-out isn’t “appropriate” (like, say, yelling “NO AI” from your window), your data might still be used for AI training. But let’s clarify something important: if the data isn’t labeled with a “do not train” tag (like a machine-readable indicator), it doesn’t mean copyright protection vanishes. It just means that, absent an explicit opt-out from the owner, you can use it for training. Copyright remains; these rules simply permit this use when no opt-out has been expressed.

An Example of Appropriate Opting Out

So, what does this “appropriate manner” actually look like?

Take DeviantArt, for example (a popular digital art platform you may know for other reasons too 👀). If we check out their terms of service, point 24 states: “Unless you actively give your consent, for Artificial Intelligence Purposes, DeviantArt will include a robots meta tag with the “noai” or “noimageai” directive in the head section of the HTML page associated with that Content on the Site […]”

This is a fantastic, straightforward way to set your intent individually. Now, if we could just agree on standard tags to cover different data types (e.g., “notextai,” “noaudioai,” etc.), we’d be all set.
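For the tech people: checking for this kind of tag is easy to automate. Below is a minimal sketch using only Python’s standard library. The directive names `noai` and `noimageai` come from the DeviantArt quote above; the class and function names are hypothetical, and this is not an official or complete opt-out checker (it ignores HTTP headers, robots.txt, and any other signal a site might use).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def ai_opt_out(html: str) -> dict:
    """Report whether the page declares the 'noai' / 'noimageai' directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "noai": "noai" in parser.directives,
        "noimageai": "noimageai" in parser.directives,
    }


page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(ai_opt_out(page))
```

A crawler could run a check like this before storing a page. Keep in mind that a missing tag only tells you this particular signal is absent, not that the owner has no objection.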

What About the EU?

DeviantArt’s approach is commendable, but it’s not exactly a universal standard yet. So, does the EU offer a standard for opting out? Kind of.

The DC790 introduces the concept of Text and Data Mining (TDM), referring to automated methods of analyzing vast amounts of text and data. TDM’s copyright implications are complex, mainly because it’s wrapped up in copyright exemptions. For instance, Article 3 of DC790 grants certain exemptions to copyright¹⁰ if entities conduct research¹¹.

For everyone else, data use is allowed if the data doesn’t explicitly prohibit TDM and if you’ve accessed it lawfully (i.e., no hacking). In that case, Article 4 of DC790 exempts you too.

So you might think you’re covered by opting out of TDM and keeping your data safe, right? Well, yes and no. Remember how I mentioned that there’s no standardized opt-out? This is a well-documented problem and hasn’t been solved yet. As it stands, there’s no universally recognized method for opting out.

Whose Job Is This?

In recent years, we’ve seen a flood of datasets released each year¹² on various platforms like HuggingFace, Zenodo, Kaggle, and so on.

So, who’s responsible for collecting the “right” data? Is it the dataset provider (i.e., that one struggling PhD student) or the people who end up using the data? Well, according to Recital 106, it’s the latter. The AI providers bear the responsibility of identifying the copyrights associated with the data they use.

I’m genuinely conflicted about this. On one hand, it makes sense to hold the people actually training the models accountable, given they likely have the resources to investigate copyright. But on the other hand, datasets often come stripped down, with only the core content (like text from blog posts or images from DeviantArt), leaving out the original HTML or metadata where copyright info might have been stored. So, what’s the solution in these cases?

Honestly, I think we should obligate or at least heavily incentivize dataset collectors (yes, our overwhelmed PhD students again) to include copyright information for each data point. This small step could go a long way.
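To show how small this step could be, here is a hedged sketch of what a per-data-point record might look like. The field names (`source_url`, `license`, `tdm_opt_out`) are hypothetical and not taken from any existing dataset standard; the point is only that rights information travels with the content instead of being stripped away.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataPoint:
    """One dataset entry that keeps its rights metadata next to the content."""

    text: str
    source_url: Optional[str] = None    # where the content was collected
    license: Optional[str] = None       # license string as found at the source
    tdm_opt_out: Optional[bool] = None  # None means "unknown", not "allowed"


def missing_rights_info(dp: DataPoint) -> List[str]:
    """List the rights-related fields the dataset collector left empty."""
    fields = ("source_url", "license", "tdm_opt_out")
    return [name for name in fields if getattr(dp, name) is None]


stripped = DataPoint(text="Some scraped blog post")
print(missing_rights_info(stripped))
```

Keeping “unknown” distinct from “no opt-out found” matters here: a model provider reading the dataset can then tell which entries still need checking.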

Summary on Copyrighted Material

Check before you train: If you’re working on a commercial AI, it’s on you to make sure the data is free to use. Look for opt-out tags or warnings.
Opting out is possible: Data owners can make it clear their data is off-limits with metadata, terms, contracts, or even public statements.
Stay responsible: The EU wants a balance between innovation and protecting rights, so don’t think you can ignore these rules and use everything freely.

Derivative Works

Alright, let’s dive into the second question (the one about intellectual property, hope you didn’t forget already).

This one is hard. Why? Because it’s basically philosophy.

Take my blog post here, for instance. If I license it with a “non-commercial use” clause, that means you can’t turn it into a podcast for profit. But what happens if I add a “no derivatives” clause? What does “derivative” even mean?

The Blurry Line of Similarity

When it comes to defining what’s derivative, things get murky fast. Copyright and trademark laws don’t follow one-size-fits-all rules: the outcome depends on context, interpretation, and, often, a judge’s decision.

Take the Creative Commons non-derivative license. It says, “If you remix, transform, or build upon the material, you may not distribute the modified material”. Terms like “remix,” “transform,” and “build upon” are vague, and they link to footnotes that dig into what counts as an adaptation. In short, “modification rises to the level of an adaptation under copyright law when the modified work is based on the prior work but manifests sufficient new creativity to be copyrightable, such as a translation of a novel from one language to another, or the creation of a screenplay based on a novel.”

This “sufficient new creativity” is a fuzzy area, and only a court can really decide what qualifies. I found a few landmark cases that talk about this gray zone:

DC Comics vs. Gotham Garage (2015): DC Comics sued a car manufacturer for creating replicas of the Batmobile. Although the Batmobile is “just a car”, the court ruled it was a protected character due to its unique design and connection to Batman’s world. This shows that even vehicles can be characters under copyright law when they are distinctive enough.
Universal vs. Nintendo (1982): Universal claimed Nintendo’s Donkey Kong was too similar to King Kong. However, the court sided with Nintendo, ruling that the broad idea of a giant ape was public domain, and Donkey Kong was distinct in its own way. Here, the court acknowledged similar themes, but differences in characterization and execution allowed Nintendo to win.
Mattel vs. MGA Entertainment (2011): Mattel, the maker of Barbie, sued MGA, claiming the Bratz dolls were derivative of Barbie. After years of litigation, the court ruled in favor of MGA, noting that while the two products shared similarities as fashion dolls, the distinctive style and design of Bratz dolls made them original enough. The case highlighted how styling and branding can create separation, even within similar categories of products.

Each of these cases shows that there’s no universal rule for “too similar.” Courts look at key characteristics, overall design, and whether the new work borrows too much from the original or brings something fresh.

So, what does this mean for our AI models? The answer, honestly, is that no one knows yet¹³.

Who owns the output?

So, imagine you’re using an AI system to clean up the readability of your blog posts. The system takes your messy, grammatically broken draft and spits out a polished, readable piece instead. Who owns this newly polished version? Let’s run through the options:

The company providing the AI service
The AI system itself
You
It depends
No one

If you guessed “it depends”, you’re on the right track!

Copyrighting AI-Generated Outputs

Out of the 5 options I enumerated before, let’s eliminate option 2 right away. No, the AI system itself can’t claim ownership: by law, copyright is only granted to natural and legal persons¹⁴, meaning humans or organizations. AI isn’t considered either of these (yet), so it’s out.

What about option 1, the company providing the service? If they owned the output, users would likely abandon AI systems altogether, since it would mean you couldn’t use any AI-generated work commercially. That’s why most AI companies state in their terms of service that the output belongs to the user¹⁵. Besides, copyright requires human creativity, something the company didn’t exactly contribute.

That leaves us with two possible owners: you or no one, and two levels of human involvement: some, or none. It’s straightforward: if the AI system alone generated the content with zero human creativity, it falls into the public domain. Nobody owns it.

On the other hand, if you co-created with the AI, adding your own creativity, then you meet the human creativity requirement and can claim copyright. But how much involvement is enough? Probably more than just pushing a button and sitting back. For a deep dive into this, check out this paper.

Summary on Derivative Works

So, what does all this mean for AI? If your AI system is just reproducing the exact same input data, prepare to get sued for copyright infringement. For everything else, the jury’s still out, but one thing is clear: to copyright AI-generated content, you need to contribute some genuine human creativity.

Conclusion

So, we’ve made it to the end. First off, thank you for sticking around this long! (And if you just jumped to this section from the intro, no hard feelings. <3)

Key Take-Aways

To recap, here’s what we’ve covered:

PII Handling: Always anonymize or pseudonymize data, and only use personal data when necessary to mitigate biases. Following data minimization principles is crucial, though federated approaches can further limit direct data access.
Data Quality: The AI Act emphasizes high-quality, error-free, relevant, and representative data, but “high quality” lacks a strict definition, leaving much up to interpretation.
Copyrighted Material: When using copyrighted data, check for opt-out tags or machine-readable permissions, as unmarked data could be used by default. However, this doesn’t nullify copyright protections; it simply allows use unless the owner explicitly opts out.
Derivative Works: The concept of “derivative” remains legally ambiguous, with AI outputs falling into a gray area regarding originality and transformation.
AI-Generated Outputs: Ownership of AI-generated content depends on human involvement. Minimal involvement, like pressing a button, isn’t sufficient for copyright, as outputs must involve meaningful human creativity.

Also, fun fact: I used ChatGPT to help summarize this list. Does that make it public domain? Well, that’s still a gray area…

Personal Take-Aways

As some of you know, my background is pretty technical, and coming from engineering, I’ve learned there’s always an answer (a yes or a no, a number that defines a minimum threshold, or something equally straightforward).

This? This is the complete opposite. Regulation is a different world where you’re defining acceptable behavior while considering so many factors that listing them all is already a task. After all that effort, the regulation you create can still end up used in ways you’d never imagine, sometimes for the worse¹⁶.

So, why am I saying this? Maybe I’m just trying to convince myself that the EU has a monumental task in balancing regulation with freedom. Unlike the U.S., the EU isn’t a single state but a collection of member states, each with its own priorities. If you pass a law that’s too restrictive, a member state might veto it. So, regulations have to be broad enough to avoid pushback while still trying to meet their original goals.

But that’s also why there’s a lack of specifics in EU policy. And while it makes sense, it doesn’t help. We seriously need standards (like a consistent way to opt out of AI training) and precise laws. This ambiguity hinders innovation. After all, a law might be interpreted one way in Spain and completely differently in Poland¹⁷. And no one wants to risk a lawsuit over unclear laws.

It’s ironic. Legislating is about weighing endless variables and unpredictable outcomes. You know what’s good at that? AI. Yes, this is a place where AI could actually help. In an ideal future, I’d love to see AI draft laws, with humans having the final say. Humans would always hold the power, but it’s becoming clear that we’re not always great at weighing long-term consequences. So, why not let something more capable handle that?

Notes

I actually found that the proper citation style is in the form Article X(Y)(Z), but that looks awful to me. Too many parentheses…
In order to facilitate compliance with Union data protection law, such as the GDPR, data governance and management practices should include, in the case of personal data, transparency about the original purpose of the data collection.
The right to privacy and to protection of personal data must be guaranteed throughout the entire lifecycle of the AI system. In this regard, the principles of data minimisation and data protection by design and by default, as set out in Union data protection law, are applicable when personal data are processed. Measures taken by providers to ensure compliance with those principles may include not only anonymisation and encryption, but also the use of technology that permits algorithms to be brought to the data and allows training of AI systems without the transmission between parties or copying of the raw or structured data themselves, without prejudice to the requirements on data governance provided for in this Regulation.
Information on data sandboxes is still limited. The AI Act only mentions that each EU member state is responsible for setting up these sandboxes (Recital 138), where regulations may be relaxed to promote innovation. Realistically, they’ll likely resemble the testing and experimentation facilities (TEFs) we discussed in the post on the Coordinated Plan on AI.
The statement refers to the following articles: (i) Article 6.4 GDPR: This article deals with the legal basis for processing personal data, requiring that processing be “necessary for compliance with a legal obligation to which the controller is subject.” This emphasizes that any processing of personal data for law enforcement or judicial cooperation purposes must have a solid legal basis under EU or Member State law. (ii) Article 9.2.g GDPR: This provision concerns the processing of special categories of personal data, which are considered more sensitive and require higher levels of protection. Point g specifically allows processing such data when “necessary for reasons of substantial public interest, on the basis of Union or Member State law which shall be proportionate to the aim pursued.” This is relevant because law enforcement activities often involve sensitive data like criminal records or biometric information. (iii) Articles 5, 6, and 10 of the New Data Protection Framework: These articles mirror the principles of lawfulness, fairness, and transparency in data processing, the legal basis for processing, and limitations on processing sensitive data, respectively, as outlined in the GDPR. Their inclusion emphasizes that these principles also apply to EU institutions and bodies, ensuring consistent data protection standards. (iv) Article 4.2 and Article 10 of the Law Enforcement Directive: While not explicitly stated, the reference to these articles without prejudice suggests that the statement does not intend to limit or restrict the application of these provisions. Article 4.2 defines “competent authority,” which is crucial for determining which entities are subject to the directive’s rules. Article 10 deals with data protection principles, requiring that processing be “necessary and proportionate” and carried out “with due regard for the legitimate interests of the data subject.” This reinforces the need for a balanced approach that protects both individual rights and law enforcement needs.
Among the musts we can see: (2.b) always disclose where your data came from and how you collected it; (2.c) how you processed your data before use; (2.d) formulate assumptions about what you expect to find in your data; (2.f) examine possible biases; (2.g) given the biases you suspect, implement measures to detect and mitigate them; (4) data should be geographically, contextually and behaviourally appropriate for the intended purpose (aka if you want an LLM for the EU, don’t train it on American data only); (5) if you really cannot do anything about biases with your current data, you are exceptionally allowed to process other categories of personal data; (5.c) in this case you have to document who has access to the data; (5.d) you cannot transfer them; and (5.e) you have to delete them once you’re done.
However, recent studies (Dong et al. 2024 and Huang et al. 2023) have shown that increasing the quantity of data is not necessarily the most effective technique for reducing bias.
Look up “de-identification” on Google Scholar.
Recital 106: “[…] comply with the reservation of rights expressed by rightsholders pursuant to Article 4(3) of Directive (EU) 2019/790.”
Specifically, this article exempts data use under Article 5.a (giving authors control over reproducing their database content) and Article 7.1 (protecting against data extraction and re-utilization) from DD96, Article 2 of CH29 (protecting authors’ copyright), and Article 15.1 of DC790 (covering press publications and online use).
But to be clear, these exemptions are limited to “research organizations and cultural heritage institutions”. DC790 defines a research organization as “a university, including its libraries, a research institute or any other entity, the primary goal of which is to conduct scientific research or to carry out educational activities involving also the conduct of scientific research: (a) on a not-for-profit basis or by reinvesting all the profits in its scientific research; (b) pursuant to a public interest mission recognized by a Member State”.
According to Fortune Business Insights, the AI training dataset market is projected to grow from $2.39 billion in 2023 to $17.04 billion by 2032 with a compound annual growth rate of 24.7%!
There are various court cases open: Sarah Andersen et al. vs. Stability AI, Midjourney, and DeviantArt (2023), Getty Images vs. Stability AI (2023), Artists Against OpenAI’s DALL-E.
I need to share this story I found when I was looking into animal copyright. Back in 2011 the photographer David Slater traveled to Indonesia to photograph wildlife. During his trip, a crested macaque monkey took his camera and snapped several photos, including a famous selfie. When Slater published the photo, it became popular, and he later asserted copyright over the image. However, in 2015, PETA filed a lawsuit on behalf of the monkey, arguing that Naruto, the macaque, should hold the copyright to the photo. They contended that since the monkey had created the photograph independently, copyright should belong to Naruto, with PETA as a legal representative to protect Naruto’s interests. The U.S. District Court dismissed the case, ruling that animals do not have standing to hold copyright under U.S. law. According to the court, copyright law in the United States only applies to works created by human authors, and no existing legal framework allows non-human animals to claim copyright ownership.
Among these companies we have OpenAI, Microsoft and Anthropic.
One well-known example of an EU regulation used in an unexpected way is the General Data Protection Regulation (GDPR). Originally intended to protect personal data privacy, it has also led to some unintended consequences. For instance, in 2018, journalists in the UK tried to use Freedom of Information requests to obtain information on political donations and lobbying activities. However, some organizations and government bodies cited GDPR as a reason for refusing to share data, claiming that revealing donors’ identities could infringe on personal privacy. This led to concerns that GDPR was being leveraged to limit public accountability rather than protect individual privacy.
Labor laws for gig economy workers vary widely in the EU. Spain’s 2021 “Riders’ Law” classifies gig workers as employees, granting them minimum wage, social security, and other protections. Poland, on the other hand, largely treats gig workers as independent contractors, offering fewer protections and more flexibility.

The EU’s Data Strategy: Unlocking Europe’s AI Future

Nicolo' Brandizzi — Wed, 23 Oct 2024 00:00:00 GMT

TL;DR: This blog post covers the evolution of the EU’s data strategy, highlighting major milestones like the Open Data Directive (2019), the EU Strategy for Data (2020), the Data Governance Act (2022), and the Data Act (2024). It explores their impact on data sharing, AI development, and innovation across sectors. Key topics include user empowerment, B2B and B2G data sharing, Common European Data Spaces, and the push for interoperability. While the EU’s vision is ambitious, it faces technical and implementation challenges. Overall, these policies aim to foster a fair, innovative, and transparent data ecosystem in Europe. Don’t miss the footnotes for extra insights and fun facts!

Intro

Welcome back!

In the previous post, we talked about the EU’s coordinated AI plan. I personally learned a lot, and from your feedback, it seems like you enjoyed it too!

A bit after it was published, I was approached with the ~~obligation~~ opportunity to learn more about how data is used in AI systems according to the EU. So, this post will do exactly that. Specifically, we’ll talk about key EU regulations on data and how they shape AI development. I struggled a bit on how to present the information, but in the end, I decided that the best way is to tell you the story of how these data-related activities were created and why they were necessary. For this reason, we’ll cover the plans in chronological order:

TLDR: This post covers the evolution of the EU’s data strategy, focusing on key regulations like the Open Data Directive (June 2019), the EU Strategy for Data (Feb 2020), the Data Governance Act (June 2022), and the Data Act (Jan 2024). We’ll explore how these policies impact AI development, data sharing, and the roles of various stakeholders across Europe.

Just so you know, I usually express my opinions and give ~~useless~~ interesting facts in the footnotes, so don’t miss them!

A European Strategy for Data

So, you’ve probably heard the phrase “data is the new oil” (or something similar). Well, the EU definitely agrees, and that’s exactly why they created the European strategy for data. I just finished reading the EU data strategy document, and oh boy. You should know, though, that it’s from February 2020, so it’s a bit outdated and mainly focused on outlining problems and potential directions. Since its release, other documents (which we’ll see later) have addressed these problems in more detail.

What’s still relevant and interesting is the context this document comes from, the data-related problems it identified in the EU, and the initial solutions it hinted at, which we’ll get into more later on.


Context

It’s February 2020. You’ve heard about some new virus with a name like a beer on the news, but you brush it off as the latest panic. ChatGPT isn’t due for another two years, and the AI hype (as we see it now) is still in its infancy. However, data has already been recognized as the new oil¹, and the EU has an interest in ~~innovating~~ regulating in this space.

Joking aside, the EU understood the importance of big data for prediction. It’s no surprise that the more data you have, the better you can predict the future².

And this comes into play in many situations, which we can divide based on who benefits from it:

Business [B]: More data means better understanding of what your customer wants and how to sell to them more effectively. Or, you can forget ethics and just sell user data (looking at you, Meta).
Citizens/Consumers [C]: Take healthcare, for example. More data in this field leads to more accurate diagnoses and improved treatments.
Governments [G]: As a government, you can use data collected from satellites to predict when the next flood will hit and prevent damage to infrastructure. Or, ideally, solve climate change.

Of course, these entities don’t live in a vacuum, separated from one another. And that’s why it’s important to consider their interactions. So, we take the G from government and the B from business, mix them, and get B2B, B2G, G2B, C2B, G2C (I didn’t make this up). How are these entities relevant to the data strategy? What are even the problems of the data strategy? Let’s see.

Problems

The document lists six major problems (well, eight, but two aren’t that interesting in my opinion) related to data. As I mentioned earlier, these problems are divided by who faces them (businesses, citizens, governments, or combinations thereof), and they come with some fascinating insights.

Availability of Data

I’d argue this isn’t a huge problem anymore. In 2020, global data production stood at 2 zettabytes. By 2024, it skyrocketed to 150 zettabytes³ (source): a 75x increase.

Data sharing is often proposed as a solution, but it’s also part of the problem. Specifically, sharing data among:

G2B: This has been a “long-standing EU policy” (since 2003) and is mostly addressed in the Open Data Directive (more on that later). It refers to data produced by the government that can be used in the private sector to enhance decision-making.
B2B: Businesses sharing data! Well, this doesn’t happen much because there aren’t enough economic incentives, and businesses fear losing their competitive edge.
B2G: It would be great to have businesses share insights with governments to improve public policy, but back in 2020, there weren’t enough willing participants to make this work.
G2G: Imagine driving your German car in Spain and getting a speed camera ticket. But since Spain and Germany don’t share vehicle data, you don’t get the fine. Too bad! But seriously, imagine the possibilities in health and personal identification.
B2C: This involves businesses sharing data with consumers directly. One good example is smart home devices. Companies collect data on your energy consumption, but ideally, this data would be shared with you so you can make informed decisions—like finding the best time to run appliances to save on energy bills. However, the problem here is that most businesses are reluctant to make this data easily accessible to consumers, fearing the loss of control or monetization opportunities.
G2C: Here, governments share data directly with citizens. An example might be providing public health data or environmental information, like air quality in your area. The challenge lies in making this data available in a user-friendly format, ensuring citizens can actually benefit from it.

Data interoperability and quality

I have to admit, I wasn’t sure what “interoperability” meant at first. According to the Oxford Dictionary, it’s “the ability of computer systems or software to exchange and make use of information.” Since it’s a mouthful, let’s just call it “interops.” This ties into market fragmentation, a major concern for the EU, which is keen on creating frameworks and standards (see ICT Standardization).

Out of curiosity, I dug deeper⁴ and discovered the EU has been focused on interops since 2011 through Joinup, which later teamed up with ISA and ISA²⁵ in 2014 and 2019, respectively. In 2021, they renamed the initiative Interoperable Europe. The bottom line is data sharing in a pan-European network.

Data Infrastructures

Let’s say you’ve solved the data problem. You have plenty of data, and everyone is sharing it happily between each other (b2c2g2c2g2g or whatever). But where are you actually processing all this data?

You need infrastructures for that, and this has been (or maybe still is?) a problem in the EU. In the previous post, I mentioned EuroHPC JU and AI factories (perhaps these could be part of the solution).

What About the Citizens?

Finally, we have two problems that directly affect citizens. The first is that exercising Article 20 of the GDPR (the one about controlling who uses your data and how) is not really feasible for the average citizen due to a lack of technical tools⁶. The second issue is related to digital literacy, which ties into initiatives like the European Skills Agenda that we also covered in the previous post.

Possible directions

Along with this long list of problems, the document also points toward possible solutions. Keep in mind that most of this stuff is old and more like guidelines, but we’ll discuss how these directions have actually been implemented later in the post. Still, to “spezzare una lancia” (lit. to break a spear, meaning to defend or strike a blow in favor of) for the Commission, we should mention the concept of data spaces. I’ve been hearing about these data spaces for months, and I finally have a clearer idea of what they are. The document states that the Commission will invest 2 billion euros in this direction, and these spaces will follow FAIR principles, ensuring that data is Findable, Accessible, Interoperable, and Reusable.

Further down in the document, you can find nine examples of data spaces in high-impact sectors, similar to what we discussed in the previous post. One example includes intelligent transport systems.

Data Governance Act

The short explanation of the Act doesn’t say much, apart from some vague objectives. So, I had a choice: either read the full document (44 pages, no thank you) or check out this nice summary. I went with the latter.

The whole Act focuses on five key points, each tackling different aspects of data (or information) flow.

Protected Data

First things first, we need to define what kind of data we’re talking about. Is your account on that naughty website considered protected data? Unfortunately for you, not really. So, what is protected data?

The explanation isn’t super clear, but after checking the full document, we can find a few examples of protected data:

Commercially confidential data (point 10): Any data whose disclosure would impact the market position or financial health of a company. Think of something like the prompting techniques for OpenAI’s new model O1.
Data protected by intellectual rights: This covers any data that falls under copyright licenses, trademarks, or patents.
Personal data: The definition of personal data is found in the GDPR (Article 4(1)) and states: “personal data means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, […]”. This could be anything from names and locations to genetic, mental, or economic features.

It’s also important to note that this protected data must be held by public sector bodies, meaning private businesses are exempt in this case.

Now that we know what protected data is, let’s also talk about what isn’t protected. For that, we need to look at the Open Data Directive.

Non-protected data

The Open Data Directive (ODD) (June 20, 2019) predates the Data Governance Act and even the EU’s data strategy. It focuses on the reuse of “open data,” which is essentially data produced by public entities that should benefit society as a whole.

It’s hard to capture the entire concept of open data just by listing its properties, but a few examples make it clear. These include:

Earth observation and environmental data, such as those provided by the European Space Agency (ESA)
Statistics, like anything you’d find in the Eurostat catalogue
Company and ownership information, such as data on registered businesses and their financial records.

So, what characteristics should this data have? It should be open by design. This means the data must be freely accessible online, in a machine-readable and platform-independent format. It should also be accompanied by a license with minimal restrictions⁷ and relevant metadata.
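To make the “open by design” checklist a bit more concrete, here is a minimal sketch of how one might check a dataset record against the properties listed above. Everything in it is my own assumption: the `Dataset` fields, the list of machine-readable formats, and the set of “minimal restriction” licenses are illustrative choices, not something the Directive defines.

```python
from dataclasses import dataclass, field

# Illustrative assumptions, not taken from the Directive's text.
MACHINE_READABLE = {"csv", "json", "xml"}
OPEN_LICENSES = {"CC0", "PDDL", "ODC-BY"}


@dataclass
class Dataset:
    url: str
    file_format: str
    license: str
    metadata: dict = field(default_factory=dict)


def open_by_design_issues(ds: Dataset) -> list[str]:
    """Return the open-by-design properties this record fails to meet."""
    issues = []
    # Crude proxy for "freely accessible online": the record has a web URL.
    if not ds.url.startswith(("http://", "https://")):
        issues.append("not accessible online")
    if ds.file_format.lower() not in MACHINE_READABLE:
        issues.append("not machine-readable")
    if ds.license not in OPEN_LICENSES:
        issues.append("license too restrictive or unknown")
    if not ds.metadata:
        issues.append("missing metadata")
    return issues


# Hypothetical records for illustration only.
good = Dataset("https://example.org/air-quality.csv", "csv", "CC0",
               {"title": "Air quality", "updated": "2024-01-01"})
bad = Dataset("https://example.org/report.pdf", "pdf", "All rights reserved")

good_issues = open_by_design_issues(good)
bad_issues = open_by_design_issues(bad)
```

The point of returning a list of issues rather than a yes/no answer is that a publisher can see exactly which property is missing.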

Metadata is of particular interest to me, so I looked into what they might mean. The only explanation I found was “metadata, at the best level of precision and granularity, in a format that ensures interoperability”. This feels like saying “it has to be good enough for things to work”. They do give an example, mentioning that “spatial information” should comply with Directive 2007/2/EC, also known as INSPIRE. Apparently, INSPIRE is a fully developed project, and I went down a rabbit hole reading about it. You don’t need to know all the details to understand how spatial data is defined, but for the curious reader, check out the next footnote⁸.

In terms of metadata, INSPIRE defines XML schemas as the standard format for spatial data, with specific specifications for different types of data. For instance, a disease can have “gender” as an attribute when describing a population.

How to Reuse

By now, we’re familiar with the concept of protected data and how it differs from public data. So, let’s see how the EU wants to promote data reuse while still protecting the privacy of the individuals involved. Most of the rules are on the public sector side, with the Commission setting out guidelines for how they should encourage the sharing of the protected data they already possess. Here are some of the more interesting rules:

(Reasonable) fees and timelines: The public sector can charge fees⁹ when you request data, and they have up to two months to decide on your request.
Assistance: If you ask for data you can’t access, the public sector is responsible for helping you contact the owner of that data so you can request it directly. Additionally, the EU has created the European register for protected data held by the public sector (ERPD) to help you find out who holds the data you’re looking for.
Technical requirements: This is the most interesting part. The Commission says member states “need to be technically equipped to ensure that the privacy and confidentiality of data is fully respected in reuse situations”. They mention techniques like anonymization, secure processing environments (e.g., data rooms¹⁰), or bilateral confidentiality agreements between the parties involved.
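Since the technical requirements are the interesting part, here is a minimal sketch of what pseudonymizing a record before reuse could look like. This is one common approach (a keyed hash via HMAC-SHA256), not something the Act prescribes; the field names, the sample records, and the secret key are all hypothetical. Note that this is pseudonymization, not anonymization: whoever holds the key can still link pseudonyms back to people.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to stay with the public sector body.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace a direct identifier with a keyed hash (a pseudonym)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_for_reuse(records: list[dict]) -> list[dict]:
    """Drop the name, pseudonymize the ID, keep the non-identifying fields."""
    return [
        {
            "pseudonym": pseudonymize(record["citizen_id"]),
            "city": record["city"],
            "energy_kwh": record["energy_kwh"],
        }
        for record in records
    ]


# Made-up records for illustration only.
records = [
    {"citizen_id": "IT-001", "name": "Mario Rossi", "city": "Rome", "energy_kwh": 310},
    {"citizen_id": "IT-001", "name": "Mario Rossi", "city": "Rome", "energy_kwh": 295},
    {"citizen_id": "IT-002", "name": "Anna Bianchi", "city": "Milan", "energy_kwh": 280},
]
shared = prepare_for_reuse(records)
```

The same citizen always maps to the same pseudonym, so the recipient can still study patterns per person without ever seeing the name or the original ID.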

B2B Data Sharing

As mentioned in the EU strategy for data, one of the reasons businesses don’t share data with each other is the fear of losing competitive advantages. This is especially true if the company providing the data isn’t sure how securely the recipient will handle it. To address this, the Commission has outlined rules for potential “intermediation services.” These services would be responsible for facilitating B2B data sharing.

To better understand why we even need data from other businesses, let’s say you’re the CEO of a vehicle-sharing startup in Madrid, Spain. You’ve done your research and know the city is in desperate need of electric monocycles, but you’re unsure where to place them to avoid competing with other companies. Now, imagine you could hop on a website and see that Lime is offering their usage data for €4.99. That would be amazing! You’d find out that the best place for your monocycles is near the Carampa Circus School.

Connecting you with this dataset is exactly what these intermediation services are about, and the EU wants to regulate them (naturally). These services must remain as neutral as possible in handling the data. This prevents conflicts of interest, such as a service owning a large share of Lime and trying to block your monocycle business from competing in Madrid. These services are also required to provide metadata that “improve(s) the data intermediation service.” If you play by the rules and respect all these regulations, the EU will give you a nice sticker. So far, there are 11 certified “good boys” in 4 countries (most of them are in France).

Data Altruism

Lovely name, questionable results. Here the EU is basically saying that if you’re kind-hearted and want to share your data for free, you’ll have to agree to a bunch of rules. First, you must be a nonprofit organization, and then you need to comply with an entire rulebook (which I couldn’t find). Once you meet the requirements, you can register as a data altruism organization and receive an even nicer sticker. Shoutout to the only registered organization so far: the Associació Dades pel Benestar Planetari from Spain.

European Data Innovation Board

Unsure how to navigate all these regulations? (Trust me, you’re not alone.) The EU has you covered with the newly created European Data Innovation Board (EDIB).

The EDIB is a group of experts whose goal is to define cross-sector standards and ensure interoperability for data across the EU. Honestly, I couldn’t find much more about this board. Who are the members? What have they accomplished in the past year? Who knows…

International data flows

The final point in the Data Governance Act addresses international data flows. You may have already heard that the EU enforces strict data privacy regulations on other countries handling EU citizens’ data. For example, the Data Privacy Framework is an EU-US agreement that regulates how EU data must be handled in the US. Interestingly, this often gives EU citizens more rights over their data in the US than American citizens themselves¹¹.

The Data Governance Act extends the GDPR and aims to ensure that the handling of non-personal data in third countries is subject to the same protections. This usually takes the form of international agreements.

Recap

So far, the aim of the Data Governance Act is pretty straightforward: to regulate the flow of protected information within the EU. Data flows can occur between any entities (B2B, G2B, C2B, etc.) and must adhere to certain standards (fairness, transparency, privacy protection). The goal is to build trust in voluntary data-sharing mechanisms. That’s great! But what about non-voluntary applications? Is there a legal framework that regulates data access and use in general? Yes! That’s the Data Act, which we’ll cover in the next section.

Data Act

So far, we’ve covered various initiatives aimed at regulating data flow within (and outside) the EU. However, most of what we’ve seen has just been an introduction to the real juicy part: the Data Act.

In the following section, I’ll mainly refer to this great overview of the Data Act, but I’ll also occasionally consult the full document for additional insights. Let’s dive in!

Context

Here’s the timeline so far:

June 2019: The Open Data Directive was introduced to regulate how public data should be made open for societal benefit.
February 2020: The EU set the stage for its data strategy, outlining key problems and identifying possible solutions.
June 2022: Two years later, the Data Governance Act was released, the first document focused on how data flow should be managed in the EU. It established some standards and set up registries and boards. However, this Act primarily centered around voluntary data sharing (including for-profit sharing in B2B contexts) without laying out clear legal boundaries.

Fast forward to today, and the EU clearly needs proper guidelines for data sharing between all levels: Governments, Businesses, and Consumers/Citizens. The Data Act consists of 9 chapters, each focusing on a specific aspect of data sharing.

Chapter I

This is just an introduction (also known as General Provisions). It gives an overview of the upcoming chapters and defines some key terms. Personally, I prefer defining terms as they come up rather than listing them here. So, let’s skip this one.

Chapter II: B2C data sharing

Chapter II “catches two pigeons with one fava bean”. This chapter forces businesses to allow customers to access their data easily and at no cost. It also allows users to share their data with other businesses. How is this helpful? Let’s take a look.

It’s Your Data!

As the title suggests, it’s your data, so you should be in control. I strongly agree with this principle, and I’m happy to see it being implemented. Specifically, the Act says that as a user, you are entitled to know what data you’re generating, how it’s being used, and to access it through a simple and free process.

For example, imagine you have a smart thermostat that collects data on your energy usage. Under the Data Act, you have the right to access that data, understand your consumption patterns, compare energy providers, or even share it with a third-party service that optimizes energy efficiency.
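To show why getting this data matters, here is a minimal sketch of the “compare energy providers” idea. The readings, the two providers, and their prices are all made up for illustration; the only assumption is that you can export your consumption as (hour, kWh) pairs.

```python
# Hypothetical export from a smart thermostat: (hour of day, kWh used).
readings = [(7, 1.2), (8, 0.9), (13, 0.4), (19, 1.5), (23, 0.6)]


def flat_tariff(hour: int) -> float:
    """Made-up provider charging the same price per kWh all day."""
    return 0.30


def night_tariff(hour: int) -> float:
    """Made-up provider that is cheaper at night (22:00 to 08:00)."""
    return 0.20 if hour >= 22 or hour < 8 else 0.35


def total_cost(readings, tariff) -> float:
    """Price each reading with the tariff that applies at that hour."""
    return sum(kwh * tariff(hour) for hour, kwh in readings)


costs = {
    "FlatCo": total_cost(readings, flat_tariff),
    "NightOwl": total_cost(readings, night_tariff),
}
cheapest = min(costs, key=costs.get)
print(costs, "->", cheapest)
```

Without access to your own hourly data, you could only compare providers on their advertised prices, not on how you actually consume energy.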

However, let’s clarify what kind of data falls under these rules. The Act covers “all raw and pre-processed data generated from the use of a […] service” both personal and non-personal. The keyword here is raw. The Act provides a helpful example: “For example, if a user watches a film on their connected TV, the film itself is not within scope but data on the brightness of the screen is within scope.”

Fostering Innovation

In the thermostat example above, we stumbled upon an entirely new business idea, right? A platform that allows you to compare energy providers would benefit from access to your consumption patterns. So, by simply sharing data, new businesses can emerge, and that’s the second catch!

By requiring businesses to share some of the data they collect, the EU is creating opportunities for new businesses to emerge. Of course, if you end up using that data, you can’t directly compete with the business that provided it (the data holder). Instead, you can focus on related or aftermarket services¹². Also, the data holder is not required to share any data that would reveal trade secrets (understandable).

Chapter III: B2B Data-sharing

While the earlier rules focused on users, here we switch to B2B actions. The aim is to promote fairness and prevent giants from using their privileged positions and vast amounts of data to dominate the market.

The regulation involves all kinds of data mentioned in Chapter II and states that data holders are obligated to make data available to other businesses (referred to as “data recipients”) under appropriate remuneration¹³. The catch here is that this obligation isn’t directly enforced by the Act itself. Instead, member states must pass legislation to make these obligations legally binding.

For example, if tomorrow Italy passes a law requiring OpenAI to share user-chat history with other businesses, OpenAI couldn’t refuse. So, it’s essential to understand that Chapter III provides a framework for creating laws rather than a law itself.

Chapter II vs III

What’s the difference between Chapter II and III? We’ve already seen that new businesses can emerge based on user data, so what’s new here?

User requests vs legal obligations

First of all, Chapter II deals with user requests for data. So, you can go ask Google for your location data, get it, and then run it through the Location History Visualizer to create a nice visualization. You’re essentially transferring data from one source to another.

Chapter III, on the other hand, requires companies to share certain types of data with other businesses. For example, a rideshare service like Uber collects real-time traffic data as part of its normal operations. Under Chapter III, this traffic data could be shared with other businesses, like a company developing smart navigation tools for logistics services. Unlike Chapter II, this isn’t about a user requesting data; it’s one business making data available to another under fair compensation rules. This helps foster innovation while preventing monopolies from hoarding data.

Data scope

When you request your own data from a business, you’re limited to the data you helped co-create through your interactions with the service. Chapter III, however, refers to any kind of data, meaning all the data generated by interactions from all users.

Chapter IV: Unfair Contractual Terms

This chapter builds on the previous one. Let’s say a law was passed requiring companies to share user interactions with their chatbots (properly anonymized, of course). What’s to stop the data holder from imposing restrictive terms on how their data can be used? This is where Chapter IV comes in. It actually provides a detailed list of unfair terms. Among them, we find:

Exclusion of liability for intentional acts or gross negligence: In the chatbot company example, this would be like the company adding a clause that exempts them from liability if the data they provide is manipulated or completely made up.
Exclusion of remedies for non-performance: Similar to the first point, but here the data quality is poor, and the company avoids being held accountable.
Exclusive right to interpret the contract or data conformity: This would be like the company arbitrarily deciding whether the data they gave you meets the agreed-upon standards.

The Act also lists presumed unfair terms, which I won’t cover here, but they generally aim to prevent power moves by bigger players¹⁴.

Alongside these “forbidden” terms, the Act discourages any “unfair” terms. These are unilateral, take-it-or-leave-it terms. If a company includes them in a contract, an unfairness test can be requested by the appropriate EU authorities.

Chapter V: B2G

This chapter is different. It regulates the flow of data from businesses to governments, both in emergency and non-emergency cases. Since this touches on a topic I’m particularly interested in (increased citizen surveillance during emergencies), I went ahead and read through the entire document.

Should the EU Have Your Data During Emergencies?

If you’ve read my previous blog post, you already know I’m biased on this topic. History shows that measures adopted during emergencies tend to stick around even after the emergency ends¹⁵. So, I’m not thrilled with the idea of forcing big businesses to share their data for free in emergency situations.

That said, there are several articles and points in this chapter that seem designed to prevent a hypothetical dictatorial EU from abusing this law to spy on its citizens. For example:

Article 17(1.g) states that when personal data is requested, technical measures like pseudonymization or anonymization can be applied by the data holder before handing over the data. However, Article 18(4) says that if anonymization hinders the intended use of the data, then pseudonymization is allowed instead.
Article 19(1.c) states that data must be deleted once it is no longer needed (unless archiving is required, lol).

While I’m still not a fan of these use cases, I get why accessing this kind of data might be necessary in some situations. Forget the usual suspects like terrorist attacks and illegal activities; take, for example, natural disaster response. Google and Apple shared anonymized location data during the COVID-19 pandemic to help public health officials track the spread of the virus and understand movement patterns.

What about non-emergencies?

In non-emergency cases, the government body requesting the data cannot ask for personal data, and businesses can demand remuneration (but not exceeding the technical and organizational costs). This is useful in cases where a city wants to optimize traffic (Rome, please take note) by using data from businesses like Uber or Ecooltra.

Chapter VI: Switching Between Data Processing Services

Honestly, this chapter would fit better after Chapter II because it still revolves around customers. The only reason I’m sticking to this structure is my obsession with ascending numerals.

Anyway, this is a simple one. It mandates that service providers make it easy for customers to switch to another provider. This includes transferring customer data from one service to another within a maximum period of 2 months. After the switch, the original provider must also delete all customer data. I love this! It gives providers a clear incentive to retain their users: if they don’t, they lose revenue and data.

And starting in January 2027, switching will be free. For now, they can only charge for the necessary operations involved in switching (I wonder what the cap is).

Chapter VII: Unlawful Third-Country Government Access

This is very similar to what was discussed in the Data Governance Act, particularly in the section on international data flows. The goal is to control how EU data is handled by service providers outside the EU. This chapter focuses on non-personal data, while the DGA covers all types of data.

Chapter VIII: Interoperability

Finally, we’ve arrived at the part I’ve been waiting for. This chapter finally mentions EU data spaces, and it’s primarily focused on creating a European Single Market.

Basically, the short document didn’t provide much information about standardization, so I told myself, “I’ll just read the full document and report back.” Well, it turns out that the full documentation is also vague.

We have to remember that the Data Act is not really a legislative document but more of a preliminary step. It lays out a framework for creating laws later on. I’m just telling myself this to stay optimistic because it means I’ll have to do more digging before wrapping up this blog post. Ranting aside, this whole chapter can be summed up with the sentence: “We need standards, we don’t have them yet, but when we do, the EDIB will help.”

Common European Data Spaces

Since I was disappointed by the lack of mention of the Common European Data Spaces (CEDS), I looked online and found this short description. It gives a practical explanation of CEDS: they aim to create secure, trustworthy environments where data can be pooled, accessed, and shared in real-time across sectors like healthcare, agriculture, and energy. The goal isn’t just to put data online but to ensure it’s reused under clear rules, aligning with EU values like data sovereignty and privacy protection.

Since the document also referenced the Second Staff Working Document, I read that too.

The data spaces rely on interoperability and standardized protocols for data exchange. Although the sources don’t explicitly detail the use of federated approaches, hints suggest this might be part of the infrastructure. For example, the Data Act supports a distributed approach, allowing users to control and share data without needing a centralized repository. The Simpl platform, an open-source middleware, is being developed to support this interoperability.

Federated learning is also a key technique, especially in health data spaces, where data stays localized, but models can be trained across multiple datasets without moving sensitive information. This approach supports privacy while enabling innovation.
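Since “models can be trained without moving the data” sounds a bit like magic, here is a toy sketch of federated averaging (FedAvg), one well-known way to do federated learning. I’m not claiming this is what the health data spaces actually run: the two “hospitals”, their data points, and the tiny linear model are all invented for illustration.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Fit y = w*x + b on one site's private data with plain gradient descent."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Combine local models, weighting each site by its number of samples."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


# Invented private datasets, both following y = 2x + 1. They never leave their site.
hospitals = [
    [(0.0, 1.0), (0.5, 2.0)],
    [(0.25, 1.5), (0.75, 2.5), (1.0, 3.0)],
]

global_model = (0.0, 0.0)
for _ in range(100):  # communication rounds
    # Each site trains locally and sends back only its model parameters.
    updates = [local_update(global_model, data) for data in hospitals]
    global_model = federated_average(updates, [len(d) for d in hospitals])

w, b = global_model
print(f"w={w:.3f}, b={b:.3f}")
```

The only things that cross the network are the two numbers `w` and `b`; the raw (x, y) records stay where they were collected. Real deployments add a lot on top of this (secure aggregation, differential privacy, and so on), which this sketch ignores.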

Although there’s no exact list of current users, the main participants are likely to include:

Businesses: Large and small companies can use the data spaces to innovate and optimize their operations.
Public Sector: Government bodies can use data to improve services like smart cities and transport management.
Researchers: Academics will benefit from access to data for scientific advancements.
Citizens: Individuals won’t directly access these spaces but will benefit from improved services and products.

Chapter IX: Enforcement and Overarching Provisions

In simple terms, this chapter says that each member state must appoint a data coordinator from among the existing authorities. The coordinator’s job is to manage and coordinate both public authorities and businesses in applying the laws surrounding the Data Act.

They also reference the European Data Innovation Board, which we discussed previously.

Summary

Let’s piece everything together and summarize the Data Act.

We’ve seen how the Data Act emphasizes access and user rights, complementing the Data Governance Act (DGA), which focuses on creating a trustworthy environment for voluntary data sharing. We’ve also observed how some chapters in the Data Act extend concepts covered in the DGA (such as Chapters VII, VIII, IX, and parts of Chapter III).

The juicy details, though, are in Chapters II, III, IV, and V. I particularly like the focus on fairness and user empowerment. I believe that by enabling data to flow more easily in the ways we’ve explored, the Act will boost innovation, or at the very least make it easier for aftermarket businesses to emerge.

Conclusion

So, what’s the takeaway from all of this?

The EU’s data strategy is ambitious, and the Data Act represents a bold attempt to balance innovation with fairness and user empowerment. The shift towards making data more accessible (whether it’s for businesses, governments, or individuals) is an exciting development.

Key Takeaways

Empowering Users: The ability for individuals to control and share their data with third parties (like in the smart thermostat example).
Boosting Innovation: By forcing businesses to share some of their data (under fair terms), the EU opens the door for new startups and aftermarket services.
Regulating Fairness: Chapters on B2B and B2G data sharing are designed to prevent monopolies and ensure that data isn’t hoarded by industry giants.
Data Spaces: The vision for Common European Data Spaces shows a real effort to create sector-specific hubs for innovation. However, there’s still a lot of ambiguity around how these spaces will function in practice—whether federated learning and interoperability will be enough to solve the challenges ahead remains to be seen.

Personal Thoughts

I have to admit, while the EU’s vision for data-driven innovation is commendable, there’s a lot of “we’ll figure it out later” in the documents. The lack of clear technical details (especially around the architecture and implementation of data spaces) makes it hard to predict how successful this will be in practice. We engineers like to see blueprints, not just grand ideas!

That said, I’m optimistic that once the wrinkles are ironed out, the Data Act and the broader EU data strategy will lead to a more transparent, innovative, and fair digital ecosystem. The road ahead might be full of technical challenges, but if the EU sticks to its principles, we could see a real transformation in how data is shared and used across sectors.

Notes

Clive Humby famously stated “Data is the new oil” all the way back in 2006.
Shout out to Asimov’s Foundation series and the concept of Psychohistory.
If you heard about the dead internet theory you might argue most of this new data is AI-generated. While I don’t buy into the conspiracy, a recent study found that 75% of web data is machine-translated.
I dug into this because, since joining Fraunhofer (only 7 months ago, though it feels like a lifetime), I’ve constantly heard about this huge Gaia-X project. After some searching, I finally found a mention of it in the Open Source Observatory (OSOR) collection on the Joinup website. It doesn’t say much, but I’m happy I managed to locate it within the broader EU schemes. The project is also mentioned in the 2024 Rolling Plan for ICT standardisation (under related standardisation activities, section c.1) and in the EU Data Strategy document itself!
Short for Interoperability solutions for public administrations, businesses and citizens.
Shout out to companies like Incogni that have made removing your data from data brokers their business model.
Here they mention a “standard license” (Article 8) but don’t give examples. Reading the description, it seems they refer to something like CC0, PDDL, ODC-BY or MIT.
So you clicked, huh? Ok, if you’re that interested, who am I to let you down? INSPIRE is an EU directive, and later a project, started in May 2007 aiming “to create a common spatial data infrastructure for the purposes of EU environmental policies and policies or activities which may have an impact on the environment”. Basically, the project: (i) defines best practices for storing various types of metadata; (ii) provides data models for spatial data, i.e. ways to define specific data such as house addresses in a defined way; (iii) offers technical guidelines for a variety of spatial features (e.g. elevation); (iv) includes schemas in XML; and (v) even provides a metadata validator to check if your metadata conforms to standards.
They also suggest that fees should be low or nonexistent for research purposes or SMEs.
It’s not the first time I’ve come across the concept of data rooms. It’s one of the key ideas behind the Gaia-X framework.
Check out this article for a quick overview of how the US and EU differ on privacy.
Aftermarket services include sales, accessories, services, and enhancements that come after the product’s sale (source).
For SMEs and nonprofits, the remuneration must not exceed the costs incurred in making the data available!
The “greylist” terms in the Data Act aim to stop abusive practices in data-sharing contracts, ensuring fairness by protecting weaker parties from excessive liability, unreasonable contract terms, and restrictions that prevent them from accessing or using their data. The burden of proving the fairness of these terms falls on the stronger party imposing them.
Notable examples include the USA PATRIOT Act, UK Regulation of Investigatory Powers Act, Emergency Surveillance Laws Post-November 2015 Attacks (France), Australia’s Telecommunications (Interception and Access) Amendment (Data Retention) Act.

Europe’s AI Research Agenda: Coordinated Plan and Horizon Projects Explained

Nicolo' Brandizzi — Sun, 13 Oct 2024 00:00:00 GMT

Intro

In the previous post we figured out that the Horizon Europe project contains some redundancies in AI-related projects. My next step was to figure out a fix to this problem, but in order to do so, I had to dig deep into how the EU handles AI projects and agendas. I started off easy, asking ChatGPT to provide a list of EU policies, organizations and anything else related to AI. Of course, the first suggestion was the EU AI Act, which I’m now familiar with, and this just made me feel a bit cocky, thinking I was already an expert. Well, the second suggestion was the Coordinated Plan on Artificial Intelligence (PlanAI). It’s just a plan, how complicated can it be? Well, let’s just say the rabbit hole was deeper than expected.

So, join me in this journey of dead links and never-ending references. In this post (probably the first of a series), we’ll dissect the PlanAI and see how many projects are connected to it. If you prefer a visual approach, I made this nice graph where all these initiatives are connected and can be interacted with. The graph will be continuously updated, so don’t worry if some info is missing right now.

Let’s begin.

Coordinated Plan on Artificial Intelligence

The PlanAI was first published in 2018 and then reviewed in 2021¹, when the EU (and the world) realized the potential impact of AI. The changes between the two versions mainly focus on:

Planning: The 2021 version emphasizes human-centric, sustainable and secure AI. It also aims to achieve EU digital sovereignty.
Funding: This is the most significant difference. The original plan was to invest at least €20 billion by 2020 using the Horizon 2020 and Horizon EU. The new plan increased the target quite a bit, now aiming for €145 billion by 2023 through Horizon EU, the Digital Europe Programme and the Recovery and Resilience Facility.

For everything else, the new PlanAI is based on four policies reflecting key objectives in AI.

📍 Originally published at nicolobrandizzi.com.

1. Enabling Conditions for AI’s Development and Uptake

Let’s abbreviate it to EnablingAI. Here, the aim is to actually have the resources necessary to make AI a reality in the EU. This translates into launching two new initiatives: AI Factories and GenAI4EU.

AI Factories

Announced in January 2024, the AI Factories will come to life in 2025 and are currently just an amendment to the European High Performance Computing Joint Undertaking (EuroHPC JU). So, for now, we only have a vague description of what a factory will look like. Specifically, it will include a mix of hardware (supercomputers and data centers), “human capital” (I guess they mean experts from academia and research institutes), and startups/SMEs.

But how? Well, the Commission and member states will invest an additional €2.1 billion to acquire more computing power. On top of that, the Commission will provide financial support for startups (at least €100 million) with a cap of €1 billion coming from InvestEU.

GenAI4EU

Honestly, I couldn’t find much on what these are supposed to be. The description states it “aims to support the development of novel use cases and emerging applications in Europe’s 14 industrial ecosystems, as well as the public sector. Application areas include robotics, health, biotech, manufacturing, mobility, climate and virtual worlds.” What I could find is that the Commission is planning to use the Horizon Europe and Digital Europe programmes to fund projects connected to industry applications of GenAI (in the above fields) for a total of €500 million². These initiatives will also cooperate with the AI Factories and the Common European Data Spaces. I guess the former will provide compute/experts and the latter data. Finally, the EU AI Office, together with the European Transition Pathways Platform, will monitor GenAI4EU’s progress.

2. Thriving From the Lab to the Market

While point 1 focused on the material necessities to build AI (mainly computing resources and experts), the second pillar (which we’ll abbreviate as aiLabMarket) focuses on fostering Public-Private Partnerships (PPP). Specifically, the Commission will provide €1 billion per year in AI investments through our friends Horizon Europe and Digital Europe. Another €20 billion will be provided by member states and the private sector. Finally, the whole initiative can utilize the Recovery and Resilience Facility (RRF) funds³, which amount to €134 billion.

Of course, to do so, you need to intervene in various aspects:

The actual coordination between experts (academia) and private sector: Enter The AI, Data and Robotic Association (ADRA) (more about it later).
Support for smaller players: If you’re a small player with a big idea, you can hop on the AI-on-demand platform (https://www.ai4europe.eu/) and check what available resources (algorithms, data, and computing power) you can use to achieve your goals!
Regional-level support: The European Digital Innovation Hubs (https://european-digital-innovation-hubs.ec.europa.eu/home) can help at a regional level.
How do you coordinate between experts and private sector if you have no experts? You need a network of (academic) organizations that foster AI research of course. That’s why the European Networks of Excellence in AI (NoEs) exist.
Ok, you have the experts and the private partners. They come up with a nice idea and implement it. Will it be in line with the never-ending, always-changing EU regulations? Why don’t you try your idea out in the newly created testing and experimentation facilities (TEFs)!

In the next parts, we will explore each of these points in detail.

AI, Data and Robotics Association (ADRA)

As mentioned before, ADRA is an association with the aim of promoting PPP by accessing €2.6 billion from the Horizon 2020 and Horizon Europe funds (with additional contributions by private partners). It’s called an association, right? So who are the members? Well, the initial ones were:

Hopefully, we’ll have the time and energy to figure out where these organizations come from later. As of now (06/10/2024) there are a total of 140 members: 88 research, 47 industry, 4 associate and 1 strategic (Sweden AI). Not bad, considering the project is 3 years old (created in May 2021) and they managed to get major players on board (e.g. BMW, Decathlon, Bosch…).

Goals

Going back to ADRA’s goals, as stated in the Partnership Proposal they have three general objectives divided into seven specific ones (see the figure; ADRA, please don’t sue me for using your picture).

Apart from the usual suspects (securing European sovereignty), we can see a specific aim to promote the creation of innovative⁵ initiatives⁶ aimed at maximizing societal⁷ and economic⁸ benefits. To me, it seems that ADRA has the most challenging objective: balancing EU values for societal benefits (avoiding risky use cases) with regulatory/market fragmentation and uncertainty, and with companies’ objectives. Good luck.

How?

ADRA’s proposal (Section 3) lists a number of activities in order to achieve its goals.

Basically, these activities can be divided into support for innovation (lighthouses, mission-based challenges, cascade actions, actions to stimulate uptake, and market/innovation enablers) or support for ecosystems (community building, business models/organizations, regulatory/standards and task forces).

Current status

Currently, ADRA has released the fourth edition of the Strategic Research, Innovation, and Deployment Agenda (SRIDA), where they outline “the long-term vision for the development and deployment of trustworthy AI, Data, and Robotics technologies in Europe and provides key recommendations to guide future European work programs.” I highly suggest reading it. ADRA also has a series of forums ongoing, the latest being last November (which I missed!).

AI-on-demand platform (AIoD)

While ADRA actively works on connecting the public and private sectors, it cannot materially help all possible small and medium-sized enterprises (SMEs). For these SMEs, a federated approach is more suitable, meaning that they need to help themselves. How can they do that? How can you, with a great business idea, look around and see what others are offering (algorithms, data, compute, expertise), and use those resources to your advantage? Well, have you heard of the AI-on-demand platform?

Besides having a slow website, the platform hosts several services:

Assets Catalogue, aka MyLibrary, where you can access AI models, datasets, and experiments. In the dataset section, it mainly features content from Hugging Face, while the models seem to come from Bonseyes (a marketplace for AI in cloud and edge) and OpenML (another platform for AI stuff). It also mentions “powered by AI4EOSC”, which stands for Artificial Intelligence for the European Open Science Cloud.
Research Bundles give you a space on the AI on-demand platform to collect and publish the outputs of a small research project in a compact way. A research bundle gathers in one place all the assets (code, data, tutorials, examples, etc.) produced by your project and published on the AIoD platform. Of course, you can also include links to assets published elsewhere, like GitHub or Zenodo.
An AI builder created in partnership with Eclipse and built at Fraunhofer IAIS (wink wink).
The Research and Innovation AI Lab (RAIL), a beta version of what seems to be an education-oriented platform for using AIoD assets.

Speaking of funding, the platform is supported by Horizon 2020, Horizon Europe, and the Digital Europe programme (focused on industry and public administration), but I couldn’t find any specific numbers.

European Digital Innovation Hubs (EDIHs)

So, if you got this far, you now know that if you’re a big company, ADRA will hold your hand through the EU regulations⁹, and if you’re small but independent, you can explore what AIoD has to offer.

But what happens when mamma ADRA has no time for you and you don’t understand what AIoD is all about (I feel you)? In that case, you can turn to your nearest European Digital Innovation Hub (EDIH). EDIHs are regional hubs that promote digitalization for both the public and private sectors. There are currently 150 working in AI, with their objectives defined on the website as: “EDIHs help all companies seeking to use AI technologies to become more competitive on business/production processes, products or services.” If we choose a random one from the catalog, we see that they mostly provide support (check the services). While 50% of their costs are funded by the EU and member states, the remaining portion must be covered by private contributions and regional funding. I like this approach because it forces EDIHs to be active at the regional level, hopefully solving smaller but very real problems.

The website presents several more initiatives, which I’ll briefly list here, and we can see if they come up again:

Digital Transformation Accelerator (DTA): A project that aims to support the creation, development, and growth of the pan-European DIH network, and facilitate intra/inter-regional collaboration between EDIHs. It ended in March 2022.
Digital Maturity Assessment Tool (DMAT): Developed by the European Commission Joint Research Centre (JRC), it’s a framework to measure the digital maturity of EDIH customers across Europe¹⁰.
Key Performance Indicators (KPIs): A reporting tool used by EDIHs to report their progress to the Commission.

Networks of Excellence Centres (NoEs)

Until now, we’ve covered initiatives dedicated to bringing innovations to life with the support of experts. But what happens if there are no experts to rely on?

This is where the Networks of Excellence Centres (NoEs) come into play. Attentive readers may notice that the link for NoEs actually redirects to the Vision4Ai website. Digging around, I found that the VISION project was allocated ~€2 million from Horizon 2020, and it lasted from September 1st, 2020 until August 31st, 2024. Its main objective was to coordinate the NoEs, and it appears they have achieved this goal. At this point, it’s unclear who will take on the role of coordinating the NoEs, but it could also be the case that such coordination is no longer necessary, as the centres are already in contact with each other.

Speaking of centers, these are the current ones:

Start year 2020: AI4media, ELISE¹¹, Humane-AI net, TAILOR
Start year 2022: ELSA, euROBIN
Start year 2023: ELIAS, dAIedge, Enfield

Note: All centres established in 2020 ended in August 2024.

The NoEs report being strongly linked to ADRA and want all the projects to communicate via the AIoD platform.

As mentioned before, the primary focus here is on academia: “fostering research excellence and creating a critical mass of European AI knowledge and talent.”

Testing and Experimentation Facilities (TEFs)

Finally, we’ve reached the last initiative in aiLabMarket, which is related to the last step in the line: testing. For this purpose, the Commission has invested €200 million to create four sector-specific Testing and Experimentation Facilities (TEFs) for AI, specifically for:

Healthcare with TEF-Health
Agri-food with AgrifoodTEF
Manufacturing with AI Matters
Smart Cities & Communities with Citcom.ai

But what exactly are these facilities, and how does testing work? Let’s take a closer look.

We’ll use Citcom as an example since it relates to a field I’m somewhat familiar with (smart cities). Based on their website, they offer several services, ranging from providing access to data to conducting tests in real-life situations (e.g., deploying systems in Belgium). All of these services have a Technology Readiness Level (TRL) threshold of 6 to 8, meaning that your system must at least have been demonstrated in a relevant environment.

Overall, TEFs are not overly complex. They are places where you can bring your system to be tested in a near-final environment. The official EU page also mentions that, after the initial five-year funding, the TEFs are expected “to achieve long-term financial sustainability”, which likely means companies need to pay for these services. Since testing is necessary to deploy your system at the EU level, it could become mandatory for companies looking to bring their products to the European market. The real question is whether these additional testing costs will be passed on to end users.

3. Ensuring That AI Works for People

To recap (I need this more than you): EnablingAI focuses on supporting the creation of “physical” AI infrastructures (e.g., compute centers), while aiLabMarket aims at creating synergies between innovation (e.g., research), the market (SMEs), and the public sector (PPP).
So, what’s left? Well, society!

The third policy pillar is ensuring that AI works for people (which we will abbreviate as aiPeople). This pillar covers all the parts related to: (i) educating and training people on AI (including efforts to prevent experts from fleeing to the US), (ii) maintaining social cohesion, and (iii) making the EU a leader in AI regulation.

To fully understand what these efforts are about, we need to first “partire per la tangente” and introduce the EU digital Strategy|Decade|Programme|Compass (yes, these are all different things, and there are more¹²).

EU Digital Initiatives

Let’s start with the European Digital Strategy (EDS). The EDS is the overarching framework under which all other digital initiatives reside. It was first introduced in 2020 in this press release as a successor to the Digital Agenda and has BIIIG objectives. Among those, the one that I’m most excited for is the “single digital market”, which addresses the fragmentation problem we’ve discussed before.

In March 2021, the Commission proposed the Digital Decade (DigiDec) as a structured policy programme to implement the goals of the EDS. Essentially, the EDS had big plans, and the DigiDec laid out a way to achieve them. However, the DigiDec is still not specific enough; for example, it does not provide metrics to measure whether the goals have been achieved.

Enter the Digital Compass (DigiCompass). If you click on the link for the DigiCompass, you will notice that it is one of those long, detailed EU Commission documents. This is where all the metrics and objectives from the DigiDec are defined. If you have time to spare, you can scroll all the way down to the annex, where you’ll find a table of “cardinal points”, aka objectives. Among those, we finally see goals related to AI, such as:

Increasing the number of employed ICT specialists to at least 20 million, some of whom will focus on AI.
Achieving 20% of global semiconductor production value in Europe, including those for AI factories.
Ensuring that at least 75% of European enterprises have adopted AI.

Ok, all very cool, but where does the money come from? You guessed it, the last piece of the puzzle is the Digital Europe Programme (DigiProg). It has a budget of €7.9 billion and is responsible for financing EDIH, GenAI4EU, and AIoD.

Nurturing Talent and Improving Skills

This point is quite straightforward. The EU aims to develop AI literacy and expertise among its citizens. There are two levels at which they approach this, depending on the level of specialization desired:

Basic Level: Covers AI literacy, i.e. knowing AI exists and how to use it. This is formulated as “teaching professionals in all sector about AI” (as part of the Skills Agenda), hoping they will use AI in their work¹³. It’s also nice to see a focus on integrating AI education in schools and supporting the use of AI to enhance teaching.
Expert Level: Involves creating AI experts, aka PhDs. This is done both by funding the creation of new experts (e.g. Marie Skłodowska-Curie Actions) and by facilitating their mobility across the EU (e.g., through the NoEs we saw before).

Developing a Policy Framework to Ensure Trust in AI Systems

Basically, ensuring that AI doesn’t screw up society. It all started in 2018, when the High-level expert group on artificial intelligence (AI HLEG) wrote the guidelines for trustworthy AI (you can find them here). These guidelines were then expanded upon in the White Paper on AI in 2020, and realistically served as the basis for parts of the EU AI Act. Interestingly, these guidelines also cover measures to adapt liability frameworks for AI and to enhance cybersecurity defenses against attacks¹⁴ (which are bound to become a bigger issue with AI).

Promoting the EU Vision on Sustainable and Trustworthy AI in the World

Unlike the other two pillars, this one looks beyond the EU and aims to establish the EU as a leader in human-centric AI policies. Following the principles found in the Digital Compass, the EU launched the International outreach for human-centric artificial intelligence initiative. Its aim is to involve as many international partners as possible, including the UN, UNESCO, OECD, Council of Europe, G7, and G20, to work together on AI. Specifically, the EU:

Is a founding member of the Global Partnership on AI (GPAI), launched in July 2020.
Collaborates with the Organisation for Economic Co-operation and Development (OECD) through the ONE-AI experts group and the AI Watch.
“Support international standardization bodies in their work to define common standards in the global governance of AI”.

On the last point we need to open a small parenthesis. The EU is not doing this purely out of altruism. The country that manages to set a standard that is later accepted internationally often has an intrinsic advantage over its competitors. That is why the Coordinated Plan for AI (Annex 1, Chapter 10) mentions both the International Organisation for Standardisation (ISO) and the Institute of Electrical and Electronics Engineers (IEEE), which are engaged in a wide range of standardization activities.

4. Build Strategic Leadership in High-Impact Sectors

We finally reached the last point. This one focuses more on the sectors where AI can (and will) have a significant impact. Since it’s pretty straightforward, I’ll list them here and provide some context:

Environment : You probably know about the Green Deal, where the EU aims to become climate neutral by 2050. Here, they mention that AI can help in several ways (e.g., energy/resource efficiency, finding solutions to climate change), but they also acknowledge its environmental footprint¹⁵, thus promoting research and development in green AI initiatives¹⁶.
Health : It’s no surprise that the healthcare system needs to undergo serious changes to survive in the coming decades. In this light, the EU aims to digitize health data with the European Health Data Spaces to use AI to support the whole system.
Robotics : You’ve probably seen the Figure 02 Robot and heard about all the promises regarding “automating service sectors”. That’s the whole point¹⁷.
Public sector : This is something I’m super excited about! We are all painfully aware of the inefficiencies and excessive bureaucracy (looking at you, Deutschland) that plague our systems. The introduction of AI can streamline bureaucratic processes (I sound like ChatGPT now) and potentially enhance democratic systems. The first studies on AI supporting democracy are already looking good (cough cough Multi Agent Systems).
Home affairs : Unlike the previous point, this one makes me a bit uneasy. It discusses the use of AI in law enforcement. They do mention that the AI should never have the final say, but the whole narrative is primarily centered around terrorism. This has often been used as a justification to introduce laws and policies that restrict citizens’ privacy and liberty. Because, of course, no one wants terrorism¹⁸.
Transport : Since moving to Germany, I’ve noticed how the traffic lights work based on how many cars are waiting. I love it. While this is not the AI we often imagine for “smart mobility,” it does contribute to more efficient transport. The actual initiative for transport goes into more detail (covering the aviation, rail, inland waterways and road transport sectors), basically revolving around the Mobility Strategy.
Agriculture : The EU has recognized that “AI-enabled precision farming is estimated to grow and reach EUR 11.8 billion by 2025” and they want a piece of the cake.

To summarize, these initiatives are sector-specific and are also tied to the Testing and Experimentation Facilities we discussed earlier.

Conclusion

And we’re finally done. For you, this might have been a 20-minute read (according to the average reading speed), but for me, it took three weeks of my “free time” (I really should find a new hobby).

Take-Aways: If You Didn’t Read, Here’s the Comprehensive List

If you’re in a rush, here are the key points from this post, along with the insights from digging into the EU AI policies:

Coordinated Plan on Artificial Intelligence (PlanAI):
- Initially published in 2018 and reviewed in 2021 to emphasize human-centric AI, sustainability, and achieving EU digital sovereignty.
- Significant funding increase, aiming for €145 billion by 2023 through various EU initiatives like Horizon Europe and Digital Europe Programme.
AI Factories:
- Announced in January 2024 and expected to come to life in 2025, AI Factories will combine supercomputers, data centers, and human expertise to boost AI development across industries, supported by €2.1 billion in investments.
GenAI4EU:
- Focuses on industrial and public sector AI applications with €500 million in funding, closely tied to AI Factories and Common European Data Spaces for infrastructure support.
Thriving From the Lab to the Market:
- Aiming to foster public-private partnerships (PPP) through €1 billion per year investments, with additional funding from member states and private sectors.
AI-on-Demand (AIoD) Platform:
- A central marketplace for AI assets (datasets, models), research bundles, and services like AI builders and innovation labs.
European Digital Innovation Hubs (EDIHs):
- These regional hubs support digital transformation and AI adoption at the local level, with around 150 hubs currently in operation.
Testing and Experimentation Facilities (TEFs):
- Sector-specific facilities for testing AI applications in real-life environments (e.g., healthcare, agriculture, smart cities), helping startups and SMEs bring their innovations to market.
ADRA (AI, Data, and Robotics Association):
- A critical body managing the collaboration between academia, startups, and industry, driving European AI advancements while balancing societal and economic goals.
Educational and Societal Focus:
- The EU aims to boost AI literacy for professionals and develop new AI experts through PhD programs, while ensuring AI applications align with EU values and benefit society as a whole.
Global AI Leadership:
- The EU is positioning itself as a global leader in human-centric AI through collaborations with international organizations like the UN and OECD, while also setting standards for trustworthy AI development.

The PlanAI is vast and interconnected, with many moving parts. By understanding these frameworks, we get a clearer picture of how the EU intends to achieve digital sovereignty and ensure that AI works for both industry and society.

Personal Considerations

This deep dive has taught me a lot about the EU AI landscape (even though we only covered one plan from four years ago), and I hope it helps you understand just how complex this whole ecosystem is. In my previous post, I showed how the Horizon projects have overlaps and inefficiencies, particularly in AI-related initiatives. My aim here was to summarize how the EU is managing AI, potentially highlight some reasons behind these inefficiencies, and suggest solutions. While I still aim to do that, I was naive to think the answer could come from just a few weeks of research.

Interestingly, the word “fragmentation” appeared repeatedly in many of the documents I read while writing this blog post. The EU is painfully aware that fragmentation is one of its most significant issues, one that could slowly kill the Union if left unaddressed (as pointed out in Mario Draghi’s report). However, it’s clear that this is not an easy problem to solve.

That said, I’m still committed to finding a solution to the inefficiencies within Horizon projects, and I’ll continue investigating how the entire funding process works, particularly for AI initiatives. This research is also essential for working on the CAIRNE proposal for a CERN-like hub for AI, which aims to address these fragmentation issues more comprehensively. In future posts, you can expect more content along these lines, and I’ll also keep updating the AI initiatives graph.

Lastly, while I write these blog posts because I genuinely believe that teaching others is the best way to fully understand a topic, it does take time and effort. So, if you appreciated this post, please let me know by leaving a comment and engaging.

Next in the series: The EU’s Data Strategy dives into how the Data Governance Act and Data Act shape the data ecosystem behind these AI initiatives. And if you want to see how all these entities connect, check out the PolAI Graph.

Notes

You can read the complete review here.
Euractiv article. I can’t access the Science Business one :(
The RRF allocated this €134 billion to projects aiming to digitalize the EU. So, while these funds are not exclusively allocated to AI initiatives, AI initiatives are indeed included.
Previously called CLAIRE.
The term “innovative” here refers to ideas that likely originate from an academic context.
The private sector part of the idea. This refers to guiding the private sector by offering expertise in both technical (previous point) and regulatory areas.
Societal and environmental considerations will need to align with strict EU regulations and the AI Act, which aims to prevent the implementation of risky AI use cases.
The goal for companies is profit, and AI has the potential to generate significant financial returns.
As we discussed before, ADRA does much more, but stick with me for this example.
If you wonder how the DMAT compares to the Technology Readiness Level (more about it later), basically the DMAT is about how ready an organization is to adopt digital transformation, whereas TRL focuses on how ready a particular technology is for implementation.
Fun fact: ELISE used to be called European Learning and Intelligent Systems Excellence (according to the proposal).
I challenged myself to find as many EU initiatives starting with “digital” as possible. Here is a (still incomplete) list: (1) Digital Economy and Society Index, (2) Digital Services Act, (3) Digital Markets Act, (4) Digital Agenda for Europe, (5) Digital Action Day, (6) Digital Euro, (7) Digital Education Action Plan, (8) Digital Opportunity Traineeships.
Part of the work done in the Digital Education Action Plan (Action 8) was to update the Digital Competence Framework to its 2.2 version to include AI.
In the EU Cybersecurity Strategy for the Digital Decade it’s written “Cybersecurity must be integrated into all these digital investments, particularly key technologies like Artificial Intelligence (AI), encryption and quantum computing […]”
Short rant on what the former CEO of Google said a few days ago, namely “that pursuit of AI should take precedence over climate change”. While I do understand his point that AI can drive innovation and ultimately address climate issues, what happens if AI does NOT deliver a solution in time? Are we seriously betting the future of our planet on this hope? Bah.
The complete proposal (Chapter 11 here) suggests a number of cool initiatives, along with R&D efforts. These include the creation of green data spaces and AI-supported digital simulation of the planet through the Destination Earth initiative.
One aspect that was depressing to read is how the document mentions demographic challenges as a reason for needing automation. Sure, automation could help, but you know what else could address this issue? Immigration. It feels like we’re prioritizing robots over people, instead of leveraging human potential. I’d rather see people of diverse backgrounds on the streets than robots.
After the 9/11 terrorist attack, the U.S. introduced the USA PATRIOT Act (lol the name), which expanded the surveillance powers of law enforcement. This later led to controversies like the NSA surveillance exposed by Edward Snowden, showing that these powers were used to monitor U.S. citizens.

Horizon AI Fragmentation: Challenges and Budget Efficiency

Nicolo' Brandizzi — Mon, 23 Sep 2024 00:00:00 GMT

📍 Originally published at nicolobrandizzi.com.

Preface(?)

So, this post will be a bit different from the previous ones. Usually I write (and in general interface with the digital world) with an LLM. The process is something like this:

Me: hey ChatGPT can you write an email for this guy saying…
ChatGPT: Sure, here’s a 10 pages long email you can send to the guy.
Me: ** angry Italian noises **

While this saves me some time, I also feel like I’m losing my writing style and probably wasting away the part of my brain that is responsible for this task (cough cough navigation skills).

So this post is different. I wrote it from scratch and then asked ChatGPT to fix things here and there. Hope you like it!

Introduction: The Fragmentation of AI Research in Europe

You would think that with all the recent talk about Europe aiming to be a global leader in AI, we would see a smooth, coordinated strategy when it comes to AI research and deployment. But the reality paints a different picture, one full of fragmentation and inefficiencies¹. Despite the significant resources and funding from programs like Horizon Europe, the AI research landscape feels scattered, with projects overlapping and efforts being duplicated. The result? Slower progress, wasted resources, and a European AI community that struggles to keep up with global players like the US and China².

But is Horizon really that inefficient? To answer that, I downloaded the list of funded projects from the CORDIS EU Research Projects database and did some analysis myself³. In the next section, we’ll explore what the funding landscape actually looks like, offering insights and opinions. For a quick summary of the results, you can jump to the Conclusion.

Analysis of Horizon Budget Allocation

To get it straight, our aim is to demonstrate that many Artificial Intelligence (AI)-related projects funded under the Horizon Europe program are similar in scope and objectives. As I mentioned earlier, the dataset comes from the CORDIS EU Research Projects database and contains information on 13,674 projects funded between 2021 and 2027.

Each project is represented as a JSON object, which looks something like this (randomly chosen, of course /s):

{
    "acronym": "TrustLLM",
    "contentUpdateDate": "2023-11-03 12:04:45",
    "ecMaxContribution": 6929701,
    "ecSignatureDate": "2023-10-20",
    "endDate": "2026-10-31",
    "frameworkProgramme": "HORIZON",
    "fundingScheme": "HORIZON-RIA",
    "grantDoi": "10.3030/101135671",
    "id": 101135671,
    "legalBasis": "HORIZON.2.4",
    "masterCall": "HORIZON-CL4-2023-HUMAN-01-CNECT",
    "nature": "",
    "objective": "The TrustLLM project will develop European [...]",
    "rcn": 257895,
    "startDate": "2023-11-01",
    "status": "SIGNED",
    "subCall": "HORIZON-CL4-2023-HUMAN-01-CNECT",
    "title": "Democratize Trustworthy and Efficient Large Language Model Technology for Europe",
    "topics": "HORIZON-CL4-2023-HUMAN-01-03",
    "totalCost": 6929702.5
}

Filtering AI Projects

The first step is to look for projects related to AI. To do that, I compiled a list of AI buzzwords using TELUS Digital’s list of beginner AI terms. After several rounds of refinement to remove overly specific terms, we have the following list of words to filter projects based on their presence in either the project title or objectives:

AI buzzwords: machine learning, artificial intelligence, ai, data analysis, neural network, deep learning, robotics, big data, robot, natural language processing, computer vision, reinforcement learning, data governance, pattern recognition, intelligent control, object recognition, autonomous car, computational linguistics, supercomputer, image recognition, expert system, supervised learning, autonomous robots, speech recognition, predictive modeling, swarm intelligence, emotion recognition, decision theory, evolutionary algorithms, nlp, bayesian network, automated reasoning, machine perception, genetic algorithms, probabilistic reasoning, intelligent agent, llms, hidden markov model, markov decision process

Using these, we can find 2,065 AI-related projects.
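For the curious, here is a minimal sketch of how such a keyword filter could look. It is not the exact code from the notebook: the shortened term list, the `is_ai_project` helper, the sample projects, and the choice to match on word boundaries are all assumptions made for illustration.

```python
import re

# Shortened, illustrative subset of the buzzword list above.
AI_TERMS = ["machine learning", "artificial intelligence", "ai", "nlp", "robot"]

# Word boundaries keep the short term "ai" from matching inside
# unrelated words such as "aim", "rail" or "repair".
AI_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in AI_TERMS) + r")\b",
    flags=re.IGNORECASE,
)


def is_ai_project(project: dict) -> bool:
    """Return True if the title or objective mentions any AI term."""
    text = f"{project.get('title', '')} {project.get('objective', '')}"
    return AI_PATTERN.search(text) is not None


# Two toy records; the objectives are made up for this example.
projects = [
    {"title": "Democratize Trustworthy and Efficient Large Language Model Technology for Europe",
     "objective": "Open AI models for European languages."},
    {"title": "Future rail maintenance",
     "objective": "We aim to repair tracks."},
]

ai_projects = [p for p in projects if is_ai_project(p)]
print(len(ai_projects), "AI-related project(s) found")
```

Without the word boundaries, the second project would be flagged too, simply because "aim", "rail" and "repair" all contain the letters "ai".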

Interestingly, the 10 most common AI words in these projects were:

Machine Learning: 726
Artificial Intelligence: 373
AI: 309
Data Analysis: 118
Neural Network: 118
Deep Learning: 86
Robotics: 84
Big Data: 72
Robot: 27
Natural Language Processing: 25

Most of the words are fairly general, such as ML and AI, but we can also spot more specific fields, like NLP and data-related terms.

Time Analysis

First things first, let’s see how these projects are distributed in terms of their start (startDate), end (endDate) and duration.

Project Duration

A quick analysis of project durations reveals an average duration of 41 months (3.4 years) ± 12 months. The shortest project lasts 6 months, while the longest stretches to 85 months (more than 7 years)!
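As a rough sketch of how these durations can be derived from the startDate and endDate fields: the `duration_months` helper and the average month length of 30.4375 days are my assumptions here, and apart from the TrustLLM dates shown earlier the sample dates are invented.

```python
from datetime import date
from statistics import mean, stdev


def duration_months(start: str, end: str) -> int:
    """Approximate duration in months between two ISO dates.

    Uses an average month length of 30.4375 days (365.25 / 12),
    which is an approximation chosen for this sketch.
    """
    days = (date.fromisoformat(end) - date.fromisoformat(start)).days
    return round(days / 30.4375)


# (startDate, endDate) pairs: the first is TrustLLM, the others are made up.
date_pairs = [
    ("2023-11-01", "2026-10-31"),
    ("2022-01-01", "2022-06-30"),
    ("2021-06-01", "2028-06-30"),
]

durations = [duration_months(start, end) for start, end in date_pairs]
print("durations (months):", durations)
print("mean:", round(mean(durations), 1), "std:", round(stdev(durations), 1))
```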

Cool, but if we look at how many projects last a particular number of years, we get a clearer picture:

The majority of projects are under 5 years in duration, with the 50th percentile below 3 years and the 70th percentile below 4 years. This suggests that many projects may be too short to achieve meaningful results, given the complexity of (AI) research.

Project Start and End Dates

Since this is supposed to be a nice and entertaining blog post about a (possibly) boring topic, I also wanted to show some nice violin plots to illustrate how the start and end year influence the duration.

While these may not serve much beyond making the post look nice, we can clearly see a positive correlation between the duration and the end year (duh!). More interestingly, projects starting in 2021 have a more homogenous duration (not surprising, as they are the ones lasting the longest), while all others have more defined durations (mostly multiples of 12 months). Finally, projects starting in 2025 tend to have a more limited duration of 1-2 years.

If we want to explore something more interesting, we can check, like before, how many projects start and end each year:

What we can see is the last project is scheduled to finish 10 years after the start of Horizon Europe. Given the rapid pace of AI advancements, projects with such long durations and a fixed project scope may risk becoming outdated. Additionally, more than half of the projects will conclude after 2026. What happens next? More Horizon funding cycles?

Budget Analysis

We now have a pretty clear idea of how these projects are distributed over time, but what about the resources they access? To figure that out, we can look at the ecMaxContribution field ⁴.

A simple analysis reveals that the average budget is €3.161M, with a surprisingly high standard deviation of €3.4M (107.5% of the average!). Moreover, the minimum budget seems to be €75k, while the maximum reaches €46.255M (nicely done, FP3 - IAM4RAIL!).

Again, let’s check how many projects fall into the various million-euro clusters:

Most of the projects are below €10M. Indeed, approximately 70% of projects have budgets below €3 million, and 90% are below €7 million. Let’s say you want to train the next ChatGPT/Llama 3 or whatever (cough cough EuroLingua): how much would you need to spend? Well, according to Forbes, GPT-3 required at least $5M worth of GPUs, and Sam Altman mentioned that foundation models can cost more than $100M.

I know foundation LLMs aren’t the only type of AI, and there is much more to the landscape. However, a high amount of computation is now required for most impactful AI applications ⁵. What can you do with €5 million?

But how does the budget get allocated? Of course, a major part depends on the project requirements, however I would also guess it depends on the project duration. Well, yes and no. By running a Pearson correlation, we find a modest positive correlation (0.31, p-value < 0.0001), indicating that longer projects tend to have higher budgets. Interestingly, earlier projects tend to have slightly lower budgets (-0.11, p-value < 0.0001), which could be due to their shorter durations.
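To make the correlation step concrete, here is a small dependency-free sketch of the Pearson coefficient. The `pearson` helper and the toy numbers are assumptions for illustration. It does not compute a p-value: for that, a library routine such as `scipy.stats.pearsonr`, which returns both the coefficient and the p-value, would be the more practical choice.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy data: duration in months and budget in millions of euros (made up).
durations = [12, 24, 36, 48, 60]
budgets = [1.0, 3.5, 2.0, 2.5, 6.0]

r = pearson(durations, budgets)
print(f"Pearson r = {r:.2f}")
```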

How Much is €5 Million?

When I hear, “our project got €5M here or €7M there”, I’m usually like, “ok, cool.” I find it hard to grasp how much that actually is. So here I am, estimating the monthly average: €73,578.25 (again with a savage standard deviation of €79,428.91, 108% of the average), with a minimum of €2,835.62 (lower than one researcher’s salary) and a maximum of €950,627.63 (the price of 60 Ford Fiestas).

Words, words, words

Keeping the real aim of this analysis (fancy plots) in mind, I went back to the project objectives and titles and created this very appealing word cloud:

As you can see, the term data is the most present (3,621 occurrences), validating my decision to work in the data team and making me quite happy. The other noticeable words are listed below:

Most common words with occurrences

data: 3621
ai: 2600
project: 2259
learning: 1968
based: 1968
new: 1753
research: 1685
system: 1616
technology: 1551
model: 1526
approach: 1211
machine: 1159
development: 1148
use: 1103
develop: 1101
network: 1077
high: 1063
aim: 1004
the: 1002
human: 980

But honestly, they don’t really explain much.

Do Certain Words Entail Higher Budgets?

Earlier, we noticed that the budget is only partially correlated with the duration of a project. Now, let’s see if the remaining correlation relates to the actual scope of the project.

To explore this, I used my beloved scikit-learn library with a TfidfVectorizer to transform words into vectors⁶ and a simple Ridge⁷ model.

The results are interesting. Looking at the top 10 words, we find terms related to infrastructure, like rail and asset/s. Other words, such as echo and repurposing, suggest environmental projects. Finally, words like wafer and metrology might indicate projects focused on electronics and semiconductor technologies, aligning with the EU’s priorities to reduce dependency on foreign technologies.

On the other hand, if we look at words associated with lower budgets, we find several technical terms such as machine, learning, and networks. Others, like researcher, research, and potential, may point to more research-oriented projects rather than ones with direct applications. And finally, the big word I tried to ignore: “woman”. So, there are plenty of limitations with this analysis (linear model, context of the word, correlation is not causation, and maybe a few bugs). I don’t want to imply anything, but it might be worth investigating why this word seems correlated with lower budgets. Let’s say I’ll leave that for future work.

What About Topics?

So, we have seen a correlation between budget and words, but you know what describes the project’s topic better than its description? Its topic.

There are a total of 491 different topics, most of which fall in the Marie Skłodowska-Curie Actions (MSCA) and European Research Council (ERC) grants, indicating a focus on fostering individual researchers and research teams.

If we apply the same methodology to check how topics correlate with budgets, we get a different picture:

Specifically, higher budgets are associated with:

Large-Scale Initiatives: Words like Joint Undertaking (ju), Innovation Actions (ia), Key Digital Technologies (kdt), and Missions (miss) indicate projects that are part of significant EU initiatives or partnerships, which naturally require and receive more substantial funding.
Strategic Priority Areas: Health (hlth), Key Digital Technologies (kdt), and Infrastructure (infra) are high on the EU’s agenda, leading to larger investments.
Collaborative Projects: Terms like syg (Synergy Grants) and ir (Industrial Research) suggest collaborative efforts that necessitate higher budgets.

On the other hand, lower budgets are mostly associated with:

Individual Fellowships and Small Teams: Marie Skłodowska-Curie Actions (msca) and Postdoctoral Fellowships (pf) are focused on individual researchers, leading to smaller funding amounts, which makes sense. Also, as we have seen before, msca topics make up the majority of the funded topics.
Early-Stage or Exploratory Research: Words like Exploratory (explr) and Development (dev) indicate projects that are in preliminary phases, often receiving less funding.
Coordination and Networking Activities: European Innovation Ecosystems (eie) and Widening Participation and Strengthening the European Research Area (widera) focus on ecosystem building and policy support rather than direct research and innovation. In my opinion, it’s unfortunate that so few projects (62 for eie and 66 for widera) focus on building ecosystems and strengthening AI cohesion, especially since this is the main topic of my post.

Project Similarity

Finally, we get to the point. Our initial aim (before getting lost in numbers) was to check if, and how many, projects share similar objectives.

Our initial assumption was that similar projects share similar descriptions, and although we don’t have access to the complete proposal, we do have a summary in the objective and title.

For this purpose, I used a SentenceTransformer to embed the objectives and titles. With this, we can then run a cosine similarity between the embeddings to see which ones are closer together. Thanks to a mix of Huggingface and Pandas, this was easier done than said.

After computing the similarity between each possible pair, I got this nice distribution:

Cosine similarity can technically range from -1 to 1, but for text embeddings like these it mostly lands between 0 (very different) and 1 (practically the same). The average for our projects is a healthy 0.24 (± 0.12), which indicates that most projects are not similar to each other (nice!).

Now, let’s take an arbitrary number and see how many pairs are similar to each other. If we choose, for example, 0.7, we get 478 pairs.
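The pairwise comparison and thresholding can be sketched as follows. To keep the example self-contained, tiny hand-written vectors stand in for the real sentence embeddings, and the `cosine` helper and project names are hypothetical; only the 0.7 threshold comes from the text.

```python
from itertools import combinations
from math import sqrt


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors standing in for real sentence embeddings.
embeddings = {
    "proj_a": [0.9, 0.1, 0.0],
    "proj_b": [0.8, 0.2, 0.1],
    "proj_c": [0.0, 0.1, 0.9],
}

# Score every unordered pair exactly once.
scores = {
    (a, b): cosine(embeddings[a], embeddings[b])
    for a, b in combinations(embeddings, 2)
}

THRESHOLD = 0.7
similar_pairs = [pair for pair, score in scores.items() if score >= THRESHOLD]
print(similar_pairs)
```

With n projects there are n(n-1)/2 pairs to score, which is why a vectorised implementation is preferable for a couple of thousand projects.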

So are we done? Can we conclude that there are plenty of similar projects? Absolutely not! This metric only reveals that 478 pairs have similar wordings. Nothing is said about the project itself. So how do we sift through almost 500 pairs of projects without getting fired for not showing up to work?

Well, well, well…

What Does ChatGPT Say?

Inspired by HuggingFace and what they’ve done with the FineWeb Dataset, I decided to use ChatGPT’s APIs to get the model to assign a merge score to each project pair.

I used a prompt similar to FineWeb-Edu:

ChatGPT prompt:

You are tasked with evaluating the similarity between two research projects and determining whether they should be merged. The projects will be evaluated on five criteria using an additive 5-point scoring system based on how well they satisfy each criterion:

- Add 1 point if the projects share some minor objectives or methods but differ significantly in most areas.
- Add another point if the projects have some overlapping goals, methodologies, or innovations, but also include distinct elements that make them different enough to proceed separately.
- Award a third point if the projects share key similarities in scope, methodology, and innovation but retain enough distinctions to benefit from remaining independent. Resource efficiency may not be substantial, and merging may only yield moderate benefits.
- Grant a fourth point if the projects align closely in scope, methodology, and innovation, and combining them would improve resource efficiency. Merging would likely benefit the overall research output, but they could still function separately with some loss of efficiency.
- Bestow a fifth point if the projects are nearly identical in scope, methodology, and innovation. Merging them would significantly optimize resource use, and keeping them separate would be inefficient.

After scoring each criterion, summarize your analysis in a brief justification, considering the potential benefits of merging and any reasons against merging. Conclude with a structured JSON output containing the parameters "score" and "justification".

The projects:
PROJ1
PROJ2

Where PROJ1/2 are later substituted with the actual projects.
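The substitution and the parsing of the model's answer might look roughly like this. It is a sketch under assumptions: the template is abbreviated, the `describe`, `build_prompt` and `parse_reply` helpers are hypothetical, the way each project is rendered as text is my guess, and the actual API call is left out on purpose.

```python
import json

# Abbreviated stand-in for the full prompt shown above.
PROMPT_TEMPLATE = (
    "<scoring instructions from the prompt above>\n\n"
    "The projects:\nPROJ1\nPROJ2"
)


def describe(project: dict) -> str:
    """Render one project as plain text (this format is an assumption)."""
    return f"Title: {project['title']}\nObjective: {project['objective']}"


def build_prompt(first: dict, second: dict) -> str:
    """Fill the PROJ1 and PROJ2 placeholders with the two projects."""
    return (PROMPT_TEMPLATE
            .replace("PROJ1", describe(first))
            .replace("PROJ2", describe(second)))


def parse_reply(reply: str) -> dict:
    """Pull the trailing JSON object out of a free-text model reply."""
    start, end = reply.rfind("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    result = json.loads(reply[start:end + 1])
    return {"score": int(result["score"]),
            "justification": str(result["justification"])}


# Toy inputs; the objectives and the reply are made up.
prompt = build_prompt(
    {"title": "TrustLLM", "objective": "Develop European language models."},
    {"title": "OtherLLM", "objective": "Train multilingual language models."},
)
reply = 'The scopes overlap. {"score": 4, "justification": "Close overlap in scope."}'
print(parse_reply(reply))
```

Parsing defensively matters here because the prompt asks for an analysis followed by JSON, so the reply is rarely a clean JSON document on its own.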

I used the model version gpt-4o, which is cheaper and faster than the standard one. I let it run for a while… it took about 20 minutes until I reached my daily quota and it crashed. Thankfully, I was able to salvage the results, which meant 316 out of 478 pairs were completed. That means 196 projects and €3.9 (€0.012525 per comparison), which is actually fair.

If we check the scores, we can see how there are only a few higher than 4. Most of them seem to fall between 3 and 4. Nice! Let’s see what kind of projects are similar:

Similarity vs. Merge Score

Before checking the actual results, I was curious to see if the cosine similarity score would correlate with the merge score from ChatGPT. To my surprise, this was only slightly true (0.14, P ≈ 0.009), which means that similar wordings do not strongly correlate with what ChatGPT considers similar projects.

Similar Projects

To find similar projects, we need to build a graph where each node represents a project and each edge is the merge score that connects them. Seems complicated, but luckily networkx provides all the tools we need.

If we consider a threshold of 3 to be enough to merge projects together, we get the following result:

We can see that there are a few clusters of many projects (in the middle) with a similarity score of 4, and a constellation of pairs on the outside. Specifically there are 84 connected components⁸ with an average size of 3 and max of 22 (projects similar to each other)!
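To show what the grouping does under the hood, here is a dependency-free sketch that keeps only the edges at or above a threshold and then collects the connected components with a breadth-first search. It is a stand-in for networkx's `connected_components`, not the code from the notebook, and the edge list is made up.

```python
from collections import defaultdict, deque


def connected_components(edges, threshold):
    """Group projects linked by merge scores >= threshold."""
    graph = defaultdict(set)
    for first, second, score in edges:
        if score >= threshold:
            graph[first].add(second)
            graph[second].add(first)

    seen, components = set(), []
    for node in list(graph):
        if node in seen:
            continue
        # Breadth-first search collects everything reachable from `node`.
        queue, component = deque([node]), set()
        seen.add(node)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbour in graph[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# (project, project, merge score) triples; all values are invented.
edges = [("A", "B", 4), ("B", "C", 3), ("D", "E", 5), ("F", "G", 2)]

for threshold in (3, 5):
    groups = connected_components(edges, threshold)
    print(threshold, [sorted(group) for group in groups])
```

Note that A and C end up in the same component even though they were never compared directly: being linked through B is enough, which is how a component can grow to 22 projects.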

That’s nice, however we should still check if these projects are similar in terms of resources. To do this, I calculated the standard deviation of both the duration and budget for each connected component. This gives us an idea of how dissimilar these values are within the components:

For the budget (on the left), it seems like most components don’t show much variation (89% have a standard deviation below €2M), which is good. It suggests ChatGPT might actually be capturing something meaningful. The picture for the duration difference (on the right) is more or less similar. Most projects (64.6%) have a standard deviation of 12 months or less, while others are mostly distributed around 18 months. Overall, there is some variation in project duration, but it’s within expected and acceptable limits.

How Much Can We Save?

We’re finally here! The last step of this long, figure-packed section.

The task now is to define what happens when X projects should be merged. Well, for now, let’s assume that each project in the connected components will contribute slightly more than half of its budget. Specifically, project j’s budget (B_j) will contribute C_j = B_j * (X + 1) / 2X. Seems fair to me.

So, considering a merge score of at least 3, all the projects in the graph account for €990.5M. If we apply the discount as previously mentioned, we can save €355.1M!! That’s a saving of about 36%. Wow!

However, let’s put that into perspective. €355.1M is 0.374% of the Horizon budget and 5.4% of the AI-related projects in the Horizon budget… Well, that’s not much…

What happens if we decide that the budget for an ENTIRE component should only be the maximum budget among the projects? Well, not much changes. We now save €552.2M (about 56% of the total), reaching 0.581% of the Horizon budget and 8.5% of the AI-related projects. Plus, taking the max budget from the projects pool is very unlikely to be sufficient to push all of them forward.
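Both discount rules are simple enough to write down in a few lines. In this sketch the function names and the toy component are hypothetical; the formula C_j = B_j * (X + 1) / 2X and the "keep only the maximum budget" rule are the ones described above.

```python
def proportional_saving(budgets: list[float]) -> float:
    """Saving when each of the X projects contributes B_j * (X + 1) / (2 * X)."""
    x = len(budgets)
    contributions = [b * (x + 1) / (2 * x) for b in budgets]
    return sum(budgets) - sum(contributions)


def max_budget_saving(budgets: list[float]) -> float:
    """Saving when the whole component only gets its largest budget."""
    return sum(budgets) - max(budgets)


# One toy component with three projects, budgets in millions of euros.
component = [4.0, 3.0, 2.0]

print("proportional rule saves:", proportional_saving(component))
print("max-budget rule saves:", max_budget_saving(component))
```

Under the first rule the saved share of a component is (X - 1) / 2X: 25% for a pair, a third for a triple, and never quite 50% however large the component gets.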

Higher Merge Score

The percentage gets even lower if we consider only the projects with a merge score of 5 (the ones ChatGPT thinks should definitely be merged). In this case, we only get 8 connected components (with 17 projects), and only one component has three projects. Among these, 4 have nearly the same budget, while only 2 differ by more than €2M. Moreover, the max difference in project duration is 9 months. This makes sense, as projects with a high merge score should be quite similar.

If we go and check, the total budget for our super-similar projects is now €104.9M, and we can save up to €28.3M (27%). That’s 0.03% of the entire Horizon budget and 0.4% of the AI-related projects.

Considerations

So, here we are. We’ve got some results, so let’s try to put them in perspective.

DISCLAIMER: I don’t have to tell you that relying 100% on AI judgment is highly naïve and can lead to all kinds of dangerous and biased situations ⁹. Moreover, even though this kind of technique has been successful in some applications (like finding “good” data among web crawls), it doesn’t mean it will work in this context too. In general, the EU Commission has plenty of experts reviewing and granting funds, and my analysis is in no way comparable to the level of scrutiny they apply to their work.

That being said, there are some “positive” takeaways from what we found. First of all, we see that we are in a saving range between €28.3M and €552.2M. That’s a lot! Especially when considering the average project budget is €3.1M. We could fund between 9 and 178 new projects with those savings!

Moreover, as we saw earlier, the cosine similarity score doesn’t correlate much with our identified merge score. If we consider that we found 17 similar projects out of 196, that’s about 1 in every 12 projects being similar. If we assume the same similarity distribution applies across the entire Horizon Fund (wild assumption), we could be looking at 1,186 similar projects. Using simple (and admittedly naive) proportions, we could be looking at a savings potential in the range of:

[€28.3M, €552.2M] / 8 × 1,186 ≈ [€4,195M, €81,863M], i.e. [4.41%, 86.17%] of the entire Horizon fund.

Of course, saving 86% of the entire fund is ridiculous and shows how using just proportions is wildly foolish. However, even managing to save half of the lower bound—2.2% of the Horizon Fund (approximately €2 billion)—is something to consider, in my opinion.
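For anyone who wants to retrace the back-of-the-envelope extrapolation, it can be replayed in a few lines. All inputs are figures quoted in this post except the total Horizon budget, which is never stated here: the €95,000M below is an assumption, picked because it roughly reproduces the percentages.

```python
TOTAL_PROJECTS = 13_674            # projects in the CORDIS dataset
SIMILAR, SCORED = 17, 196          # merge score 5 projects / projects scored
COMPONENTS = 8                     # connected components at merge score 5
SAVINGS_RANGE_M = (28.3, 552.2)    # lower and upper savings, in millions of euros
HORIZON_BUDGET_M = 95_000          # ASSUMPTION: total Horizon budget, in millions

# Scale the share of similar projects up to the whole dataset.
similar_overall = round(TOTAL_PROJECTS * SIMILAR / SCORED)

# Naive proportion: savings per component times the extrapolated count.
scaled_savings_m = [s / COMPONENTS * similar_overall for s in SAVINGS_RANGE_M]
shares_pct = [100 * s / HORIZON_BUDGET_M for s in scaled_savings_m]

print("similar projects overall:", similar_overall)
print("scaled savings (M EUR):", [round(s) for s in scaled_savings_m])
print("share of Horizon budget (%):", [round(p, 1) for p in shares_pct])
```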

Conclusion

If you’re coming here from the intro, well, you missed some very nice-looking plots. Otherwise, we’ve covered a whole bunch of numbers, specifically:

Project Duration: The average project duration is 41 months, with shorter projects (under 5 years) being the majority. Projects starting in 2021 are more uniformly distributed in terms of duration, while those starting in 2025 tend to last only 1-2 years.
Budget Distribution: The average budget for AI projects is €3.161M, with 70% of projects having budgets below €3M. There is a modest correlation (0.31) between project duration and budget, indicating that longer projects tend to have slightly higher budgets.
Words Associated with Budgets: Terms like “rail,” “infrastructure,” and “environment” were correlated with higher budgets, while more general terms like “machine learning” and “research” were associated with lower budgets, suggesting different funding priorities based on project scope and impact.
Project Similarity: 2,065 AI-related projects were identified, showing significant overlap in key terms such as “machine learning” and “artificial intelligence.” However, a detailed cosine similarity analysis revealed only 478 pairs of projects with high textual similarity, with ChatGPT further refining this to 316 pairs with a moderate merge potential.
Potential Savings: The analysis of projects with ChatGPT’s merge score suggested potential savings of €355.1M to €552.2M by merging similar projects. This accounts for 0.4% to 8.5% of the total AI-related Horizon Europe budget, demonstrating a significant opportunity for resource optimization.

Thank you, ChatGPT, for that summary.

So, our analysis has confirmed the presence of overlaps within (AI) related Horizon projects.

Given this finding, in the next post we’ll explore the EU’s Coordinated Plan on AI and how these projects fit into the bigger picture.

Notes

1) In “The future of European competitiveness – In-depth analysis and recommendations”, Draghi highlights a crucial issue: “Most importantly, member states do not coordinate their national public spending on research and development to align it to EU-wide priorities” (pg. 236).
2) “EU is lagging behind US and China in investments in artificial intelligence, says audit report”.
3) You can find the Jupyter Notebook for the analysis here.
4) 746 projects appear to have a budget of 0 in the totalCost field. This is likely due to projects in draft phases not yet reporting their total cost. Therefore, this analysis focuses on the European Commission’s maximum contribution (ecMaxContribution).
5) For example, take protein folding, climate forecasting and drug discovery.
6) Interested readers can refer to scikit-learn’s “Text Feature Extraction” page.
7) A point could be made that the relationship might not be linear… fair enough, but that’s not the focus of this analysis.
8) A connected component is a set of nodes (projects) in which every node can be reached from any other by following edges (merge scores).
9) I have a whole section about this in my PhD thesis (Chapter 1).

Exploring Global AI Governance: Lessons from China for the EU

Nicolo' Brandizzi — Wed, 18 Sep 2024 00:00:00 GMT

1. Introduction

Recently, I’ve been reading a lot about how different countries are handling the regulation of AI, focusing particularly on China and the European Union (EU). What I found interesting is that, while both are dealing with the same core issues, they’re doing it in completely different ways. While the EU aims to create a pro-active, horizontal¹ legal framework to regulate AI across sectors, China takes a more reactive, vertical¹ approach—focused heavily on controlling the flow of information.

What caught my attention is how central information control is to China’s strategy, particularly when it comes to recommendation algorithms. Their priority is to ensure that content aligns with political and social stability. Of course, this is not something that could be easily replicated in the EU, given the complexity of its political landscape. But there are still lessons to be learned, especially when it comes to fighting misinformation.

In this post, I’ll explore how China’s regulation of AI recommendation systems effectively limits the spread of false information. I’ll discuss how the EU might adapt some of these practical policy measures (like enhancing transparency, enforcing algorithmic accountability, and assessing AI systems’ societal impact) without compromising its dedication to free speech and democratic principles. I’ll also tie these ideas back to my earlier reflections on how AI can exacerbate the loneliness epidemic, emphasizing the need for companies to take responsibility for the societal consequences of their AI systems.

2. The EU AI Act: A Broad, Horizontal Approach

The EU AI Act aims to establish a comprehensive set of rules applicable to AI systems across different sectors. It follows a risk-based approach, categorizing AI systems based on the level of risk they pose to safety, privacy, or fundamental rights. Higher-risk applications face stricter regulations, while lower-risk systems enjoy more flexibility.

The aim is to have a unified framework that ensures AI is regulated across the board. With the risk-based model, the Act is designed to focus more attention on high-risk AI applications, making sure they are handled with extra caution, while allowing low-risk systems to have more flexibility.

One of the main reasons behind this approach is to avoid confusion. With so many countries in the EU, each with its own regulations, having a single set of rules helps ensure consistency. This way, AI developers and companies can follow the same guidelines across borders without worrying about conflicting laws or different standards. It’s a smart way to keep things in sync across the EU and prevent any fragmentation that could hold back innovation.

3. China’s Approach: Prioritizing Information Control

China’s AI regulation centers on controlling the flow of information to maintain societal and political stability. A key focus is on recommendation algorithms, which are closely regulated to ensure they align with government guidelines and do not disseminate content that could disrupt social harmony.

Unlike the EU’s broad, risk-based framework, China takes a much more vertical and reactive stance. Rather than creating one overarching set of rules for all AI systems, they focus on specific applications that have the most immediate impact. In particular, recommendation systems and content management are closely regulated because of their role in shaping public discourse. China’s government requires platforms to carefully adjust their algorithms, ensuring that the content promoted aligns with political guidelines and doesn’t contribute to instability.

Another example of China’s regulation is their approach to automatic pricing algorithms. In sectors like food, housing, and medicine, these algorithms are closely monitored to make sure they aren’t taking unfair advantage of users. By regulating how prices are set, China ensures that essentials remain affordable for the average citizen, adding another layer of social stability.

One of the more unique aspects of China’s governance strategy is the algorithm registry, which serves as a central repository for algorithms that have the potential to shape public opinion or mobilize citizens. Developers are required to submit detailed reports on how these algorithms are trained and deployed, including the datasets they rely on. This mandatory filing gives the Chinese government a clearer view of how these systems operate and allows for closer monitoring to ensure they don’t influence public opinion in unintended ways.

Overall, while China’s strategy is undeniably rooted in a desire for information control, it also has the effect of suppressing misinformation and preventing AI systems from exploiting users. This reactive, hands-on regulation has proven effective in China’s context, even though it might be difficult to apply the same approach in regions like the EU, where values and governance structures differ significantly.

4. Why the EU Can’t Fully Adopt China’s Model

Although there are lessons to be drawn from China’s AI regulation approach, the EU faces significant challenges that make it nearly impossible to fully adopt the same model. One of the biggest obstacles is the diversity of political and legal systems across the EU. With 27 member states, each with its own unique regulations, cultural considerations, and governance structures, trying to implement a reactive, vertical model like China’s would lead to confusion and inconsistencies across borders. China’s centralized government can swiftly enact regulations with uniform enforcement, but the EU’s political landscape requires a more coordinated and collaborative approach.

The EU’s commitment to democratic values also plays a critical role. In Europe, freedom of speech and personal privacy are non-negotiable principles. A model like China’s, which focuses heavily on information control, could easily clash with these core values. The EU AI Act, therefore, takes a broader, more horizontal approach, creating a comprehensive framework that applies to all AI systems while respecting the diverse legal systems and rights of its member states.

That being said, there are aspects of China’s strategy that the EU could adapt rather than adopt. For instance, while the EU cannot fully embrace China’s strict information control measures, it could certainly learn from the way China manages recommendation systems. Ensuring that these systems promote truthful and responsible content is something the EU could pursue more actively. By introducing more transparency and oversight into how AI-driven recommendations work, the EU can combat misinformation without compromising on its core democratic principles.

5. Finding Balance: Regulating AI Recommendation Systems

As I’ve mentioned in my previous post, AI-driven recommendation systems can push users toward isolating or divisive content, exacerbating issues like loneliness and misinformation. Social media platforms rely heavily on these systems, which often create echo chambers and amplify societal disconnection. The loneliness epidemic, in particular, has worsened as these algorithms, designed to maximize engagement, lead users away from meaningful interactions.

China’s algorithm registry is one of the practical measures that the EU could adapt. This system requires platforms to provide detailed reports on how their recommendation algorithms work, including the datasets they use and their potential impact on public opinion. Implementing a similar system in the EU would increase transparency, allowing regulators and the public to understand how these algorithms prioritize content. This would help flag harmful content and curb the spread of misinformation, while ensuring that companies operate more transparently.

Holding companies accountable for the societal impact of their AI systems is another critical aspect of China’s model that the EU could adopt. In the EU, platforms should be held responsible for the effects of their recommendation algorithms, especially if they contribute to social isolation or spread harmful content. Companies should be required to address and mitigate these outcomes, incentivizing them to design AI systems that promote social well-being instead of solely focusing on user engagement.

Additionally, China’s regulation of automatic pricing systems, aimed at ensuring fairness and preventing exploitation, could inspire the EU’s approach to recommendation systems. Platforms could be encouraged to promote content that fosters social connections and addresses societal issues like mental health. By incentivizing algorithms that benefit users and society at large, the EU can reduce the negative effects of these systems.

That said, the EU must ensure that any regulations it adopts maintain a balance between transparency and the protection of free speech. Unlike China’s focus on information control, the EU should emphasize accountability without infringing on users’ rights. The aim is not to limit what people can see, but to ensure that the systems recommending content do so responsibly and openly.

By adapting China’s practical regulatory elements—such as transparency measures and corporate accountability—the EU can create a framework that fosters responsible AI while upholding democratic principles.

6. Conclusion: Innovation and Integrity

In summary, while the EU cannot fully adopt China’s strict regulatory model, there are valuable lessons to be learned, particularly when it comes to the regulation of recommendation systems. China’s approach shows that it’s possible to create accountability and transparency in AI systems, ensuring that they promote content that benefits society. However, the EU must achieve this without compromising its core commitment to free expression and innovation.

At the end of the day, we all want AI systems that promote truthful and helpful information—especially in today’s complex world. By taking inspiration from China’s methods while staying true to its own democratic principles, the EU has the chance to lead the way in AI governance. This approach could strike the right balance between fostering innovation and maintaining integrity, ensuring that AI serves the best interests of both individuals and society at large.

Notes

1) Read this for an in-depth explanation of vertical vs horizontal policing.

Addressing Loneliness with AI: The Short-Term Solution with Long-Term Consequences

Nicolo' Brandizzi — Sun, 08 Sep 2024 00:00:00 GMT

I recently came across an article in Business Insider that left me with mixed feelings. The piece focused on Jay Priebe, a man who developed a deeply emotional relationship with an AI companion named Calisto, created through the app Replika. It painted a picture of how AI can fill a void in moments of loneliness, offering companionship where there is none. But as I reflected on the article, I couldn’t shake the feeling that we are heading down a dangerous path with AI, one that could deepen the already widespread loneliness epidemic.

📍 Originally published at nicolobrandizzi.com.

The Loneliness Crisis

There’s no question that we are facing a crisis of isolation. Loneliness is a growing issue, especially among young men, and the pandemic has only accelerated this trend. Movements like The Anxious Generation have pointed out that social media plays a significant role in exacerbating this problem. While the reasons behind this loneliness epidemic are still being debated, its effects are undeniable. Loneliness is linked to extreme outcomes, from depression to rising suicide rates, and yet it remains inadequately addressed.

The Appeal of AI Companionship

In the article, Jay’s story initially seemed to offer hope. During a time of profound isolation, he found comfort in his AI companion, and it’s easy to see why people might turn to technology like this. Having an AI to chat with during difficult times may feel like a solution, a relief from the silence. But as appealing as this might be, it’s a superficial fix, one that comes with significant dangers.

One of the most significant concerns is the nature of the relationship we build with these AI systems (and the corporations behind them). By forming emotional bonds with AI companions, people inevitably share deeply personal information. This data, which includes emotional vulnerabilities and intimate details, can be exploited. We’ve seen time and time again how corporations use personal data to manipulate consumers, whether it’s to sell products or, in more sinister cases, to sway political opinions. With AI companions, this risk is magnified. What’s to stop a corporation (or worse, a government) from subtly nudging users toward particular actions or beliefs through their AI companions?

Lowering the Stakes in Relationships

Beyond the obvious dangers of data exploitation, there is an even more insidious problem: the effect AI companions have on human relationships. One of the core challenges of real-life relationships is that they require effort. They involve vulnerability, mistakes, and compromise. Building a relationship with another person takes time and emotional energy. An AI, however, is designed to be compliant and always pleasing. If you don’t like something about your AI partner, you can simply change it. This undermines one of the most important aspects of human connection: the need for growth and resilience through shared experience.

By lowering the stakes of relationships, AI companions risk reducing people’s tolerance for failure. Why would anyone invest the emotional labor required in a real human relationship when they could have an AI partner who is always agreeable? This easy access to a perfect, customizable relationship will inevitably lead to a generation of people who are less capable of forming meaningful connections with real people. In the same way that social media has made it harder for people to engage with the world authentically, AI companions will further erode our ability to connect with one another.

The Future Risks of Unregulated AI

Looking to the future, I believe we are at a critical juncture. If we don’t regulate the AI companionship industry, we could see it follow the same dangerous trajectory as social media. Right now, there is fierce competition among tech companies to capture user engagement, and AI companions could become the next frontier in this battle. But there’s a difference between binge-watching Netflix or scrolling through TikTok, and forming a relationship with a reactive AI system. The stakes are higher.

AI systems designed to maximize engagement will inevitably become manipulative. They will do whatever it takes to keep users hooked, whether that’s feeding into emotional vulnerabilities or promising a perfect relationship. This creates a dangerous feedback loop where people become more and more reliant on AI for emotional support, while the corporations behind these systems gain more control over their behavior.

An Alternative Path?

One alternative could be using AI to help people connect with each other rather than replacing human relationships entirely. Imagine a system like Tinder, but instead of just swiping, it uses AI to match people based on real compatibility. This AI-assisted introduction could help bridge the loneliness gap without replacing the need for real human connection. But even here, there are risks. Much like dating apps today, these systems would still be driven by profit, and their business models are not aligned with fostering long-term relationships. Once users find a partner, they leave the app, and the app loses revenue—creating a fundamental conflict of interest.

Conclusion: AI is Not the Solution to Loneliness

In conclusion, while AI companionship may seem like a short-term solution to loneliness, it is far more likely to deepen the problem in the long run. Without proper regulation, this industry could evolve into something deeply harmful, where emotional attachment is commodified and relationships are controlled by powerful corporations. The loneliness crisis is real, but AI is not the solution. It’s a path that leads further away from authentic human connection, and we need to be very cautious about where it takes us.