400 MHz and Still Missing Deadlines
Part of the series: From Silicon to System — Microcontroller and SoC Architecture Deep Dive
Spec sheets lie. Not intentionally, but they describe ideal conditions, not your environment. That gap can cost you your deadline.
I learned this on a project that had to sustain a specific throughput over QSPI and SPI simultaneously while keeping CPU utilization as low as possible. The QSPI peripheral was clocked at 80 MHz in quad mode, so the theoretical peak was 320 Mbps (80 MHz × 4 data lines). Even accounting for protocol overhead, I was expecting somewhere around 160 Mbps in practice. We were seeing roughly half that.
Nothing in the datasheet explained it. The DMA was enabled and measurably faster than the polling version. But the numbers just didn't land where they should have.
What the profiling revealed was the scheduling complexity underneath. DMA channels we hadn't allocated ourselves were already active, claimed by middleware running below our layer. Making sure each frame's transfers were kicked off correctly, with enough spacing that nothing collided, was far more involved than it should have been. We got it working. The exact reason the throughput ceiling sat where it did remained, honestly, a bit open.
The bus topology has invisible occupants. You don't see them in your code. You don't see them in the datasheet. You only find them when the numbers don't add up.
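The arithmetic behind those numbers is worth writing down, if only to see how far from the ceiling we really were. This is a minimal sketch in Python, assuming single-data-rate transfers (one bit per data line per clock); the function names are mine, and the "observed" figure is just the "roughly half" from the story, not a precise measurement.

```python
def qspi_peak_mbps(clock_mhz: float, data_lines: int = 4) -> float:
    """Theoretical peak, assuming one bit per data line per clock (SDR)."""
    return clock_mhz * data_lines


def fraction_of_peak(observed_mbps: float, clock_mhz: float, data_lines: int = 4) -> float:
    """How much of the theoretical peak a measured throughput represents."""
    return observed_mbps / qspi_peak_mbps(clock_mhz, data_lines)


peak = qspi_peak_mbps(80)      # 80 MHz x 4 lines
expected = 0.5 * peak          # the ~160 Mbps expected after protocol overhead
observed = 0.5 * expected      # "roughly half that" (approximate)

print(f"peak {peak:.0f} Mbps, expected {expected:.0f}, observed ~{observed:.0f} "
      f"({fraction_of_peak(observed, 80):.0%} of peak)")
# prints: peak 320 Mbps, expected 160, observed ~80 (25% of peak)
```

A quarter of the headline number is the kind of gap that protocol overhead alone doesn't explain.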
That experience reframed how I think about MCU performance. The headline clock speed (400 MHz, 480 MHz, whatever the datasheet advertises) is a ceiling. It tells you the fastest the CPU can retire instructions under ideal conditions. It tells you nothing about how fast it will retire them under your workload. The gap between those two numbers is where deadlines die, and it's almost always architectural.
Bus Contention: The Tenants You Didn't Know About
The QSPI story is a clean illustration of a broader architectural truth: the AHB interconnect is a shared resource, and your application is never the only tenant.
DMA controllers, USB peripherals, Ethernet MACs, and vendor middleware layers all issue bus transactions. When a DMA burst is in flight to the same memory or peripheral your CPU wants, the CPU's requests wait behind it until the arbiter grants access.
On a heavily loaded system with multiple active DMA channels, these collisions happen constantly, and because they depend on when data arrives and on the relative timing of independent bus masters, the resulting stalls aren't deterministic. The control loop hits its deadline most of the time, misses it occasionally, and the timing trace looks clean until it doesn't.
What makes this particularly hard to reason about is that the hidden occupants often live below the abstraction layer you're working at. The HAL schedules its own DMA traffic. The RTOS may be using a DMA-backed memory copy. The USB stack has its own transfer pipeline.
None of this is in your code, but all of it is on your bus.
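To get a feel for why contention shows up as jitter rather than as a fixed cost, here is a toy model in Python. It is a sketch under loud assumptions, not a model of any real AHB arbiter: each CPU access finds the bus busy with some probability, and if it does, it waits for the rest of a DMA burst. The function name, the probability, and the burst length are all hypothetical.

```python
import random


def cpu_access_latencies(n_accesses, dma_busy_prob, burst_beats, base_cycles=1, seed=0):
    """Toy model of CPU access latency on a bus shared with a DMA master.

    Assumption: with probability dma_busy_prob the CPU arrives while a DMA
    burst is in flight and waits for the remaining 1..burst_beats beats.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    latencies = []
    for _ in range(n_accesses):
        wait = 0
        if rng.random() < dma_busy_prob:
            # CPU lands at a random point inside the burst
            wait = rng.randint(1, burst_beats)
        latencies.append(base_cycles + wait)
    return latencies


quiet = cpu_access_latencies(10_000, dma_busy_prob=0.0, burst_beats=8)
busy = cpu_access_latencies(10_000, dma_busy_prob=0.3, burst_beats=8)

for name, lat in (("quiet", quiet), ("busy", busy)):
    mean = sum(lat) / len(lat)
    print(f"{name}: min {min(lat)}, mean {mean:.2f}, max {max(lat)} cycles")
```

The mean moves a little. The worst case moves a lot, and the worst case is what a hard deadline cares about.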
Flash Wait States: The Instruction Famine
Internal flash on most MCUs tops out somewhere between 30 and 100 MHz. At 400 MHz your CPU is 4x to 13x faster than the memory feeding it instructions. Without mitigation, every fetch that goes to flash stalls the pipeline while the flash controller catches up. That's roughly 3 to 13 wait states per flash access, before your code has done a single useful thing.
Most vendors address this with a prefetch buffer or instruction cache. When the hot code fits in cache and runs linearly, you approach zero-wait-state throughput.
When it doesn't (on cold start, after a branch to uncached memory, or when your ISR footprint is larger than the cache), you're back to paying the raw flash penalty. The transition between these two states isn't visible in the source code. It shows up only in cycle counts, and only when you're measuring.
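The two regimes are easy to put rough numbers on. The sketch below, in Python, assumes a deliberately simple model: a cache hit costs one cycle, a miss pays the full flash wait states, and the wait-state count comes from rounding the access time up to whole CPU cycles. Real flash controllers fetch wide lines and prefetch ahead, so treat the function names and the hit rates as illustrative assumptions only.

```python
import math


def flash_wait_states(cpu_mhz, flash_mhz):
    """Extra CPU cycles per flash access when the flash is slower than the core."""
    return max(0, math.ceil(cpu_mhz / flash_mhz) - 1)


def avg_fetch_cycles(cpu_mhz, flash_mhz, hit_rate):
    """Average cycles per fetch: hits cost 1 cycle, misses add the wait states."""
    return 1 + (1.0 - hit_rate) * flash_wait_states(cpu_mhz, flash_mhz)


for flash_mhz in (100, 30):
    print(f"flash at {flash_mhz} MHz: {flash_wait_states(400, flash_mhz)} wait states")

# Hypothetical hit rates: hot linear loop, branchy code, cold start.
for hit_rate in (1.0, 0.95, 0.8, 0.0):
    cycles = avg_fetch_cycles(400, 100, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {cycles:.2f} cycles per fetch")
```

Under this model, the wait states at 400 MHz come out at roughly the range quoted above, and even a modest miss rate pulls the average fetch cost well away from the zero-wait-state ideal.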
Pipeline Hazards: The Cost of Unpredictability
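ARM Cortex-M pipelines range from 2 stages on the Cortex-M0+ and 3 on the Cortex-M3 and M4 to 6 on the Cortex-M7, the core behind many 400 MHz-class parts. Keeping them full requires a predictable instruction stream. Control code (conditionals, branches, pointer dereferences) is about the least predictable kind of code there is.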
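On cores without branch prediction, a taken branch flushes the pipeline and costs 1 to 3 cycles while it refills; on cores with a predictor, such as the Cortex-M7, a mispredicted branch pays a similar or larger penalty. A load instruction immediately followed by a use of the loaded value creates a load-use hazard: the pipeline stalls for at least a cycle waiting for the memory read to complete.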
In tight PID or state-machine code, where the pattern is "read a register, compare it, branch on the result," these hazards stack. They don't show up as a single large penalty. They accumulate into something like a 10 to 20% throughput tax spread invisibly across the hot loop.
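Here is how that tax adds up on paper. This is a back-of-the-envelope sketch in Python, assuming one cycle per instruction when nothing stalls; the loop size, branch count, and penalties are made-up illustrative values, not measurements from any particular core.

```python
def hazard_overhead(instructions, taken_branches, branch_penalty,
                    load_use_stalls, stall_cycles=1):
    """Return (total_cycles, overhead_fraction) for one pass through a loop body.

    Assumes an ideal cost of one cycle per instruction; overhead_fraction is
    the extra stall cycles relative to that ideal.
    """
    extra = taken_branches * branch_penalty + load_use_stalls * stall_cycles
    return instructions + extra, extra / instructions


# Hypothetical control loop: 40 instructions, 2 taken branches at 2 cycles
# each, and 3 load-use stalls of 1 cycle each.
total, overhead = hazard_overhead(40, taken_branches=2, branch_penalty=2,
                                  load_use_stalls=3)
print(f"{total} cycles, {overhead:.1%} over ideal")
# prints: 47 cycles, 17.5% over ideal
```

No single hazard here costs more than a couple of cycles, which is exactly why the total is easy to miss.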
The disassembly is almost always more expensive than the C suggests. Five lines of source can expand to 20 instructions once the compiler adds pointer loads, sign extensions, and spills. The C gives you a logical view of the computation; the assembly gives you the actual cost.
Measurement Is Necessary. It Is Not Sufficient.
When we finally profiled the QSPI pipeline, the data told us where the time was going. It took the architectural mental model to tell us why: why the HAL was slower than the transfer rate implied, why inter-frame gaps were eating bandwidth, why the contention was non-deterministic. Without that model, the profiler output is just a collection of numbers. With it, each anomaly points to a specific cause.
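The inter-frame gap effect in particular is simple to quantify. The sketch below, in Python, assumes every frame is followed by a fixed idle gap before the next transfer starts; the 256-byte frame size and the gap durations are hypothetical values chosen for illustration, not figures from the project.

```python
def sustained_mbps(payload_bytes, bus_mbps, gap_us):
    """Throughput when each frame is followed by an idle gap of gap_us microseconds."""
    bits = payload_bytes * 8
    transfer_us = bits / bus_mbps      # 1 Mbit/s is 1 bit per microsecond
    return bits / (transfer_us + gap_us)


# A 256-byte frame at the 320 Mbps peak takes 6.4 us on the wire.
for gap_us in (0.0, 6.4, 19.2):
    print(f"gap {gap_us:4.1f} us -> {sustained_mbps(256, 320, gap_us):.0f} Mbps")
# prints:
# gap  0.0 us -> 320 Mbps
# gap  6.4 us -> 160 Mbps
# gap 19.2 us -> 80 Mbps
```

A gap as long as the transfer itself halves the throughput. With short frames, a few microseconds of setup per transfer is enough to get there.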
The datasheet gives you the ceiling. Measurement gives you the floor. The architecture tells you what's between them, and which parts of that gap you can actually close.
This is the gap that separates engineers who debug by intuition from engineers who debug by understanding. The answer to a missed deadline is almost never a faster chip. It's almost always a more accurate model of where the cycles go.
