benchmarks · methodology · tokens · proof

How we benchmark Axint honestly

What the benchmark page measures, how the token estimates are computed, what they do and do not prove, and how to rerun the compiler-side benchmarks yourself.

Nima Nejat · Thursday, April 16, 2026 · 6 min read

Benchmark pages get weird fast.

The easiest version is to cherry-pick one flashy example, publish a giant percentage, and call it a day. That's not what we're trying to do with Axint.

What our benchmark page actually measures

The public benchmarks page measures one very specific thing:

How many tokens a compact Axint definition uses compared with the Swift that would normally need to be generated for the same Apple-native surface.

The current public page uses three representative Apple-native surfaces.

The token counts are computed at build time from code that is visible in the page source. There is no client-side hiding and no hand-entered marketing number floating above the examples.

What the estimate is based on

We use the simple four-characters-per-token approximation for source code.

That is not a provider-specific billable number. It is a stable comparison method. The goal is not to pretend we know your exact invoice down to the cent. The goal is to compare compact Axint authoring against the larger Swift output the model or human would otherwise need to carry around.
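As a sketch, the build-time estimate boils down to a couple of lines. The function names here are illustrative, not the actual page-source code:

```typescript
// Rough ~4-characters-per-token heuristic for source code.
// This is a stable comparison method, not a provider-billable count.
function estimateTokens(source: string): number {
  return Math.ceil(source.length / 4);
}

// Percentage saved by the compact Axint definition relative to the
// Swift that would otherwise be generated for the same surface.
function savingsPercent(axintSource: string, swiftSource: string): number {
  const axint = estimateTokens(axintSource);
  const swift = estimateTokens(swiftSource);
  return Math.round((1 - axint / swift) * 100);
}
```

Because both sides use the same heuristic, any constant error in the characters-per-token ratio cancels out of the comparison.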

What this does prove

It proves that the authoring surface is materially smaller.

That matters because agent loops pay for verbosity twice: once when the code is read back into context, and again when it is regenerated during edits.

If a tool can express the same Apple-native feature in a smaller surface area, it has a real systems advantage.
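A back-of-envelope illustration of that compounding, using made-up numbers (say 2,000 tokens of generated Swift versus 400 tokens of Axint, revisited over 10 loop iterations):

```typescript
const iterations = 10;

// Each iteration pays for the file twice: once as input when it is
// re-read into context, once as output when it is re-emitted on edits.
const loopCost = (tokensPerFile: number) =>
  iterations * (tokensPerFile + tokensPerFile);

console.log(loopCost(2000)); // Swift-sized surface → 40000
console.log(loopCost(400));  // Axint-sized surface → 8000
```

The ratio between the two surfaces is preserved, but the absolute gap grows with every iteration the agent takes.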

What it does not prove

It does not prove that every possible Apple feature compresses by the same ratio. It does not prove build-time performance. It does not prove total product ROI by itself.

That's why we keep this page narrow. It is token-efficiency proof, not a catch-all performance claim.

How to rerun the compiler-side benchmarks

The open-source repo also includes compile-time benchmarks for the compiler itself:

```bash
npm run bench
```

Those measure compilation throughput across representative fixtures. They answer a different question than the public benchmark page, but they matter for CI and regression detection.
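Conceptually, a throughput measurement of this kind looks something like the following. This is a hypothetical sketch, not the repo's actual harness; the `compile` callback and fixtures are stand-ins:

```typescript
// Measure compiles per second across a set of source fixtures.
// `performance.now()` is a global in Node.js 16+ and in browsers.
function benchThroughput(
  compile: (src: string) => unknown,
  fixtures: string[],
  iterations = 100
): number {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    for (const fixture of fixtures) compile(fixture);
  }
  const elapsedSec = (performance.now() - start) / 1000;
  return (iterations * fixtures.length) / elapsedSec;
}
```

Running the same fixtures on every CI build turns this number into a regression signal rather than a marketing one.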

The standard we're aiming for

If a public proof page can't be explained in a minute, it usually isn't honest enough yet.

So the Axint version is deliberately simple: the measured code is visible in the page source, the estimation method is stated up front, and the claims are scoped to what the numbers actually show.

That's the bar we want to keep raising.