Can AI Navigate by Dead Reckoning?

February 2026 Update:

We’ve heard your feedback that GPT 4.1 is now too old to be a good benchmark! We’ve revamped this post to show you how Opus 4.5 performs on basic data science tasks.

Here, we give both Sphinx (left) and Opus 4.5 on GitHub Copilot (right) some data with bimodal structure. Opus confidently fits a single line through data that clearly contains two distinct groups. The code runs, but the result is shoddy.

Data science isn’t just a question of brute-force model improvement, especially when those improvements are increasingly targeted at software engineering benchmarks. By actually understanding data at a nontrivial level, Sphinx delivers analysis that is on par with humans, letting experts massively scale their impact.

November 2025:

In 1707, Admiral Cloudesley Shovell thought that he was 200 miles off the coast of England, and set off on a straight line for port. His bearing was off by just half a degree, but this was enough to crash his fleet into the Isles of Scilly, sinking four ships and losing over 1,400 sailors.

Over 60 years later, Captain James Cook instead took readings of longitude at least 6 times every day. With this precise introspection, he was able to accurately chart a course between tiny islands across the massive Pacific Ocean.

For both sailors and AI agents, the rate and quality of introspection can unlock new capabilities. For example, consider a home price dataset where 30% of the square footage numbers are random noise close to zero.

By making large logical leaps and not checking each step with care, agents (in this case, GitHub Copilot) can miss this discrepancy and draw invalid conclusions. Misled by corrupted data, the agent claims that house prices grow by $27/sq ft … a drastic underestimate that our NYC-based team very much wanted to believe.

When we invoke Sphinx AI on the same task, we can navigate around bad data with precision:

Sphinx AI identifies and then isolates the unreliable square footage data. This allows our agent to build a robust model that correctly estimates around $123/sq ft. Notably, even though we run 4 times as many distinct agentic steps, we are just as fast at getting to a conclusion.
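To see why near-zero square footage values drag the estimate down, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the generator `make_homes`, the true rate of $123/sq ft, the noise levels, and the 100 sq ft plausibility cutoff. It is not the dataset from the experiment above, and it is not how Sphinx works internally. It only shows that a plain least-squares fit is badly attenuated when 30% of the inputs are noise near zero, and that isolating those rows recovers the slope.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def make_homes(n=2000, true_rate=123.0, corrupt_frac=0.30, seed=0):
    """Hypothetical generator: prices follow true_rate dollars per sq ft,
    but a fraction of recorded square footage is replaced by noise near zero."""
    rng = random.Random(seed)
    sqft, price = [], []
    for _ in range(n):
        true_sqft = rng.uniform(600, 3500)
        p = 50_000 + true_rate * true_sqft + rng.gauss(0, 20_000)
        corrupted = rng.random() < corrupt_frac
        sqft.append(rng.uniform(0, 5) if corrupted else true_sqft)
        price.append(p)
    return sqft, price


sqft, price = make_homes()

# Naive fit: trust every row as recorded.
naive_rate = ols_slope(sqft, price)

# Introspective fit: inspect the column first, then set aside implausible rows.
# The 100 sq ft cutoff is an assumed plausibility threshold for this sketch.
kept = [(s, p) for s, p in zip(sqft, price) if s >= 100]
robust_rate = ols_slope([s for s, _ in kept], [p for _, p in kept])

print(f"rows kept: {len(kept)} of {len(sqft)}")
print(f"naive estimate:  ${naive_rate:.0f}/sq ft")
print(f"robust estimate: ${robust_rate:.0f}/sq ft")
```

The corrupted rows sit at the far left of the plot with ordinary prices, which flattens the fitted line. A single look at the minimum or the histogram of the square footage column is enough to catch it, which is the kind of check that is easy to skip when taking large steps.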

In realistic data science settings, missing values and invalid information are dangerous shoals to be navigated. Therefore, our approach to optimizing AI for data is twofold:

Introspect with extreme frequency — every data load, transformation, join, etc. should be inspected in detail to inform next steps
Optimize for context — if we are being extremely introspective, we need to give our models a succinct (yet complete) description of the data at hand to ensure efficiency
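The two principles above can be sketched together as follows. This is a hedged illustration only: `summarize` and `inspected` are hypothetical helper names, and the one-line-per-column format is one assumed way to keep context short, not a description of Sphinx's actual representation. The idea is that every pipeline step passes through a checkpoint that records a compact summary, so a model (or a person) sees missing counts and value ranges before deciding what to do next.

```python
import statistics


def summarize(rows, max_cols=20):
    """Return a short list of lines describing a table given as a list of dicts:
    row count, then per column the missing count and either the numeric range
    or the number of distinct values."""
    if not rows:
        return ["(empty table)"]
    lines = [f"{len(rows)} rows"]
    for col in list(rows[0].keys())[:max_cols]:
        values = [r.get(col) for r in rows]
        present = [v for v in values if v is not None]
        missing = len(values) - len(present)
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numeric and len(numeric) == len(present):
            lines.append(
                f"{col}: missing={missing} min={min(numeric):g} "
                f"median={statistics.median(numeric):g} max={max(numeric):g}"
            )
        else:
            distinct = len(set(map(str, present)))
            lines.append(f"{col}: missing={missing} distinct={distinct}")
    return lines


def inspected(step_name, rows, log):
    """Record a summary after a pipeline step, then pass the rows through."""
    log.append((step_name, summarize(rows)))
    return rows


# Usage: wrap each load or transformation so nothing goes unexamined.
log = []
rows = inspected("load", [
    {"sqft": 1200, "price": 250000},
    {"sqft": 0.4, "price": 310000},
    {"sqft": None, "price": 190000},
], log)
rows = inspected(
    "drop_missing_sqft", [r for r in rows if r["sqft"] is not None], log
)

for step_name, lines in log:
    print(step_name)
    for line in lines:
        print("  " + line)
```

Keeping each summary to a handful of lines is the design choice that matters here: frequent inspection is only affordable if each inspection is cheap to read.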

There’s a clear balance between context complexity and step-size in agent design. But by using the right representations and the right interface for AI to interact with data, we shift the efficient frontier of that tradeoff, ensuring that we can deliver both accurate results and massive speedups to data teams.

