The CATSeye (AI) Paradigm in the 1990s
Being Three Decades Early and the One Bottleneck AI Still Hasn't Solved
In the late 1990s, while the world was marvelling at Deep Blue defeating Garry Kasparov and Dragon NaturallySpeaking finally letting people talk to computers at normal speed, a less-publicized but equally ambitious project was taking shape. CATSeye—an agent-based intelligent building management architecture patented under WIPO PCT WO1999039276—set out to do something that even today’s smart buildings struggle with: let facility managers simply talk to their buildings.
Using Microsoft's Merlin (Microsoft Agent) as its voice interface, CATSeye enabled natural language commands to a building management system. A facility manager could say “raise zone three temperature to 22 degrees” or “what is the energy consumption on floor two?” and Merlin—via a programmable COM interface—would parse the intent, route the command through the CATSeye agent architecture, and execute it against the building's SCADA and IBMS infrastructure.
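For flavor, here is a minimal sketch of what registering such a voice command against the Microsoft Agent COM control looks like, written in Python with pywin32 for brevity (the original work would have used VB or C++). The ProgID, character file, and command grammar follow my recollection of the Microsoft Agent API and should be treated as an assumption; the CATSeye routing calls in the comments are hypothetical.

```python
# Minimal sketch: loading Merlin and registering a voice command via COM.
# "Agent.Control.2" is the Microsoft Agent ActiveX control's ProgID
# (assumed from memory); the grammar and routing below are illustrative.
import win32com.client

agent = win32com.client.Dispatch("Agent.Control.2")
agent.Connected = True

agent.Characters.Load("Merlin", "merlin.acs")
merlin = agent.Characters("Merlin")
merlin.Show()

# Commands.Add takes (Name, Caption, Voice, Enabled, Visible).
# The "..." in the voice grammar lets the speech engine accept
# arbitrary words in those slots (zone number, setpoint value).
merlin.Commands.Add("SetZoneTemp",
                    "Set zone temperature",
                    "raise zone ... temperature to ...",
                    True, True)

# In the real system, a COM event handler for the Command event would
# parse the recognized utterance and hand the intent to the agent
# network, conceptually something like:
#   zone, setpoint = parse_intent(user_input.Voice)   # hypothetical parser
#   catseye.dispatch(zone, "setpoint", setpoint)      # hypothetical router
```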
This was not a research toy. CATSeye ran on distributed networked computers and was built on open standards: BACnet, LonWorks, and Modbus were integral to its architecture. That meant it could talk to different vendors' HVAC systems, lighting controllers, and energy meters without being locked into a single ecosystem—something that remains a best practice today. And yes, there were memorable hiccups during major presentations; I am sure my colleagues remember when Merlin simply refused to recognize our CEO's voice. Voice training was paramount then, and no amount of architectural brilliance could overcome a speaker's bad cold or a slightly different cadence.
The Vision: Distributed Agents and Conversational Buildings
What made CATSeye genuinely ahead of its time was not just the voice interface, but the agent architecture. In the 1990s, most building management was centralized SCADA—everything reported to a single master controller. CATSeye distributed intelligence across the network, with agents handling local decisions and only escalating what needed coordination.
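To make the escalation pattern concrete, here is a hypothetical sketch, in modern Python rather than anything from the original system, of an agent that acts locally when a request falls within its authority and escalates to a coordinator when it does not. All names and thresholds are invented for illustration.

```python
# Illustrative local-decide / escalate-on-conflict pattern; not CATSeye code.
class ZoneAgent:
    def __init__(self, zone_id, coordinator, min_c=18.0, max_c=26.0):
        self.zone_id = zone_id
        self.coordinator = coordinator
        self.min_c, self.max_c = min_c, max_c

    def handle_setpoint(self, requested_c):
        # Local decision: within the zone's authority, act immediately.
        if self.min_c <= requested_c <= self.max_c:
            self.write_to_controller(requested_c)
            return "applied locally"
        # Outside local authority: escalate for building-wide coordination
        # (e.g. peak-load limits or inter-zone comfort trade-offs).
        return self.coordinator.arbitrate(self.zone_id, requested_c)

    def write_to_controller(self, value_c):
        print(f"zone {self.zone_id}: setpoint -> {value_c} degC")


class Coordinator:
    def arbitrate(self, zone_id, requested_c):
        # Placeholder policy: clamp extreme requests rather than reject them.
        clamped = max(16.0, min(requested_c, 28.0))
        return f"escalated: zone {zone_id} granted {clamped} degC"
```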
This is philosophically very close to what researchers are now calling "edge AI" or "distributed intelligence" for smart buildings. The EU's SUST(AI)N project (2023–2026) is only now catching up to the idea that centralized "unconscious processing" is insufficient for true building awareness [8].
And the open standards support? BACnet had only just become ANSI/ASHRAE Standard 135 in 1995 [7]. LonWorks—which CATSeye supported—was so robust that its protocol was fixed in the early 1990s, and "a device built then will interoperate on the same network as a device built today" [10]. Modbus had been around since 1979, valued for its simplicity. CATSeye's embrace of all three was a pragmatic recognition that real buildings are messy, multivendor environments.
The Persistent Bottleneck: It Was Never About the Voice
Here is the uncomfortable truth that CATSeye's history reveals—and that the AI industry in 2026 is still grappling with:
The bottleneck was never the voice recognition. It was, and remains, the data accuracy and environmental perception.
In the 1990s, the problem was obvious. Sensors were expensive, drifted frequently, and calibration was a constant chore. Voice training was mandatory, and accents or ambient noise could break the system. The issues we had to deal with were not failures of CATSeye's architecture; they were failures of the entire industry's sensor and input reliability.
In 2026, the same problem persists—only now the stakes are higher and the costs are larger.
Have We Come Far Enough After Three Decades?
Voice recognition: Dramatically better, but no architectural revolution. The improvement from 1990s HMM-based systems to today's neural conformer models is staggering. Google Cloud's latest Speech-to-Text models, announced in May 2026, use a single neural network architecture (the "conformer") instead of the old three-part system of separate acoustic, pronunciation, and language models [2]. The result is better accuracy across 23 languages, 61 locales, and challenging noise environments.
A Chinese company claimed (April 2026) that its latest StepAudio 2.5 ASR delivers 400% faster inference, 60% lower latency, and 90% lower pricing than previous models [9]. By borrowing multi-token prediction (MTP) from large language models, it has broken the traditional "one token at a time" bottleneck of speech recognition.
But here is what you will notice: these are improvements in speed, cost, and accuracy—not new capabilities. There is no fundamental architectural breakthrough that changes what speech recognition does. It still transcribes speech to text. It still requires relatively clean audio. It still cannot truly understand meaning the way a human does. The core inversion—mapping acoustic signals to linguistic units—remains conceptually the same as the HMMs of the 1990s, merely executed with vastly more compute and data.
The research frontier, as evidenced in the academic literature, is now focused on unsupervised speech recognition—systems that learn to recognize speech without paired transcripts. A 2025 paper in Speech Communication demonstrated word-level unsupervised ASR achieving 20–23% word error rates without parallel transcripts or pronunciation lexicons [6]. That is genuinely new. But it is research, not product—and it addresses the low-resource language problem, not the building management problem.
Environmental perception and data accuracy: Still the bottleneck. This is where the industry has failed to progress meaningfully. A 2025 analysis from Mastech Digital notes that "only 8% of organizations are data-ready" for AI initiatives [1]. The IBM Institute for Business Value's 2025 CEO Study found that only 16% of AI programs have successfully scaled across the enterprise [8].
Why? The factors are maddeningly familiar to anyone who worked on CATSeye:
- Data is siloed across functions, platforms, and regions
- Quality is inconsistent—duplication, staleness, mislabeling remain widespread
- Lineage is missing—cannot track how data flows and transforms
- Context is tribal—locked in the heads of experts, not encoded in systems
The Forbes Tech Council (March 2026) put it bluntly: "AI won't go mainstream in companies without high-accuracy web data" [5]. And the costs of poor data quality are staggering—Gartner found poor data quality costs organizations at least $12.9 million per year on average [5][8].
For building management specifically, the situation is no better. As one analysis of BACnet, Modbus, and LonWorks notes, Modbus registers have "no intrinsic meaning"—they require manufacturer documentation to interpret [3] (CATSeye agents were equipped to address this issue even then). BACnet offers a standardized ontology (a temperature sensor publishes in °C without ambiguity), but that solves semantic interoperability, not sensor reliability. LonWorks pioneered standardized data types (SNVTs, Standard Network Variable Types) in the 1990s, but its ecosystem has since contracted [3][10].
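To see why the Modbus point matters, consider the sketch below: the protocol hands you only a register number and a raw integer, and everything else—point name, unit, scaling—must come from a vendor document. The register map and names here are invented for illustration.

```python
# Why raw Modbus needs out-of-band documentation: the wire gives you only a
# register number and a 16-bit integer. This vendor map is hypothetical;
# every manufacturer publishes a different one.
VENDOR_REGISTER_MAP = {
    # register: (point name, unit, scale factor)
    40001: ("zone3_supply_air_temp", "degC", 0.1),
    40002: ("zone3_damper_position", "%",    1.0),
}

def decode_modbus_register(register, raw_value):
    """Turn a raw register read into an engineering value, if documented."""
    if register not in VENDOR_REGISTER_MAP:
        raise KeyError(f"register {register}: no intrinsic meaning without docs")
    name, unit, scale = VENDOR_REGISTER_MAP[register]
    return name, raw_value * scale, unit

# A raw read of 223 from register 40001 decodes to 22.3 degC -- but only
# because the map says so. A BACnet Analog Input would instead carry its
# units as a standard property on the wire.
print(decode_modbus_register(40001, 223))
```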
The fundamental problem is data: a sensor sitting in a real building accumulates dust, drifts out of calibration, experiences electromagnetic interference, and occasionally fails entirely. No amount of AI cleverness can compensate for garbage input, and the same is true of every other AI application.
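Because this failure mode is so central, it is worth seeing how little code a first line of defense takes. Below is a minimal drift check, assuming a redundant or reference reading is available to compare against; the class name, window size, and threshold are illustrative, not from CATSeye or any product.

```python
# Minimal drift check against a reference reading; thresholds illustrative.
from collections import deque
from statistics import fmean

class DriftMonitor:
    def __init__(self, window=96, max_bias_c=0.5):
        self.errors = deque(maxlen=window)   # e.g. 24 h of 15-min samples
        self.max_bias_c = max_bias_c

    def update(self, sensor_c, reference_c):
        """Record one sample pair; return True if drift is suspected."""
        self.errors.append(sensor_c - reference_c)
        # A persistent one-sided bias suggests calibration drift rather
        # than noise; flag it for maintenance instead of feeding the
        # reading onward as ground truth.
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and abs(fmean(self.errors)) > self.max_bias_c
```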
What Caused This Underdevelopment?
If voice recognition advanced so dramatically while environmental perception stagnated, the reasons are structural:
1. The data was easier to acquire for voice. Speech recognition benefited from vast, publicly available datasets (LibriSpeech, Common Voice, YouTube audio). You could scrape the web for text to train language models. For building sensors? Every building is different. Sensor placements vary. Equipment ages and degrades. There is no "ImageNet for HVAC sensors."
2. The economic incentives aligned. Amazon, Google, and Apple poured billions into voice because it unlocked consumer ecosystems—buy more stuff, use more services, stay locked in. Who pours billions into better building temperature sensors? The margin on a sensor is tiny. The economic return on making it 0.1% more accurate is even tinier.
3. The research community focused elsewhere. The deep learning revolution from 2012 onward prioritized problems that were both hard and had clear benchmarks: ImageNet for vision, LibriSpeech for voice, SQuAD for reading comprehension. Building management? No equivalent benchmark. No leaderboard. No glamour.
4. The "last mile" problem remains unsolved. You can have the most sophisticated AI model in the world. If the sensor input is wrong, the output is wrong. Edge AI, federated learning, and self-calibrating sensors are active research areas, but production deployments remain rare.
CATSeye Against Today's Products: An Honest Assessment
If CATSeye were re-released today, here is how it would compare:
| Dimension | CATSeye (1999) | 2026 Best Practice |
|---|---|---|
| Voice interface | Merlin, required training, limited vocabulary | LLM-based, zero-training, multi-turn dialogue |
| Architecture | Distributed agents (ahead of its time) | Cloud-centric with edge nodes (similar concept) |
| Open standards | BACnet, LonWorks, Modbus | BACnet dominates; LonWorks legacy |
| Sensor reliability | Manual calibration, drift monitoring | Marginally improved; still largely manual |
| Environmental diagnosis | Rule-based inference | ML-based, but still input-limited |
| Deployment model | On-premise distributed | Cloud or hybrid |
The voice-to-building part is orders of magnitude better today. The building-perception part is only incrementally better.
The Question CEOs Should Be Asking
The Forbes Tech Council article (March 2026) suggests four questions every CEO should ask about their AI systems [5]:
1. What outside data does our AI rely on, and how up-to-date is it when it matters?
2. Can we trace outputs back to their original sources?
3. What signals are missing from our view of the external world, and how would we know?
4. If this decision were challenged, could we explain and defend how it was made?
For building management—and for most enterprise AI—question #3 is the killer. We do not know what we are missing because our sensors are not telling us. And until we solve that, all the voice recognition wizardry in the world will not make a building truly intelligent.
The Bottom Line
CATSeye was years ahead of its time. It was architecturally prescient. It correctly identified that buildings needed distributed intelligence, open standards, and a natural language interface. It even (painfully) identified the real bottleneck: garbage in, garbage out.
CATSeye had its share of voice recognition issues. But those were not failures of the project; they reflected an industry-wide failure to understand that voice recognition without robust environmental sensing is a party trick, not a building management tool.
After three decades, we have made the party trick perfect. We have made it cheap, fast, and multilingual. But the building management problem—the core problem of knowing what is actually happening in a physical environment—remains stubbornly, expensively, unsolved.
The bottleneck we had in the 1990s is still the bottleneck today. That is not a failure of the architecture. It is a failure of an industry that prioritized talking to buildings over listening to them.
Written by: Sanjaya Gunasiri
Copyright © 2023 Pragmatic Engineering. All rights reserved.
