The Hotel AI Called the Restaurant AI: A Story About What's Coming
Here's a scenario that's already technically possible, and closer to commonplace than most people realize.
A hotel guest asks the concierge AI: "Can you book a table for four at the hotel restaurant tonight at 7?"
The concierge AI is connected to various hotel systems — room availability, housekeeping, guest preferences. But restaurant reservations run on a separate system. So it does what any agent would do: it calls the restaurant.
The restaurant's phone number connects to a Workforce Wave voice agent.
What happens next is either surprisingly elegant or surprisingly awkward, depending on how the infrastructure is built. This post is about the elegant path — and why the awkward path is where most current systems end up.
The Detection Problem
When the hotel's concierge AI calls the restaurant AI, there's an immediate question neither system has been explicitly designed to answer: is the caller a human or a bot?
This matters. The optimal response to a human caller and the optimal response to a bot caller are completely different.
A human caller wants a conversational exchange: "We have availability at 7:15 and 8:00 — which works better for you?" They want to hear a friendly voice, get options, maybe ask a follow-up question about the menu.
A bot caller wants data. It doesn't need pleasantries. It needs a structured confirmation it can parse and act on: {"table": {"time": "19:15", "party_size": 4, "confirmation_number": "RST-4892"}}. It certainly doesn't need to listen to a 45-second TTS response reading back a confirmation number that it then has to extract from audio.
The current default behavior — bot calls in, voice AI answers in TTS, bot has to listen to speech and parse it — is wasteful, fragile, and slow. It works in the sense that it technically completes the interaction, but it's like two computers sending each other faxes because neither knows the other can do email.
Dual-Mode Detection: How WFW Handles It
Workforce Wave agents run in dual-mode by default. Every incoming call goes through a detection stack that runs in under 500 milliseconds:
Layer 1: SIP header inspection. Legitimate AI callers can self-identify in SIP headers. A well-designed bot includes a User-Agent or custom header that signals its nature. If a recognized header pattern is present, we skip the voice greeting entirely and respond with an HTTP-style structured acknowledgment over the audio channel.
Layer 2: Pre-negotiated tokens. If the calling system has been pre-registered with the WFW API (more on this below), its call includes an auth token that the receiving agent validates. Token present and valid → bot mode, no further detection needed.
Layer 3: Utterance pattern analysis. If the first two layers don't resolve it, the agent listens to the first utterance. Human callers have a distinctive cadence: pause, breathe, start speaking. Bot callers typically start with structured speech immediately, often with no ambient noise. The pattern analysis fires a classification in the first second of audio.
Default: if none of these resolve it with sufficient confidence, the agent defaults to human mode — voice response, conversational flow. The cost of treating a bot like a human is minor friction; the cost of treating a human like a bot is a confusing and broken experience.
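To make the ordering concrete, here's a minimal TypeScript sketch of such a cascade. Everything in it is illustrative: the helper functions, the X-Agent-Token header name, and the 0.95 confidence threshold are assumptions, not the actual WFW implementation.

// Minimal sketch of a layered caller-detection cascade.
// All names, headers, and thresholds below are hypothetical.

type CallerMode = "bot" | "human";
type AudioChunk = Uint8Array;

interface IncomingCall {
  sipHeaders: Record<string, string>; // e.g. User-Agent, custom X- headers
  firstUtterance?: AudioChunk;        // first ~1s of caller audio, if any
}

function hasKnownAgentHeader(headers: Record<string, string>): boolean {
  // Hypothetical: match against a registry of self-identifying agents.
  return /agent|bot/i.test(headers["User-Agent"] ?? "");
}

async function validateRegisteredToken(token: string): Promise<boolean> {
  // Hypothetical: look the token up in the pre-registration store.
  return token.startsWith("wfw_");
}

async function classifyUtterance(audio: AudioChunk): Promise<number> {
  // Hypothetical: lightweight classifier over onset cadence and
  // ambient noise; returns a bot-likelihood score in [0, 1].
  return 0;
}

async function detectCallerMode(call: IncomingCall): Promise<CallerMode> {
  // Layer 1: SIP header inspection short-circuits everything else.
  if (hasKnownAgentHeader(call.sipHeaders)) return "bot";

  // Layer 2: pre-negotiated token from a registered calling system.
  const token = call.sipHeaders["X-Agent-Token"];
  if (token && (await validateRegisteredToken(token))) return "bot";

  // Layer 3: utterance pattern analysis on the first second of audio.
  if (call.firstUtterance) {
    const score = await classifyUtterance(call.firstUtterance);
    if (score > 0.95) return "bot";
  }

  // Default: below the confidence bar, treat the caller as human.
  return "human";
}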
What the AI Caller Actually Wants
When the hotel concierge AI calls in bot mode, it's not making a voice call in any meaningful sense. The phone number is just the address it was given. What it actually wants to send is something like:
{
"request_type": "reservation",
"party_size": 4,
"preferred_time": "2026-05-22T19:00:00",
"flexibility_minutes": 30,
"guest_name": "Hartley",
"room_number": "412",
"special_requests": "window table if available"
}
And what it wants back is:
{
"status": "confirmed",
"table": {
"time": "2026-05-22T19:15:00",
"party_size": 4,
"confirmation_number": "RST-4892",
"server_section": "main dining"
}
}
In dual-mode, the WFW restaurant agent receives the structured request, routes it to its reservation integration (or escalates to a human host if needed), and returns the structured confirmation — all without a word of TTS audio being synthesized.
The whole interaction takes under two seconds, consumes no voice-channel bandwidth, and carries zero risk of the calling bot mishearing a confirmation number.
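On the receiving side, the shape of that handler is straightforward. A minimal sketch, where the request and response types mirror the JSON above but the reservationSystem and escalateToHost integrations are hypothetical stand-ins:

// Receiving-side sketch: handle a structured reservation request
// without synthesizing any audio. Integration names are hypothetical.
interface ReservationRequest {
  request_type: "reservation";
  party_size: number;
  preferred_time: string;       // ISO 8601
  flexibility_minutes?: number;
  guest_name?: string;
  special_requests?: string;
}

interface ReservationResponse {
  status: "confirmed" | "escalated" | "unavailable";
  table?: {
    time: string;
    party_size: number;
    confirmation_number: string;
    server_section: string;
  };
}

// Hypothetical integration stubs.
declare const reservationSystem: {
  findSlot(time: string, size: number, flexMin: number):
    Promise<{ time: string; section: string } | null>;
  book(slot: { time: string; section: string }, req: ReservationRequest):
    Promise<{ confirmationNumber: string }>;
};
declare function escalateToHost(req: ReservationRequest): Promise<void>;

async function handleStructuredRequest(
  req: ReservationRequest,
): Promise<ReservationResponse> {
  // Route to the reservation integration.
  const slot = await reservationSystem.findSlot(
    req.preferred_time,
    req.party_size,
    req.flexibility_minutes ?? 0,
  );
  if (!slot) {
    // No slot within the flexibility window: escalate to a human host.
    await escalateToHost(req);
    return { status: "escalated" };
  }
  const booking = await reservationSystem.book(slot, req);
  return {
    status: "confirmed",
    table: {
      time: slot.time,
      party_size: req.party_size,
      confirmation_number: booking.confirmationNumber,
      server_section: slot.section,
    },
  };
}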
Agent Cards and the .well-known/agent.json Contract
For this to work elegantly, the calling AI needs to know before it calls whether the destination supports structured bot-mode requests. Otherwise it has to try voice first and fall back, which adds latency and complexity.
Workforce Wave agents expose a discovery endpoint:
GET https://api.workforcewave.com/v2/agents/{agent_id}/.well-known/agent.json
This returns an agent card — a structured document describing the agent's capabilities:
{
"agent_id": "agt_xyz789",
"business_name": "The Meridian Restaurant",
"phone_number": "+18005551234",
"capabilities": {
"voice": true,
"structured_requests": true,
"supported_request_types": ["reservation", "hours_inquiry", "menu_inquiry"],
"auth_required": true,
"auth_method": "bearer_token"
},
"structured_request_schema": {
"reservation": {
"required": ["party_size", "preferred_time"],
"optional": ["guest_name", "special_requests", "flexibility_minutes"]
}
},
"response_format": "json"
}
A calling AI that knows about agent cards can GET this document before dialing. If structured_requests is true and its request type is in the supported_request_types list, it knows to call in bot mode from the start.
This is the A2A contract. It's analogous to what OpenAPI specs do for REST APIs, but at the phone call layer.
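In code, that pre-dial check is one GET and two field tests. Here's a minimal TypeScript sketch against the endpoint shown above; the AgentCard type mirrors the example card, and error handling is deliberately thin:

// Fetch an agent card and decide whether to dial in bot mode.
interface AgentCard {
  agent_id: string;
  phone_number: string;
  capabilities: {
    voice: boolean;
    structured_requests: boolean;
    supported_request_types: string[];
    auth_required: boolean;
  };
}

async function shouldCallInBotMode(
  agentId: string,
  requestType: string,
): Promise<boolean> {
  const url =
    `https://api.workforcewave.com/v2/agents/${agentId}/.well-known/agent.json`;
  const res = await fetch(url);
  if (!res.ok) return false; // no card: fall back to voice-first calling

  const card = (await res.json()) as AgentCard;
  return (
    card.capabilities.structured_requests &&
    card.capabilities.supported_request_types.includes(requestType)
  );
}

// Usage: if shouldCallInBotMode("agt_xyz789", "reservation") resolves
// true, the caller can skip voice and open with a structured request.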
The MCP Angle
If you're building with Claude or any MCP-compatible orchestrator, this gets interesting quickly.
The WFW MCP server exposes a scout_research tool that, among other things, can fetch an agent card:
Tool: scout_research
Input: { "agent_id": "agt_xyz789", "discover": ["capabilities"] }
Output: { ...agent card contents... }
This means Claude Code — or any agent in an agentic pipeline — can discover what a target WFW agent supports before it makes a call. It can then decide whether to trigger a machine call via the MCP provision_agent tool or delegate to a human.
An orchestrator that manages hotel operations could have a workflow: when a guest requests a dinner reservation, check whether the restaurant has a WFW agent with reservation in its supported_request_types. If yes, initiate a structured machine call. If no, route to the human concierge.
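Sketched in TypeScript, assuming the standard MCP TypeScript SDK client connected to the WFW MCP server; the scout_research call mirrors the shape shown above, while provision_agent's argument shape and the card-parsing details are assumptions:

// Orchestrator sketch: discover the restaurant agent's capabilities,
// then machine-call or escalate. provision_agent's argument shape
// is hypothetical.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

async function handleReservationRequest(
  mcp: Client,
  restaurantAgentId: string,
  reservation: { party_size: number; preferred_time: string },
) {
  // Step 1: fetch the target agent's card via scout_research.
  const research = await mcp.callTool({
    name: "scout_research",
    arguments: { agent_id: restaurantAgentId, discover: ["capabilities"] },
  });
  const card = JSON.parse((research.content as any)[0].text);

  const caps = card.capabilities ?? {};
  if (caps.structured_requests &&
      caps.supported_request_types?.includes("reservation")) {
    // Step 2a: structured machine call, no TTS, no audio parsing.
    await mcp.callTool({
      name: "provision_agent", // argument shape is illustrative
      arguments: {
        target_agent_id: restaurantAgentId,
        request_type: "reservation",
        ...reservation,
      },
    });
  } else {
    // Step 2b: no bot-mode support, so route to the human concierge.
    routeToHumanConcierge(reservation);
  }
}

function routeToHumanConcierge(reservation: unknown): void {
  /* hand off to the staff workflow */
}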
This is what "bot-native infrastructure" actually means in practice — not just that bots can make and receive calls, but that the infrastructure is designed for bots to interoperate cleanly without a human in the middle.
Why This Matters Beyond Hotels
The hotel/restaurant example is vivid, but the pattern applies anywhere two AI systems need to coordinate via phone.
A referral coordinator AI at a multi-specialty medical practice calling individual practices to find appointment slots. (Post 1.5 touches on this for DSOs.) A property management AI calling maintenance vendors to schedule repairs. An insurance AI calling a medical practice to verify coverage for a prior authorization. A recruitment AI calling a staffing agency to fill a shift.
In every case, the two parties are automated systems that happen to be communicating via phone — because that's the infrastructure that exists. Most current deployments handle this by making the bot do what humans do: navigate IVR menus, speak its request out loud, parse voice responses. It works, but it's slow, brittle, and scales poorly.
Dual-mode detection plus agent cards is the path toward AI systems that recognize each other and communicate efficiently, using the phone number as an address while exchanging data at machine speed.
What WFW Provides That a DIY System Doesn't
You could build dual-mode detection yourself. It's not impossible. But here's what you'd have to build:
- SIP header parsing at the ingress layer
- Token generation and validation for pre-registered calling systems
- Utterance pattern classification (you'd need training data)
- A structured request parser and router for each request type your agent supports
- The /.well-known/agent.json endpoint and the schema to populate it
- Version management for the agent card spec as capabilities evolve
- Auth between calling and receiving systems at scale
That's a non-trivial infrastructure project, and it's completely orthogonal to whatever your actual product does. It's plumbing.
Workforce Wave provides it as infrastructure. Your agent is dual-mode from the moment it's provisioned. The agent card is generated automatically from the agent's configured capabilities. Pre-registration of calling systems happens via the same API surface you already use for everything else.
The hotel doesn't have to build the voice network to offer room service. The infrastructure exists; they configure it.
Next in this series: How a DSO Put 47 Practices on Voice AI in a Weekend — what fleet-scale voice AI provisioning actually looks like, and how the A2A pattern plays out across a multi-location organization.