Are transcripts created automatically?

Yes. Transcription is built into the recording workflow, not a separate step you have to trigger. Whenever the recording bot captures a Microsoft Teams call, that call is automatically transcribed and the resulting text is stored alongside the audio. There is no button to press and nothing for your users to remember.

Does it identify individual speakers?

Yes. Transcription uses speaker diarization, which separates a conversation into distinct speaker segments and labels each one. So instead of a single wall of unattributed text, you get a transcript that shows who said what across a multi-party call. Because the recording bot also captures each participant on a separate unmixed track, that attribution is cleaner than diarization applied to a single mixed recording.

How accurate is the transcription?

Accuracy depends on the call itself. Clear audio, one person speaking at a time, familiar accents, and common vocabulary all produce strong results, while heavy crosstalk, poor connections, strong accents, or specialized domain terminology make any speech-to-text engine work harder. We do not publish a single accuracy figure because a number measured on one call would be misleading for another. What we can say is that capturing each participant on a separate unmixed track improves speaker attribution and reduces the crosstalk that otherwise degrades transcripts.

Where are transcripts stored?

Transcripts are uploaded automatically to a SharePoint document library in your own Microsoft 365 tenant, right next to the audio recording of the same call. They stay inside your environment, under your access controls and your retention rules, so your compliance and legal teams can search, review, and export them without relying on a third-party platform.

Teams Call Recording with Transcription

Automatic transcription of every recorded call

Transcription is not an add-on or a separate step in our service — it is part of the recording workflow itself. Every call the bot captures is transcribed automatically, so the moment a conversation is recorded it also becomes readable text. Nobody has to upload an audio file to a transcription tool, kick off a job, or remember to do anything at all. If the call was recorded, the transcript exists.

That matters for compliance specifically. A recording program only works if it is exhaustive and consistent, and the same is true of the text derived from it. When transcription is tied directly to capture, there is no gap between the calls you recorded and the calls you can actually search. Every in-scope conversation produces both listenable audio and readable, speaker-labelled text, with no manual intervention that could be skipped, forgotten, or done inconsistently from one team to the next.

This page explains how that transcription is produced, why the underlying audio architecture makes it more reliable than transcribing a typical single-track recording, and how the finished text is stored so your compliance, QA, and legal teams can use it. For the broader capture workflow that feeds transcription, see the recording bot and the compliance recording service.

Powered by Azure AI Speech-to-Text

The transcription itself is produced by Azure AI Speech-to-Text, Microsoft's enterprise speech recognition service. Rather than build a bespoke transcription engine, we use a mature, continuously improved speech platform that is designed for exactly this kind of workload: converting real conversational audio into text at scale, with support for the speaker diarization that compliance review depends on.

Using Azure's speech service keeps the transcription pipeline inside a well-understood Microsoft ecosystem, the same one your Teams calls, your Microsoft 365 tenant, and your SharePoint storage already live in. It also means recognition quality benefits from Microsoft's ongoing investment in speech models, so your transcripts are not tied to an engine frozen at a single point in time. You can read Microsoft's own documentation of the service at Azure AI Speech on Microsoft Learn.

Speaker diarization: who said what

A raw transcript of a multi-party call is not very useful if it is one undifferentiated block of text. What compliance officers, supervisors, and legal reviewers actually need to know is not just what was said but who said it. That is the job of speaker diarization.

Diarization is the process of partitioning an audio stream into segments according to speaker identity — in plain terms, working out when the speaker changes and grouping the segments that belong to the same voice. The output is a transcript broken into turns, each attributed to a distinct speaker, so a conversation reads the way a conversation actually happened: one person, then another, then a reply. Instead of "the client agreed to the terms" floating with no owner, you can see which participant said it and which participant responded.
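As a rough illustration of what "a transcript broken into turns" means, the sketch below groups consecutive speaker-labelled segments into turns. The `Segment` shape, the `to_turns` helper, and the speaker labels are assumptions made for this example; they are not the service's actual output format.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    """One diarized stretch of speech (hypothetical shape, for illustration)."""
    speaker: str
    start: float  # seconds from the start of the call
    end: float
    text: str


def to_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into a single turn."""
    ordered = sorted(segments, key=lambda s: s.start)
    turns = []
    # groupby only groups adjacent items, so a speaker who talks again
    # later in the call starts a new turn, as in a real conversation.
    for speaker, group in groupby(ordered, key=lambda s: s.speaker):
        parts = list(group)
        turns.append(
            Segment(
                speaker=speaker,
                start=parts[0].start,
                end=parts[-1].end,
                text=" ".join(p.text for p in parts),
            )
        )
    return turns


segments = [
    Segment("Speaker 1", 0.0, 2.0, "We can offer"),
    Segment("Speaker 1", 2.0, 4.0, "a fixed rate."),
    Segment("Speaker 2", 4.0, 6.0, "The client agrees to the terms."),
]
turns = to_turns(segments)
for turn in turns:
    print(f"[{turn.start:.1f}s] {turn.speaker}: {turn.text}")
```

Each printed line is one turn with an owner, which is the property reviewers rely on when they need to see who committed to what.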

For regulated communications this is the difference between a recording you have and evidence you can rely on. When a supervisor reviews a flagged call, when a QA analyst scores an interaction, or when counsel reconstructs a disputed conversation, speaker-attributed text lets them find the relevant exchange and see exactly who committed to what — without listening to the entire call to keep track of voices.

Microsoft documents the speaker separation behavior of its speech service in its real-time diarization guide.

Why per-participant audio improves accuracy

Here is where our recording architecture gives transcription a real advantage. Many recording tools produce a single mixed track — every participant blended into one audio stream — and then ask a speech engine to untangle who was speaking from that blend. That works, but crosstalk, overlapping speech, and similar-sounding voices all make the untangling harder, and attribution errors follow.

Our bot does something different. In addition to a combined recording, it captures per-participant (unmixed) audio, with each speaker on their own separate track. When each voice arrives already isolated, the system does not have to guess which of several people talking at once produced a given stretch of speech: the separation is captured at the source rather than reconstructed afterward. The practical result is cleaner speaker attribution and less confusion on busy multi-party calls, precisely the calls where getting attribution right matters most.
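To make the idea concrete, here is a minimal sketch of why isolated tracks simplify attribution: if each track already belongs to one known participant, building a combined transcript is just a merge by timestamp, with no guessing about who spoke. The data layout, the `Utterance` type, and the `merge_tracks` helper are hypothetical and do not describe the product's internal pipeline.

```python
import heapq
from typing import NamedTuple


class Utterance(NamedTuple):
    start: float       # seconds from the start of the call
    participant: str   # known from the track itself, not inferred from the audio
    text: str


def merge_tracks(tracks: dict[str, list[tuple[float, str]]]) -> list[Utterance]:
    """Interleave per-participant utterances into one time-ordered transcript."""
    per_track = [
        sorted(Utterance(start, name, text) for start, text in items)
        for name, items in tracks.items()
    ]
    # heapq.merge combines already-sorted inputs without re-sorting everything.
    return list(heapq.merge(*per_track))


tracks = {
    "Alice": [(0.0, "Can you confirm the order size?"), (5.0, "Agreed.")],
    "Bob": [(2.5, "Five hundred units."), (4.9, "Shall I book it?")],
}
merged = merge_tracks(tracks)
for u in merged:
    print(f"[{u.start:>4.1f}s] {u.participant}: {u.text}")
```

Note that Bob's question at 4.9s and Alice's reply at 5.0s nearly overlap, yet each stays attributed to the right person because the attribution comes from the track rather than from the audio content.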

This per-participant capture is a property of how the recording bot joins and records calls, and it is one of the concrete reasons transcription here holds up better than transcribing an ordinary single-track meeting recording.

Searchable, auditable text

Turning speech into text changes what you can do with a recording. Audio is linear — to find one sentence you either scrub through the whole file or hope you know the timestamp. Text is searchable. Once every call is transcribed, your archive becomes something you can actually query:

Find calls by keyword. Search across transcripts for a product name, a promised commitment, a regulated phrase, or any term that matters to your program — instead of relistening to hours of audio.
Find who said it. Because transcripts are speaker-attributed, you can trace a statement back to a specific participant rather than to an anonymous voice.
Support audit and QA. Reviewers can read a call in seconds, quote it exactly, and move on — making supervision and quality assurance far more efficient than audio-only review.
Support eDiscovery. When a matter requires producing communications, readable and searchable transcripts make identifying and reviewing responsive calls dramatically faster.
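The kind of query described above can be sketched in a few lines. This is an illustrative example only: the in-memory `transcripts` structure, the call IDs, and the `search` helper are assumptions, and a real archive would more likely be queried through whatever search tooling your tenant already provides.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    call_id: str
    speaker: str
    text: str


def search(transcripts: dict[str, list[tuple[str, str]]], keyword: str) -> list[Hit]:
    """Return every turn, across all calls, whose text contains the keyword."""
    needle = keyword.casefold()  # case-insensitive match
    return [
        Hit(call_id, speaker, text)
        for call_id, turns in transcripts.items()
        for speaker, text in turns
        if needle in text.casefold()
    ]


# Hypothetical archive: call ID -> list of (speaker, text) turns.
transcripts = {
    "call-001": [
        ("Advisor", "I can guarantee a fixed rate for twelve months."),
        ("Client", "That works for me."),
    ],
    "call-002": [
        ("Advisor", "Rates may change, nothing is fixed yet."),
        ("Client", "Is there any guarantee?"),
    ],
}
hits = search(transcripts, "guarantee")
for hit in hits:
    print(f"{hit.call_id} | {hit.speaker}: {hit.text}")
```

Because each hit carries the speaker as well as the call, a reviewer can tell at a glance whether the advisor made a guarantee or the client merely asked about one.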

For legal and litigation-hold scenarios in particular, searchable speaker-attributed transcripts turn a pile of recordings into a reviewable evidence set. See how this plays out for legal teams, where the ability to locate and read a specific exchange quickly is often the whole point of recording in the first place.

What affects transcription accuracy

We want to be straight about this: no speech-to-text system transcribes every conversation perfectly, and any vendor quoting a single universal accuracy percentage is glossing over how much accuracy depends on the call. Accuracy is a property of the audio and the conversation, not a fixed number. The main factors are:

Audio quality. Clear microphones and stable connections transcribe well; poor connections, background noise, and low-quality devices degrade recognition for any engine.
Crosstalk and overlapping speech. When several people talk over one another, both transcription and attribution get harder. This is exactly where per-participant capture helps most.
Accents and speaking style. Strong or less common accents, fast speech, and heavy use of jargon can all reduce word-level accuracy.
Domain-specific terminology. Industry terms, product names, ticker symbols, and acronyms are harder to recognize than everyday vocabulary.

The controllable factors point in a consistent direction: better source audio produces better transcripts. That is one more reason the recording bot's per-participant unmixed audio matters — isolating each speaker at capture time removes a major source of crosstalk error before the speech engine ever sees the audio. We treat transcription as a genuinely useful review and search aid, and we are honest that human review still has a place for the highest-stakes calls.

Transcripts stored alongside recordings

A transcript is only useful if the right people can actually get to it. Ours are stored where the rest of the evidence already lives: both the audio recording and its transcript upload automatically to a SharePoint document library in your own Microsoft 365 tenant. The two artifacts stay together, call by call, so anyone reviewing an interaction has the audio and the text side by side.

Because everything lands inside your tenant, transcripts inherit your environment's access controls and your retention configuration — you decide who can read them and how long they are kept, and the data is never co-mingled with another organization's. There is no separate transcription portal to govern and no third-party copy of your conversations to worry about. For the full picture of how storage works, see SharePoint storage.
