Speaker diarization is the task of determining 'who spoke when' in a long multi-speaker audio recording, marking the start and end of each speaker's turns in the interaction. It can be useful for transcribing meetings, classroom speech, or medical interactions. Diarization systems often use voice activity detection to find segments of continuous speech, extract a speaker embedding vector for each segment, and cluster the vectors to group together segments likely from the same speaker.
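The clustering step can be illustrated with a minimal sketch. It assumes voice activity detection and embedding extraction have already produced a list of (start, end, embedding) segments with non-zero embeddings. The function names, the greedy one-pass clustering strategy, and the 0.8 similarity threshold are illustrative assumptions, not a description of any particular diarization system; real systems typically use stronger clustering methods.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diarize(segments, threshold=0.8):
    """Assign a speaker index to each (start, end, embedding) segment.

    Greedy one-pass clustering (an assumption made for brevity): a segment
    joins the most similar existing speaker if the similarity reaches the
    threshold, otherwise it starts a new speaker.
    """
    centroids = []  # one running-sum embedding per speaker found so far
    turns = []
    for start, end, emb in segments:
        scores = [cosine(emb, c) for c in centroids]
        if scores and max(scores) >= threshold:
            speaker = scores.index(max(scores))
            # The direction of the running sum equals that of the mean.
            centroids[speaker] = [c + e for c, e in zip(centroids[speaker], emb)]
        else:
            speaker = len(centroids)
            centroids.append(list(emb))
        turns.append((start, end, speaker))
    return turns


# Toy 2-dimensional embeddings, made up for illustration.
segments = [
    (0.0, 1.5, [1.0, 0.0]),
    (1.5, 3.0, [0.0, 1.0]),
    (3.0, 4.0, [0.9, 0.1]),
]
turns = diarize(segments)
```

In this toy input the first and third segments point in nearly the same direction, so they are grouped under one speaker index while the second segment gets its own.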

State University of New York at Stony Brook

Speech recognition

An ongoing but helpful book resource on NLP
https://web.stanford.edu/~jurafsky/slp3/

Speech and Language Processing (3rd ed. draft) 

Connectionist Temporal Classification (CTC) is a combination of an algorithm and a loss function that transforms a sequence of acoustic sounds into a sequence of letters, i.e., speech recognition. Unlike the encoder-decoder structure, which transforms a longer input sequence directly into a shorter output sequence, CTC first transforms the input sequence into an output sequence of equal length, then transforms that sequence into a shorter output sequence by merging consecutive duplicates and removing blank symbols. The output of the first step is called an alignment, and the transformation in the second step is called the collapsing function.
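The collapsing step can be sketched in a few lines. This is a minimal illustration, assuming the alignment is given as a string with one symbol per input frame and that the underscore stands for the blank symbol; the function name and the choice of blank character are assumptions made for the example.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this example


def ctc_collapse(alignment, blank=BLANK):
    """Collapse a frame-level alignment into an output string.

    Step 1: merge runs of consecutive duplicate symbols.
    Step 2: remove the blank symbols.
    """
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(symbol for symbol in merged if symbol != blank)


collapsed = ctc_collapse("dd_i_nn_n_eer")  # "dinner"
```

The blank between the two runs of "n" is what lets the doubled letter survive: without it, the duplicates would be merged into a single "n".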

Connectionist Temporal Classification

Speaker diarization

The task of wake word detection is to detect a word or short phrase, usually in order to wake up a voice-enabled assistant like Alexa, Siri, or the Google Assistant.
The goal with wake words is to build the detection into small devices at the computing edge, to maintain privacy by transmitting the least amount of user speech to a cloud-based server.

Wake word

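Speaker recognition is the task of identifying a speaker. We generally distinguish the subtasks of speaker verification, where we make a binary decision (is this speaker X or not?), and speaker identification, where we make a one-of-N decision, trying to match a speaker's voice against a database of many speakers.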
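The binary decision in speaker verification can be sketched as a thresholded similarity test. This is a minimal sketch, assuming each voice sample has already been converted to a non-zero speaker embedding vector; the function names, the use of cosine similarity, and the 0.7 threshold are illustrative assumptions rather than a standard.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify_speaker(test_embedding, enrolled_embedding, threshold=0.7):
    """Return True if the test voice is accepted as the enrolled speaker X."""
    return cosine_similarity(test_embedding, enrolled_embedding) >= threshold


# Toy embeddings, made up for illustration.
enrolled_x = [0.8, 0.6, 0.0]
same_voice = verify_speaker([0.7, 0.7, 0.1], enrolled_x)
other_voice = verify_speaker([0.0, 0.1, 1.0], enrolled_x)
```

In practice the threshold would be tuned on held-out data to trade off false accepts against false rejects.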
