Multimodal turn detection that combines audio intonation and text context to accurately determine when a speaker has finished their turn in a conversation. The model projects audio embeddings into the ...
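As a rough illustration of how audio and text signals can be fused for end-of-turn prediction, here is a minimal sketch in plain Python. It is not the model's actual implementation. The text above does not say which space the audio embeddings are projected into, so the sketch assumes a hypothetical shared space with the same dimension as the text token embeddings. The function names (`project`, `mean_pool`, `end_of_turn_probability`), the mean pooling, and the logistic head are all assumptions made for illustration.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(frames: List[Vector], weights: Matrix) -> List[Vector]:
    """Linearly map each audio frame embedding to the (assumed) shared dimension.

    `weights` has one row per output dimension and one column per audio dimension.
    """
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weights]
        for frame in frames
    ]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a sequence of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def end_of_turn_probability(
    audio_frames: List[Vector],
    text_tokens: List[Vector],
    proj_weights: Matrix,
    head_weights: Vector,
    head_bias: float,
) -> float:
    """Fuse projected audio with text embeddings and score end-of-turn.

    Hypothetical fusion: concatenate along the sequence axis, mean-pool,
    then apply a logistic head. A real model would likely use a learned
    sequence model instead of mean pooling.
    """
    projected = project(audio_frames, proj_weights)
    sequence = projected + text_tokens  # both now share the same dimension
    pooled = mean_pool(sequence)
    logit = sum(w * x for w, x in zip(head_weights, pooled)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    rng = random.Random(0)
    audio_dim, shared_dim = 4, 3  # toy sizes, chosen only for the demo

    # Random stand-ins for real audio frame and text token embeddings.
    audio = [[rng.uniform(-1, 1) for _ in range(audio_dim)] for _ in range(5)]
    text = [[rng.uniform(-1, 1) for _ in range(shared_dim)] for _ in range(2)]

    # Untrained random parameters; a trained model would learn these.
    proj = [[rng.uniform(-1, 1) for _ in range(audio_dim)] for _ in range(shared_dim)]
    head = [rng.uniform(-1, 1) for _ in range(shared_dim)]

    print(end_of_turn_probability(audio, text, proj, head, head_bias=0.0))
```

In use, the returned probability would be compared against a threshold to decide whether the speaker has finished; the threshold value is a tuning choice that this sketch leaves open.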