Adding spoken cues to transcriptions

Spoken language conveys much more than the words themselves: inflection also carries emotion and emphasis. For example, “It is on the table” is pronounced differently when it is contrasted with “It is on the chair” (stress on “table”) than when it is contrasted with “It is under the table” (stress on “on”).

This project will take as input a recording of spoken text and a transcription of that text, and attempt to mark each word with the time at which it occurs and the emphasis placed on it (as indicated by volume, pitch and duration). The suggested approach is to perform speech-to-text conversion with word timestamps (e.g., https://cloud.google.com/speech-to-text/docs/async-time-offsets), and then to align the recognised words with the given transcript using the minimum edit distance criterion, so that each transcript word inherits the timestamp of the recognised word it is matched to.
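
As an illustration of the alignment step only, the following is a minimal sketch of word-level minimum edit distance alignment. It assumes the speech-to-text output has already been reduced to a list of (word, start, end) tuples with times in seconds; that tuple format, and the names normalise and align, are hypothetical and not part of any particular API. Two further assumptions: a transcript word matched to a misrecognised word inherits that word's timing, and ties between equally cheap alignments are broken in favour of matching or substituting.

```python
from typing import List, Optional, Tuple

# Assumed (hypothetical) shape of one recognised word: (word, start_sec, end_sec).
AsrWord = Tuple[str, float, float]
Aligned = Tuple[str, Optional[float], Optional[float]]


def normalise(word: str) -> str:
    """Lower-case a word and drop punctuation so "Table." matches "table"."""
    return "".join(ch for ch in word.lower() if ch.isalnum() or ch == "'")


def align(transcript: List[str], asr: List[AsrWord]) -> List[Aligned]:
    """Attach ASR timings to transcript words via minimum edit distance."""
    ref = [normalise(w) for w in transcript]
    hyp = [normalise(w) for w, _, _ in asr]
    n, m = len(ref), len(hyp)

    # cost[i][j] is the edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    # Transcript words left unmatched keep None for both times.
    aligned: List[Aligned] = [(w, None, None) for w in transcript]
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            # Match or substitution: the transcript word takes the ASR timing.
            _, start, end = asr[j - 1]
            aligned[i - 1] = (transcript[i - 1], start, end)
            i -= 1
            j -= 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1  # transcript word that the recogniser missed
        else:
            j -= 1  # extra recognised word that is not in the transcript
    return aligned


if __name__ == "__main__":
    transcript = ["It", "is", "on", "the", "table."]
    recognised = [("it", 0.0, 0.2), ("is", 0.2, 0.35),
                  ("on", 0.4, 0.7), ("table", 0.9, 1.4)]
    for word, start, end in align(transcript, recognised):
        print(word, start, end)
```

In this example the recogniser has dropped “the”, so that word is left without a timestamp while the other four words take the recognised timings. A fuller solution might interpolate times for unmatched words, and would then measure volume, pitch and duration within each word's time span to estimate emphasis.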

Expected background: Artificial intelligence, signal processing

Preferred background: As above.

Supervisors: Lachlan Andrew and Jey Han Lau