LyricAlly: Automatic Synchronization of Acoustic Musical Signals and Textual Lyrics

Ye Wang, Min Yen Kan, Tin Lay Nwe, Arun Shenoy, Jun Yin

Introduction

Motivation

Is singing voice transcription really necessary?

Speech recognizers cannot be directly deployed

Availability of music lyrics on the internet

Bryan Adams – Back to you

Chorus sections detected by high level of repetition.

Accounts for phoneme, word and line level repetition.
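The repetition idea above can be illustrated at the line level only. The following is a minimal sketch, not the system's actual detector: the function names, the normalization rule, the `min_count` threshold, and the sample lyric lines are all assumptions made for illustration, and phoneme and word level repetition are not covered.

```python
from collections import Counter
import re


def normalize(line):
    """Lowercase a lyric line and drop punctuation so near-identical lines match."""
    return " ".join(re.findall(r"[a-z']+", line.lower()))


def repeated_line_flags(lyric_lines, min_count=2):
    """Return one boolean per line: True if its normalized text occurs
    at least min_count times in the lyrics (a crude chorus cue)."""
    keys = [normalize(line) for line in lyric_lines]
    counts = Counter(key for key in keys if key)
    return [bool(key) and counts[key] >= min_count for key in keys]


# Invented placeholder lyrics, not taken from any real song.
lyrics = [
    "First verse line one",
    "First verse line two",
    "This is the hook",
    "This is the hook!",
    "Second verse line one",
    "This is the hook",
]
flags = repeated_line_flags(lyrics)
for flag, line in zip(flags, lyrics):
    print("CHORUS?" if flag else "       ", line)
```

Exact matching after normalization is the simplest possible choice; a softer similarity measure would be needed to catch lines that differ by a word or two.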

Observation: Gaps between sections are shorter and more stable than the sections themselves

LYRICALLY SYSTEM DEMO

Starting point calculation is more difficult than duration estimation

Decreasing order of criticality:

Line level alignment of text and musical audio

Text is crucial for duration estimation

Rhythm detection can inform downstream components

Accuracy of chorus detection is vital

Vocal detection model uses a training-based approach

For real-time performance, alternative vocal detection models need to be explored

GENERAL

Limitation – Restricted to 4/4 meter and the song structure V₁-C₁-V₂-C₂-B-O (verse, chorus, verse, chorus, bridge, outro)

Future Work – alternate meter and song structure

AUDIO

Limitation – Is the MM-HMM the optimal classifier?

Future Work – mixture modeling or classifiers like SVM and NN

Limitation – Restricted to percussive audio

Future Work – new approach to drumless rhythm detection

TEXT

Limitation – Phoneme duration estimation independent of tempo

Future Work – Tempo information re-estimation
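To make the TEXT limitation concrete, the sketch below estimates a lyric line's duration by summing per-phoneme mean durations, first without tempo and then with a simple tempo correction. Everything here is an assumption for illustration: the duration table holds placeholder values rather than measured ones, the ARPAbet-style phoneme symbols are a convenience, and the linear scaling against a `reference_bpm` is a hypothetical stand-in for whatever tempo re-estimation the future work would adopt.

```python
# Hypothetical mean phoneme durations in seconds; placeholder values only.
MEAN_DURATION_S = {
    "B": 0.05, "AE": 0.10, "K": 0.05,
    "T": 0.05, "UW": 0.20, "Y": 0.05,
}


def estimate_line_duration(phonemes, tempo_bpm=None, reference_bpm=120.0):
    """Sum per-phoneme mean durations for one lyric line.

    With tempo_bpm=None the estimate ignores tempo (the stated limitation).
    Otherwise the sum is scaled linearly, assuming the table was collected
    at reference_bpm; this scaling rule is an assumption, not a known result.
    """
    base = sum(MEAN_DURATION_S[p] for p in phonemes)
    if tempo_bpm is None:
        return base
    return base * reference_bpm / tempo_bpm


# Assumed phoneme sequence for a short three-word line.
line = ["B", "AE", "K", "T", "UW", "Y", "UW"]
tempo_free = estimate_line_duration(line)
slow_song = estimate_line_duration(line, tempo_bpm=60.0)
print(round(tempo_free, 2), round(slow_song, 2))
```

Under these assumptions the same line is estimated to last twice as long in a song at half the reference tempo, which is the kind of dependence a tempo-independent estimate cannot capture.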

