Conclusion
Line level alignment of text and musical audio
Text is crucial for duration estimation
Rhythm detection can inform downstream components
Accuracy of chorus detection is vital
Vocal detection model uses training based approach
For real-time performance: need to explore alternative vocal
detection models