Score-Informed and Hierarchical Methods for Computational Musical Scene Analysis

Imagine sitting in a room listening to some friends play a song. Perhaps one friend is playing guitar, another playing bass, and a third is playing drums. The musical content in this scene is extraordinarily complex, yet it contains many types of structure that are easy for us to comprehend. For instance, even though we only hear the mixture of all the musicians playing together, we have a good idea of what each of the instruments sounds like by itself. Furthermore, we are able to understand the notes that each instrument is playing, telling us what musical events are happening and when they occur. While it might be trivial for our human brains to decipher this musical scene and extract all of these structures, it is difficult to make a machine perform a similar analysis.

These are just some of the problems that fall under the umbrella of Musical Scene Analysis, a subfield of machine learning that specifically targets the analysis of music as raw audio data. The goal of this field is to identify and locate the key elements of a musical scene that humans attend to when we listen to songs. In fact, the structures mentioned in the previous paragraph both have corresponding Musical Scene Analysis tasks. Estimating what each instrument sounds like in isolation is called source separation, a fundamental task in Musical Scene Analysis. Determining what notes musicians play and when they occur is called Automatic Music Transcription (AMT), another fundamental problem in the field. Progress on these two tasks can enable a wide array of applications, ranging from search, retrieval, and analysis of large-scale musical corpora (e.g., Spotify, Apple Music, YouTube) to creative tools for musicians.
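To make these two task definitions concrete, the standard formulation from the source separation literature can be written as follows (this is the field's generic notation, not necessarily the notation used in the dissertation itself):

    x(t) = \sum_{i=1}^{N} s_i(t)

where x(t) is the recorded mixture and s_1(t), ..., s_N(t) are the N isolated sources (e.g., guitar, bass, and drums). A source separation system observes only x(t) and produces estimates \hat{s}_i(t) of each source. An AMT system instead maps x(t) to a set of note events, each typically described by a pitch, an onset time, an offset time, and an instrument label.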
In this dissertation, I extend the capabilities of Musical Scene Analysis systems by creating source separation and AMT systems that are (a) able to support more instruments, and (b) more controllable than prior work. I do this by introducing Score-Informed methods (i.e., using musical score data as a conditioning input or as a training target) and hierarchical methods to analyze musical scenes. Supporting more instruments matters because many musical scenes contain multiple instruments playing simultaneously, yet most prior AMT systems are trained on isolated piano recordings and thus cannot transcribe multiple instruments in a mixture. Likewise, many prior separation systems support only a small number of fixed source types, modeling each source independently. Here, I extend the capabilities of existing separation and transcription systems. First, I produce a combined separation and AMT system that simultaneously separates and transcribes up to five instruments in a mixture, many more than most AMT systems typically consider. Second, I reframe source separation as hierarchical, showing how a system can learn relationships between different source types to efficaciously separate much finer-grained source types than most previous systems.

To make systems controllable for a given input example, the goal is to create separation systems whose output can be altered without the costly process of retraining them from scratch. In other words, I want to make source separation systems that are steerable at inference time. In this dissertation, I accomplish this in two ways. The first is through the hierarchical lens, where I introduce a Query-by-Example mechanism that lets the system change which source it separates at inference time, giving end-users a control knob for which source gets separated. The second method of control emerges from the recognition that the Musical Scene Analysis tools we build will never be perfect, so it is important to be able to correct these systems when they make mistakes. From this idea, I revive the paradigm of Score-Informed separation, which uses musical score data as a conditioning input to the separator, and rejuvenate it for the deep learning era. I show how a deep Score-Informed separation system allows note-level edits to the source estimates at inference time, providing the ability to make finer-grained edits than previous deep learning-based separation systems.

For two of the three projects in this document, I provide mockups of user interfaces built atop the techniques proposed here, as a means of envisioning how these systems might work in the hands of musicians and artists. I hope the work in this dissertation paves the way for a new generation of work that emphasizes flexibility and controllability, for the sake of making systems that can impact the workflows of end-users.
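To give a sense of how inference-time steerability might look in code, below is a minimal sketch of a query-conditioned separator in PyTorch. Everything here (the class names, the mask-based architecture, the mean-pooled query embedding) is an illustrative assumption rather than the dissertation's actual implementation; the point is only that a query example, rather than a fixed source label, selects what gets separated.

    import torch
    import torch.nn as nn

    class QueryEncoder(nn.Module):
        """Maps a query (isolated audio of the desired source) to one embedding vector."""
        def __init__(self, n_bins=513, dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_bins, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, query_spec):            # (batch, time, n_bins) magnitudes
            frames = self.net(query_spec)         # (batch, time, dim)
            return frames.mean(dim=1)             # pool over time -> (batch, dim)

    class QueryConditionedSeparator(nn.Module):
        """Predicts a soft time-frequency mask for whichever source the query describes."""
        def __init__(self, n_bins=513, dim=128):
            super().__init__()
            self.encoder = QueryEncoder(n_bins, dim)
            self.mask_net = nn.Sequential(
                nn.Linear(n_bins + dim, dim), nn.ReLU(),
                nn.Linear(dim, n_bins), nn.Sigmoid())

        def forward(self, mix_spec, query_spec):  # both (batch, time, n_bins)
            q = self.encoder(query_spec)                          # (batch, dim)
            q = q.unsqueeze(1).expand(-1, mix_spec.shape[1], -1)  # copy across time
            mask = self.mask_net(torch.cat([mix_spec, q], dim=-1))
            return mask * mix_spec                                # masked source estimate

    # Example: separate whatever source the query contains from a 100-frame mixture.
    model = QueryConditionedSeparator()
    mix = torch.rand(1, 100, 513)      # mixture magnitude spectrogram
    query = torch.rand(1, 40, 513)     # short query clip of the desired source
    estimate = model(mix, query)       # (1, 100, 513)

Swapping the query at inference time (say, a few seconds of isolated cello instead of isolated guitar) changes which source the same trained network extracts, with no retraining. Score-informed conditioning follows the same pattern, with a representation of the score taking the place of the query embedding, so that editing the score edits the separated output.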
