Sound Event Annotation and Detection with Less Human Effort


Sound is one of the most important mediums for understanding the environment around us. Identifying a sound event in prerecorded audio (such as a police siren, a dog bark, or a creaking door in a soundscape) leads to a better understanding of the context in which the event occurred. To do so, we record a sound scene, search the recording for sound events of interest, determine their time positions (i.e., start and end), and give them meaningful text labels. The typical way for a human to find and annotate a sound event of interest in an unlabeled recording is simply to listen to the audio until the event is heard and then locate its accurate onset and offset. This annotation process is very labor-intensive. My research goal is to reduce the human effort required for sound event detection and annotation. In this dissertation, I present methods to speed up the sound event annotation process. My specific goals fall into two groups, according to what the annotated data is used for. First, sound event annotation is essential for quantifying the contents of a recorded acoustic scene for direct analysis. For this purpose, I focus on building a system that helps a user find sound events of interest and annotate them as quickly as possible. Second, sound event annotation is an essential step in providing the training data needed to build an AI system that automatically identifies sound events (e.g., a sound-based surveillance system). My focus in this case is to help human annotators spend less time labeling training data while still obtaining a high-performance machine learning model. To achieve these goals, I first present a human-in-the-loop system for sound event annotation, I-SED, that lets users find sound events of interest roughly twice as fast as manually labeling the target sounds.
Then, I present methods that address a failure case of I-SED's query-by-example search, which can fail when the initially selected region (i.e., the query) contains multiple sound events. The solution is a new way of improving query-by-example audio search using the user's vocal imitations (i.e., imitating what they do or do not want in a query recording), which helps the user find target sound events quickly. Finally, I present a new type of audio labeling, called point labeling, which makes it easier for human annotators to provide the ground truth labels needed to train a machine learning sound event detection system. Point labels provide more information than weak labels but are still faster to collect than strong labels. I show that a model trained on point-labeled data performs comparably to one trained on strongly labeled data, the typical kind of labeled data, which is harder to collect. This dissertation will be a valuable resource for researchers and practitioners who are looking for new annotation methods under a limited budget. I expect that it will facilitate sound scene understanding by humans as well as AI systems.
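
To make the difference between the three label types concrete, the following is a minimal sketch and not the dissertation's actual data format. It assumes that a strong label carries an onset and offset, that a point label carries a single time instant inside the event, and that a weak label records only that the event occurs somewhere in the recording. All class names, field names, and the midpoint choice are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrongLabel:
    """Assumed form: event class plus onset and offset in seconds."""
    event: str
    onset: float
    offset: float


@dataclass(frozen=True)
class PointLabel:
    """Assumed form: event class plus one time instant inside the event."""
    event: str
    time: float


@dataclass(frozen=True)
class WeakLabel:
    """Assumed form: event class only, with no timing information."""
    event: str


def to_point(label: StrongLabel) -> PointLabel:
    # The midpoint is an arbitrary illustrative choice; an annotator
    # could mark any instant at which the event is audible.
    return PointLabel(label.event, (label.onset + label.offset) / 2)


def to_weak(label: PointLabel) -> WeakLabel:
    # Dropping the time stamp leaves only clip-level presence.
    return WeakLabel(label.event)


strong = StrongLabel("dog_bark", onset=2.0, offset=3.0)
point = to_point(strong)
weak = to_weak(point)
```

Each conversion discards information, which mirrors the ordering described above: a point label says more than a weak label but less than a strong one.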
