This month we are talking about the different categories our AI can discern. We have already discussed moods and valences and took a detailed overview of the rhythmic features. This week we will explain features such as the type of vocals or the main instruments in the mix. We will also talk about the scale, main key, and distortion levels. These categories require a great amount of data to be properly identified, so our AI makes use of our great friend: the metadata.
It is easy for our ears, in most cases, to identify whether a song has female or male vocals, or no vocals at all. As humans, we don't have the best hearing system in the animal kingdom. Nonetheless, we are specialized in identifying the frequency ranges and nuances of human voices. We can resolve them from an instrumental background, and we can, almost seamlessly, tell whether a voice sounds feminine or masculine. When we feed a computer a musical composition, however, all the sounds in the spectrum are mixed into one complex soundwave. The frequencies of the human voice merge with other instruments and effects, making the analysis harder. We help our AI during training with a mix of annotated metadata and human curation. If we do not need to listen and manually annotate this information, we get more time to check for more complex musical features. Have you ever listened to a new song and not been sure whether the singer is a boy or a girl? Just like us, the AI can get it wrong sometimes. Tracy Chapman, for example, is one of those cases that gets everybody (humans and AIs alike) confused:
As the name says, this category indicates the most dominant instruments in the song, besides the vocals. Our AI is trained to tag several instruments, such as piano, electronics, guitar, strings, synthesizer, wind, saxophone, flute, trumpet, drum kit, keys, accordion, violin, harpsichord, choir, cello, and electric bass. For this category, we also took metadata into account during training. Recently, we have started to implement an extra layer of specialization, based on a thesaurus, to pull up less common instruments via natural-language correlations.
This category refers to whether the music contains acoustic, electric, electronic, or mixed instruments and sound textures. To train the AI for this category, we need a mix of metadata and human listening feedback. Can you tell the difference?
This category refers to how rough the song sounds. Roughness is a complex feature related to the sort of distortion the instruments or vocals carry. You can see the distortion when you visualize the sound waves. Even though it is an objective trait, training the AI still requires human hearing. Distortion depends on secondary frequencies derived from the primary frequency of the voice or instrument. Those secondary frequencies build up what we call the “timbre”. Normal harmonics and harmonic distortion build pleasant, warm timbres that feel smooth, full, and soft. When the distortion is non-harmonic, the secondary frequencies of the instrument or voice clash with the primary frequencies, generating a sense of roughness. Old magnetic tapes, when saturated, produce pleasant harmonic distortion, desirable in many musical styles. Digital systems, in contrast, clip the tracks abruptly when they reach saturation, generating non-harmonic distortion. Our AI categorizes the roughness of a track as “clear”, “moderate roughness”, or “distorted”.
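If you want an intuition for the difference, here is a small, purely illustrative Python sketch (not our engine's code) that drives a pure tone through a tape-like soft clipper and a digital-style hard clipper, then compares how much of the harmonic energy ends up high in the spectrum:

```python
import cmath
import math

N = 4096        # analysis window length (samples)
CYCLES = 8      # whole periods of the test tone inside the window
DRIVE = 5.0     # gain pushed into the nonlinearity

tone = [math.sin(2 * math.pi * CYCLES * n / N) for n in range(N)]
soft = [math.tanh(DRIVE * s) for s in tone]            # tape-like soft clipping
hard = [max(-1.0, min(1.0, DRIVE * s)) for s in tone]  # digital hard clipping

def harmonic_amp(x, k):
    """Magnitude of the k-th harmonic of the test tone via a naive DFT bin."""
    f = CYCLES * k
    return abs(sum(x[n] * cmath.exp(-2j * math.pi * f * n / N)
                   for n in range(N))) / N

def high_harmonic_ratio(x, split=10, top=60):
    """Share of harmonic energy sitting above the `split`-th harmonic."""
    amps = [harmonic_amp(x, k) for k in range(1, top + 1)]
    total = sum(a * a for a in amps)
    high = sum(a * a for a in amps[split:])
    return high / total

print(high_harmonic_ratio(soft))  # small: the overtones die off quickly
print(high_harmonic_ratio(hard))  # larger: energy lingers in high harmonics
```

The soft clipper rounds the waveform gently, so its overtones fade out fast and the result still sounds warm; the hard clipper flattens the peaks abruptly, leaving much more energy up high, which is exactly the harshness we tag as roughness.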
The scale refers to the relative distance between the notes used in the melodic components of the track. Our AI can differentiate songs written in major, minor, and neutral keys. We tag the scale as neutral when the song uses alternative scales or mixes major and minor keys in different sections.
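To make “relative distance between the notes” concrete, here is a toy Python sketch (illustrative only, not our engine's code) that builds the two common scales from their interval patterns:

```python
# Interval patterns, in semitones, that define the two common scales.
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]          # whole-whole-half ...
NATURAL_MINOR_STEPS = [2, 1, 2, 2, 1, 2, 2]  # whole-half-whole ...

def pitch_classes(root, steps):
    """Walk an interval pattern from a root pitch class (0 = C ... 11 = B)."""
    pcs = [root]
    for step in steps[:-1]:   # the last step just returns to the octave
        pcs.append((pcs[-1] + step) % 12)
    return pcs

c_major = pitch_classes(0, MAJOR_STEPS)          # C D E F G A B
a_minor = pitch_classes(9, NATURAL_MINOR_STEPS)  # A B C D E F G

print(c_major)                              # [0, 2, 4, 5, 7, 9, 11]
print(sorted(a_minor) == sorted(c_major))   # True
```

Note that C major and A minor use exactly the same notes; what differs is the pattern of distances measured from the tonic, and that pattern is what gives a scale its major or minor character.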
This category refers to the dominant key of the song. The key is the principal frequency (i.e., the musical note) around which the other notes of the melody are organized. Only expert human ears can “easily” identify the key of a song, so to properly train our AI on this category we need a mix of annotated metadata and human curation. Luckily, the key of a song is the reference needed to transpose a composition, so it is easy to retrieve the key of a recording from several databases accessible online. We used this information to help our AI during its learning process. Nonetheless, this feature is still challenging, because some songs do not fit a particular key or change key along the track. In these cases, our AI tags them as “unclear”.
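To give a flavor of how a key can be estimated from simple audio statistics, here is a sketch of the classic Krumhansl-Schmuckler profile-matching method. This is a standard textbook approach, not necessarily what runs inside our engine; it correlates a pitch-class histogram with perceptual key profiles for all 24 keys:

```python
# Krumhansl-Kessler key profiles: how strongly each scale degree is felt
# relative to the tonic (index 0 = tonic, steps in semitones).
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR_PROFILE = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def correlation(xs, ys):
    """Plain Pearson correlation between two 12-element vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def estimate_key(chroma):
    """Try all 24 keys; return the (tonic, mode) whose profile fits best."""
    best = None
    for tonic in range(12):
        rotated = chroma[tonic:] + chroma[:tonic]  # move tonic to index 0
        for mode, profile in (("major", MAJOR_PROFILE),
                              ("minor", MINOR_PROFILE)):
            score = correlation(rotated, profile)
            if best is None or score > best[0]:
                best = (score, NAMES[tonic], mode)
    return best[1], best[2]

# A toy pitch-class histogram leaning on C, E and G (a C-major emphasis).
chroma = [4, 0, 1, 0, 3, 1, 0, 3, 0, 1, 0, 1]
print(estimate_key(chroma))  # ('C', 'major')
```

In a real system the histogram would come from the audio itself (e.g., a chroma feature averaged over the track), and a low best-correlation score across all 24 candidates is one way a song can end up tagged as “unclear”.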
We hope this served as an overview of this penultimate subgroup of categories. This time we focused on tags for which metadata and human hearing combine in the training of the AI engine. Next week we will talk about the last group of categories our AI can identify, along with more music examples.
There are still so many categories to discover on musicube! More soon!