Taken June 12, 2019. Anderson .Paak performing live at the Toyota Music Factory, Dallas, TX.

Predicting the Odds of a Song Reaching the Billboard Hot 100

Using elements of music to predict the popularity of songs.

Adrian Melo
16 min read · Dec 10, 2020


Github Link

By Aditi Katragadda, Xylon Vester, Mohana Seelan, Pragna Subrahmanya, Rainey Shah, Adrian Melo.

There is something inexplicably intricate about music that attracts humans. As the renowned neurologist Oliver Sacks wrote in his best-selling book Musicophilia:

“What an odd thing it is to see an entire species — billions of people — playing with, listening to meaningless tonal patterns, occupied and preoccupied for much of their time by what they call ‘music.’ ”

It is interesting to see how some combinations of these "tonal patterns" are more successful than others. Some songs reach billions of listens, while others reach only a few. In this project, we aim to get a better understanding of the complexities of music and what makes a song popular. Using the elements that Spotify uses to describe music, we predicted the odds of a song reaching the Billboard Hot 100. In doing so, we hope to help upcoming artists increase their chances of success during the music creation process.

I. Project Summary

A brief overview of the article

Goal: Accurately predict whether a song will chart on the Billboard Hot 100

Data: The data came from two Kaggle datasets: Spotify Dataset 1921–2020 and Data on Songs from Billboard 1999–2019. These were cleaned and merged to make training easier and more effective.

Models: We used multiple models to classify the songs using soft predictions. The models were: Random Forest, Support Vector Machine, K-Nearest Neighbors, CatBoost Classifier, and XGBoost Classifier.

Results: The models with the best AUC score were XGBoost and Random Forest, each with a test AUC of 0.85.

Conclusion: In the end, we realized that there are a lot of external factors that determine whether a song becomes a Billboard hit. A song’s audio features are not as important as the artist’s popularity, time of release, etc.

II. Data

In this section we review the datasets used as the baseline for the project, the process that went into cleaning the data, the features that were engineered and added to the set, and the trends uncovered within the data.

Base Datasets

There are two base datasets used in the project. The first was obtained from the following Kaggle dataset: Spotify Dataset 1921–2020. This dataset was collected using Spotify’s Web API. We chose it because it includes many interesting features describing the elements of music, such as danceability and popularity. Spotify uses these features in its algorithms to create suggestions tailored to each user’s listening habits. You can find more info on these features in Spotify’s developer documentation. A list of feature definitions is also included in the Appendix, as some of the feature names are not straightforward.

The second dataset was obtained from the following Kaggle dataset: Data on Songs from Billboard 1999–2019. This dataset was created by scraping the Billboard Hot 100 for every week from 1999 to 2019. Whether or not a song appeared on the Hot 100 list for any week in that period serves as the binary outcome variable to be predicted.

Preprocessing

Once the datasets were selected, we adapted them to best fit the goal of the project. We immediately decided to reduce the time span of the data: we obtained ~22,000 random Spotify songs from the years 2000–2010, sampled at about 2,000 songs a year. We dropped the popularity feature from the Spotify dataset entirely, as keeping it would essentially be cheating the model. Next, we reformatted the Spotify dataset to match the Billboard dataset. This included removing:

- all duplicate song names, to prevent collisions between the clean and explicit versions of a song both counting as hits;
- all rows with artist names containing non-ASCII characters, such as Beyoncé;
- all rows with nonetype/NaN fields (unlabeled artists, songs, or other features);
- all rows with a single quote in the artist name, such as Jack’s Mannequin.

Lastly, we reformatted the song names of both the Spotify dataset and the Billboard hits by making every letter lowercase and removing all spaces.
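
As a rough illustration of this cleaning step, here is a minimal sketch using pandas. The file and column names (spotify_2000_2010.csv, name, artists, Name) are assumptions for illustration and may differ from the actual Kaggle files.

import pandas as pd

# Hypothetical file and column names, for illustration only.
spotify_df = pd.read_csv("spotify_2000_2010.csv")
billboard_df = pd.read_csv("billboard_hot100_1999_2019.csv")

def normalize_name(name):
    """Lowercase a song name and strip all spaces so the datasets can be matched."""
    return str(name).lower().replace(" ", "")

# Drop rows with missing fields and with non-ASCII artist names (e.g. Beyoncé).
spotify_df = spotify_df.dropna(subset=["name", "artists"])
spotify_df = spotify_df[spotify_df["artists"].map(lambda a: str(a).isascii())]

# Normalize song names, then drop duplicates (clean vs. explicit versions of a song).
spotify_df["name"] = spotify_df["name"].map(normalize_name)
spotify_df = spotify_df.drop_duplicates(subset="name")
billboard_df["Name"] = billboard_df["Name"].map(normalize_name)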

Then came a crucial preprocessing step: defining the outcome. To do so, we created a dictionary of all unique songs that made it to the Billboard Hot 100, where each key is a song name and each value is the set of associated artists. Once the dictionary was created, we iterated through every song in the Spotify dataset and checked whether it appears in the dictionary of songs that charted on the Hot 100. If a song is present, we checked that the artists from the Spotify and Billboard datasets are consistent, to confirm the song is in fact the same one. If a matching song-artist(s) pair is detected between the datasets, we label the respective row of the Spotify dataset with a 1, otherwise a 0.
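
A minimal sketch of this labeling step, continuing with the hypothetical DataFrames and column names from above (the Spotify "artists" column is assumed to be a stringified list):

import ast

# Build {normalized song name -> set of Billboard artist credits}.
hot100 = {}
for _, row in billboard_df.iterrows():
    hot100.setdefault(row["Name"], set()).add(row["Artists"])

def is_hit(row):
    billboard_artists = hot100.get(row["name"])
    if billboard_artists is None:
        return 0
    spotify_artists = ast.literal_eval(row["artists"])
    # Consider it the same song if any Spotify artist appears in the Billboard credit.
    return int(any(sa in ba for sa in spotify_artists for ba in billboard_artists))

spotify_df["hit"] = spotify_df.apply(is_hit, axis=1)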

Feature Engineering

For feature engineering, we also used the Spotify Web API. We dropped some features based on the correlation map and attempted to add features missing from the dataset that we thought would help the model make a successful prediction.

Correlation Map and One-Hot Encoding

Due to its high correlation with the loudness feature, the energy feature column was dropped to reduce redundancies. The categorical features (year, explicit, mode, and key) were one-hot encoded.

Spotify Audio Feature Correlation Map
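
A minimal sketch of this step with pandas, assuming the cleaned data lives in the hypothetical spotify_df DataFrame from earlier:

import pandas as pd

# Inspect pairwise correlations between the numeric audio features.
audio_cols = ["energy", "loudness", "acousticness", "valence", "danceability", "tempo"]
corr = spotify_df[audio_cols].corr()
print(corr.loc["energy", "loudness"])  # high correlation motivated dropping "energy"

# Drop the redundant column and one-hot encode the categorical features.
spotify_df = spotify_df.drop(columns=["energy"])
spotify_df = pd.get_dummies(spotify_df, columns=["year", "explicit", "mode", "key"])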

Adding the Genre

Our dataset was lacking the music genre, which we believed would be a useful feature to include. Songs within the same genre can have widely varying tempo, valence, and danceability, so genre would not duplicate an existing feature. Two different attempts were made to add the music genres of the songs to our dataset:

  1. Using Spotify API.

Spotify provides, for each artist, the music genres they are most associated with. For example, the following commands:

print(get_artist_genres('Tycho'))
print(get_artist_genres('Train'))

Gave the following output:

['chillwave', 'downtempo', 'electronica', 'indietronica', 'intelligent dance music']
['dance pop', 'neo mellow', 'pop', 'pop rock']

Artists tend to stay within their musical genre when producing new songs, so we believed this would be a good fit. Since Spotify has categorized over 1,000 different music genres, we decided to generalize to the most popular parent genres rather than their sub-genres: for example, dance pop would get classified as pop, and electronica would get classified as EDM. While EDM and electronica differ in certain features like tempo, they fall under a similar category.
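
The get_artist_genres helper is not shown above; a minimal sketch of how it might be implemented with the spotipy client could look like the following. The credentials setup and the parent-genre mapping are simplified assumptions, not the exact code we used.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Client credentials are read from the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET env vars.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def get_artist_genres(artist_name):
    """Return the genres Spotify associates with the closest-matching artist."""
    results = sp.search(q="artist:" + artist_name, type="artist", limit=1)
    items = results["artists"]["items"]
    return items[0]["genres"] if items else []

# Illustrative parent-genre lookup; the full mapping we used was much larger.
PARENT_GENRES = {"dance pop": "pop", "pop rock": "rock", "neo mellow": "pop",
                 "electronica": "edm", "chillwave": "edm"}

def to_parent_genre(genres):
    for g in genres:
        if g in PARENT_GENRES:
            return PARENT_GENRES[g]
    return "other"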

2. Using Another Dataset

We were able to find another dataset that contained over 200,000 songs labeled with their genre. However, attempts to merge the two datasets failed: the secondary dataset was missing roughly half the songs from our primary dataset, so we were not able to use this method.

Adding Seasons Feature

We also decided to add a Seasons feature to our dataset, which records the season in which the song was released (e.g., Summer, Fall). This information was extracted from the “release date” feature in our original dataset. We noticed some interesting trends in the data related to this feature, and therefore decided to include it in our training dataset.
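
A minimal sketch of deriving this feature, reusing the hypothetical spotify_df DataFrame (release_date is sometimes just a year, so those rows fall back to a placeholder):

def to_season(release_date):
    """Map a yyyy-mm-dd (or yyyy) release date string to a season label."""
    parts = str(release_date).split("-")
    if len(parts) < 2:
        return "Unknown"            # year-only dates carry no month information
    month = int(parts[1])
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"

spotify_df["season"] = spotify_df["release_date"].map(to_season)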

Adding Artist Popularity Features

Two features regarding artist popularity were also added to both the train and test datasets. The first was the average hit frequency across all involved artists, calculated by averaging the hit frequency (number of hits / total number of songs) of every artist involved in a song, using the train dataset. The second was the total number of hits across all involved artists, calculated by summing the total number of hits of every artist involved in a song, again using the train dataset. Only the train dataset was used to compute these two features, which were then added to the test dataset accordingly. For artists in the test dataset that never showed up in the train dataset, the respective value became the average of the feature column rather than 0.
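
A rough sketch of how these two features could be computed, assuming the labeled data has already been split into hypothetical train_df and test_df DataFrames with a stringified "artists" list and the binary "hit" label:

import ast
import numpy as np

# Per-artist song and hit counts, computed on the train set only.
artist_stats = {}
for _, row in train_df.iterrows():
    for artist in ast.literal_eval(row["artists"]):
        stats = artist_stats.setdefault(artist, {"songs": 0, "hits": 0})
        stats["songs"] += 1
        stats["hits"] += row["hit"]

def popularity_features(artists):
    freqs, hits = [], []
    for artist in artists:
        if artist in artist_stats:
            s = artist_stats[artist]
            freqs.append(s["hits"] / s["songs"])
            hits.append(s["hits"])
    if not freqs:                       # artist never seen in the train set
        return np.nan, np.nan
    return float(np.mean(freqs)), float(np.sum(hits))

for df in (train_df, test_df):
    feats = df["artists"].map(lambda a: popularity_features(ast.literal_eval(a)))
    df["avg_hit_frequency"] = feats.map(lambda t: t[0])
    df["total_artist_hits"] = feats.map(lambda t: t[1])

# Unseen artists in the test set fall back to the train-set column averages.
for col in ("avg_hit_frequency", "total_artist_hits"):
    test_df[col] = test_df[col].fillna(train_df[col].mean())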

Trends

Before running any model, we wanted to explore our training dataset and analyze any trends that might be useful. Here is what we found:

Total number of songs in train dataset: 11,909
Number of songs that charted: 1,762

Artist Collaborations

Out of the 1,762 songs that charted on the Billboard Hot 100, 1,306 were performed by solo artists and 269 were performed by 2 artists. This means that around 74% of the songs that charted were by 1 artist and around 15% were by 2, with the rest involving three or more artists.

Loudness

Loudness ranges from -60 dB to 0 dB, with higher numbers representing higher loudness. In the decade that we’re analyzing (2000–2010), we see around a 1.35 dB increase in the loudness of recorded music. A 1.35 dB increase is around a 35% change in sound energy, but only around a 12% increase in loudness as perceived by the human ear. This phenomenon has been termed the “loudness war.” Maintaining the quality of a song after digital compression is easier if the song is louder, because the noise, or audio coding errors, are then less perceptible to the ear.

Full range of the Loudness metric.
Zoomed in view to see difference in the Loudness metric through the years.

Gentle vs. Upbeat Music

Energy ranges from 0.0 to 1.0, where higher numbers represent faster, louder, and more intense music. Acousticness ranges from 0.0 to 1.0, with 1.0 meaning that the song is acoustic with the highest confidence.

We plotted the average energy and average acousticness of songs in this decade and noticed that energy levels steadily increased while acousticness levels steadily decreased. Hence, we can observe that recorded music was moving slightly away from gentler music and towards slightly more upbeat music. This also correlates with the increasing trend in loudness that we previously observed.

Full range of the Energy and Acousticness metric.
Zoomed in view to see difference in the Energy and Acousticness metric through the years.

Valence

The higher the valence on a scale from 0.0 to 1.0, the happier the track. We noticed a slight decline in the valence of recorded music in this decade. Since valence is Spotify’s metric for the ‘happiness’ of a track, this data reveals that popular music was getting slightly sadder over this decade.

Full range of the Valence metric
Zoomed in view to see difference in the Valence metric through the years.

Danceability

Danceability is a value from 0.0 to 1.0, with higher scores given to songs with high tempo, rhythm stability, beat strength, and regularity. Danceability was also on a slight decline this decade. Valence and danceability measure similar song components, so a decrease in one feature likely means a decrease in the other.

Full range of the Danceability metric.
Zoomed in view to see difference in the Danceability metric through the years.

Happiness with Change in Seasons

According to Spotify’s documentation, a song with a valence above 0.5 can be assumed to be a ‘happy’ song. Using this metric, we analyzed the happiness trends of songs by season. From the graph below, we can see that more ‘happy’ songs were released during the winter.
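
A minimal sketch of this per-season count, using the hypothetical train_df and season column from earlier:

# Count 'happy' (valence > 0.5) releases per season.
happy_by_season = (
    train_df.assign(happy=train_df["valence"] > 0.5)
            .groupby("season")["happy"]
            .sum()
)
print(happy_by_season.sort_values(ascending=False))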

III. Models

With the data set (pun intended), we then moved on to training the following models.

Random Forest

Relative to the other models in this project, the Random Forest classifier performed sufficiently well. Without any model tuning, the Random Forest classifier achieved an AUC of 1.000 on the training data and 0.756 on the test data. The performance on the training data indicates that the model memorized the dataset and was overfitting.

To tune the model, the hyperparameters ‘n_estimators’ and ‘max_depth’ were selected for a 5-fold cross-validation using the following values:

Hyperparameters for CV in RF

After grid search was performed, the model was retrained using the optimal hyperparameters n_estimators=100, max_depth=16. With these parameters, the Random Forest classifier saw a small decrease in the training data AUC from 1.000 to 0.999 and a small increase in test AUC from 0.756 to 0.763.
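
A minimal sketch of this tuning step with scikit-learn; the candidate values in the grid are illustrative, and X_train/y_train are hypothetical names for the training features and labels:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, 16, 32]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)   # e.g. {'max_depth': 16, 'n_estimators': 100}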

Since the Random Forest classifier does not offer specialized support for categorical variables, a one-hot encoding of the categorical features (year, mode, explicit, and key) was performed. Retraining the model on this data did not offer a significant improvement in model accuracy. The model test AUC dropped from 0.763 to 0.762 after OHE was performed.

After training, the relative importance of the different features in the dataset was recorded and graphed:

Graph: Relative Importances of the Different Features for the RF Classifier

Feature engineering was performed to produce two new features that reflect artist popularity. This resulted in the creation of the Song_Hit_Train_Data_Final_With_Artist_Popularities.xlsx and Song_Hit_Test_Data_Final_With_Artist_Popularities.xlsx datasets. After training the Random Forest classifier with these new features and the OHE categorical variables, it achieved a training AUC of 0.999 and a test AUC of 0.850.

Graph: Relative Importances of the Different Features for the RF Classifier after Feature Creation

Overall, the Random Forest Classifier alone offered satisfactory results.

Support Vector Machine

For a complex classification problem like this, models like the SVM alone do not offer enough predictive power to obtain satisfying results. The SVM trained on the data obtained an AUC of just 0.52 on the test set, which is not much better than random guessing. AUC also rewards well-ranked soft predictions, which a plain SVM does not naturally produce: it computes an optimal hyperplane margin and outputs hard class labels, although margin distances (or Platt-scaled probabilities) can be used as scores. Even when other aspects of the model are considered, such as fitting time, precision, or accuracy, the SVM does not keep up with more robust models like Random Forest or XGBoost.
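
For reference, a minimal sketch of how an AUC can be computed for an SVM at all, using scikit-learn’s decision_function scores (the feature scaling and variable names are assumptions):

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)

# Signed distances to the separating hyperplane can be ranked to compute an AUC.
scores = svm.decision_function(X_test)
print(roc_auc_score(y_test, scores))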

K-Nearest Neighbors

Untuned, the K-Nearest Neighbors algorithm did not perform much better than the SVM model. Although it has a rather low fitting time, after training the classifier on the dataset it achieved a training AUC of 0.849 and a test AUC of 0.565.

A 5-fold cross-validation using grid-search was performed to select the optimal training parameters using the following parameter values:

Hyperparameters for CV in KNN

The cross-validation results yielded parameter configurations of n_neighbors=30, weights=uniform. After training the classifier with the tuned hyperparameters, it achieved a training AUC of 0.717 and a test AUC of 0.618. While this model offers significant improvement compared to the weaker models for this project like the SVM, it did not achieve a very good AUC score overall.

CatBoost Classifier

CatBoost was one of the few models we tried that scored on the higher end of the spectrum. It offers enhanced support for datasets that deal with categorical features. It has the advantage of maintaining the structure of the original dataset, as the user does not need to perform OHE on the categorical variables since the classifier takes care of them internally.

CatBoost selects hyperparameter values during initialization depending on the other initialized hyperparameters, which gives it an edge over other models even without tuning. After 1000 iterations of training, the default CatBoost classifier obtained a training AUC of 0.962 and a test AUC of 0.777.

After specifying four categorical features (year, mode, key, explicit), the CatBoost training performance saw a small decrease in AUC from 0.962 to 0.937 while the test performance saw a small improvement in AUC from 0.777 to 0.778. To further improve performance, the year feature was dropped from the dataset and the release_date feature was added in its place. This small change resulted in the training AUC increasing from 0.937 to 0.946 and the test AUC increasing from 0.778 to 0.792.
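
A minimal sketch of this final CatBoost setup, assuming the features are in pandas DataFrames, release_date is kept as a string column, and the X/y variable names are hypothetical:

from catboost import CatBoostClassifier

# Categorical columns are passed by name, so no one-hot encoding is needed.
cat_features = ["release_date", "mode", "key", "explicit"]

model = CatBoostClassifier(iterations=1000, eval_metric="AUC", verbose=200)
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))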

XGBoost Classifier

XGBoost, like CatBoost, also scored high on the initial model run (without any hyperparameter tuning and without the added features), with an AUC score of 0.84. The next steps were to tune the hyperparameters and retrain the model with the additional features such as the genre, artist popularity, and the season the song was released in. The code snippet with the hyperparameter tuning can be found in our notebook (cell “In [59]”):

The hyperparameter tuning utilized GridSearchCV. GridSearch looks for the optimal hyperparameters that optimize the value of the scoring function, which in our case is the AUC score. We also use cross-validation to improve model predictions on unseen data. For our XGBoost model, we cross-validated 5 times. The hyperparameters chosen by GridSearchCV for the final model were:

XGBClassifier(n_estimators=100, learning_rate=0.05, gamma=0.5, colsample_bytree=0.7, subsample=0.5, max_depth=3, min_child_weight=3)
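
A minimal sketch of how such a search and the final evaluation might be wired up; the candidate values in the grid are illustrative, not the exact ones we searched over:

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "min_child_weight": [1, 3],
    "gamma": [0.5, 1.0],
    "subsample": [0.5, 0.8],
    "colsample_bytree": [0.7, 1.0],
}
grid = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

best = grid.best_estimator_
print(roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))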

The results were promising with a training AUC score of 0.96 and a test AUC score of 0.85.

IV. Results

Model Results in Order from Best to Worst

V. Conclusion

After playing around with the data, we realized that it would be very useful to have a standardized database matching songs, artists, genres, and popularity metrics such as Billboard hits, Spotify plays, Grammy awards, etc. A lot of time was spent cleaning the datasets to fix discrepancies in the formatting of artist and song names between the two datasets in order to come up with properly labeled train and test sets. Building such a database could be a future project that would help data scientists better understand song popularity.

Another insight was that our high scores are most likely due to the fact that we only included songs from 2000 to 2010. Our models would be less accurate over a larger time range, as music trends change as the decades go by. As for future goals for the project, it would be incredibly interesting to build a model that takes these changing music trends into account and predicts correctly regardless of them. We would also like to include a feature with more standardized genres, as genres are a big part of those changing trends. It would also be good to add record labels, as these affect the popularity and marketing of a song, and hence the probability of it charting.

All in all, we realized that there are a lot of external factors that determine whether a song becomes a Billboard hit. A song’s audio features are not as important as the artist’s popularity, time of release, etc. In the end, as Oliver Sacks put it in Musicophilia, “Music is part of being human”: its inexplicable complexities attract us in a way that is embedded in our DNA.

References

Ay, Y. E. (2020, November 25). Spotify Dataset 1921–2020, 160k+ Tracks. Retrieved December 10, 2020, from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

Finding the next Billboard #1 Song. (n.d.). Retrieved December 10, 2020, from sudharshan-ashok.github.io/spotifyeda.html

dB: What is a decibel? (n.d.). Retrieved December 10, 2020, from https://www.animations.physics.unsw.edu.au/jw/dB.htm

Get Audio Features for a Track. (n.d.). Retrieved December 10, 2020, from https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

Siegal, R. (2009, December 31). The Loudness Wars: Why Music Sounds Worse. Retrieved December 10, 2020, from https://www.npr.org/2009/12/31/122114058/the-loudness-wars-why-music-sounds-worse

Appendix

List of feature definitions

Numerical

- acousticness (Ranges from 0 to 1): A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic

- danceability (Ranges from 0 to 1): Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

- energy (Ranges from 0 to 1): Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

- duration_ms (Integer typically ranging from 200k to 300k): The duration of the track in milliseconds.

- instrumentalness (Ranges from 0 to 1): Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

- valence (Ranges from 0 to 1): A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

- tempo (Float typically ranging from 50 to 150): The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

- liveness (Ranges from 0 to 1): Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

- loudness (Float typically ranging from -60 to 0): The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

- speechiness (Ranges from 0 to 1): Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

- year (Ranges from 1921 to 2020): Year the track was released.

Dummy

- mode: 0 = Minor, 1 = Major

- explicit: 0 = No explicit content, 1 = Explicit content

Categorical

- key: All keys of the octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, and so on.

- artists: List of artists credited on the track.

- release_date: Date of release, mostly in yyyy-mm-dd format, though the precision of the date may vary.

- name: Name of the song.
