Predicting Billboard Hot100 Song Hits

Sep 5

What is a song? Merriam-Webster defines it as merely a “short musical composition of words and music”.
That leads us to our problem statement:

How can we analyze a song’s words and music such that we can predict the probability of its success?

Before answering that question, we first need to define what success means. In this project, we are using the Billboard Hot100 chart as a measurement of success. Since 1958, Billboard Hot100 has been the music industry’s standard record chart in the United States for the top 100 songs. The chart’s ranking is based on physical and digital sales, radio play, and online streaming activity from platforms like Spotify, Apple Music, YouTube, Pandora, etc.

Here are several milestones to keep in mind as you continue reading below. This project covers the period between 2008 and 2021, as my initial intention was to identify K-pop songs within Billboard Hot100. As you can see, the first K-pop song appeared on the charts back in 2009. However, due to a lack of data, I changed the business question to cover any songs played within the U.S.

The streaming era is shaping the way we produce, distribute, and consume music. It presents an opportunity for new and emerging artists to showcase their music and make their “big break” through channels like TikTok, SoundCloud, or YouTube. In fact, Spotify’s SEC filings have shown that major label market share has been steadily declining since 2017. Now, both streaming companies like SoundCloud and major label companies are keeping their eyes wide open for the next new and emerging artist to make a viral hit.

Business Value:

Because of this recent and rapid change in music, companies are taking a reactive approach. In other words, they approach a new artist only after their song goes viral. This project will construct a model to help companies and artists alike be proactive and produce a song that can be widely commercialized while keeping up with current trends.

About the Data

Compiling the most recent data meant acquiring it manually. With Python’s BeautifulSoup package, weekly statistics from Billboard.com were scraped from 2008 to May 2022 (the time of the project). Next, song lyrics were extracted through the Genius API and audio feature information was extracted through the Spotify API.

Additional tidbits discovered during this phase:

Some artists want solely radio-play and choose to opt out of streaming platforms
Country restrictions or licensing changes can prevent songs from being listed on Spotify
Spotify does not have songs that made exclusive deals with Apple Music
Spotify does not have replays of live performances - must resort to YouTube
Songs without enough release date information default to January 1st of the given year

Since this was a binary classification problem, a target variable called “billboard” was created, with “1” being songs listed on Billboard Hot100 and “0” being songs that were not. Because the scraped data contained no examples of class 0, those songs were sampled from Kaggle and data.world datasets.

Data Wrangling

Data cleaning was perhaps the most difficult part of the project. Since the data was pulled from five different sources, consistency and integrity were the most important goals. Various NLP methods were used to create a unique identifier; in this case, a combination of song title and artist name. Plain string matching in Python was not smart enough to realize that “Bam Bam by Camila Cabello” was the same object as “Bam Bam Feat. Ed Sheeran by Camila Cabello”. Moreover, it was important to double-check the data since a significant portion was user-generated.
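To make the matching problem concrete, here is a minimal sketch of one way to build such an identifier with the standard library. The function names and the regular expression are my own illustrative assumptions, not the exact cleaning rules used in the project.

```python
import re
import unicodedata

# Hypothetical pattern: drops a featured-artist credit such as "Feat. Ed Sheeran",
# "(feat. X)" or "ft. X". It greedily removes everything after the credit,
# which is a simplification that real data may not tolerate.
_FEAT = re.compile(r"[\(\[]?\b(?:feat|featuring|ft)\b\.?\s+[^\)\]]*[\)\]]?", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase, strip accents, featured-artist credits and punctuation."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = _FEAT.sub(" ", text)
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def song_key(title: str, artist: str) -> str:
    """Build a join key from a song title and an artist name."""
    return f"{normalize(title)}|{normalize(artist)}"


key_a = song_key("Bam Bam", "Camila Cabello")
key_b = song_key("Bam Bam Feat. Ed Sheeran", "Camila Cabello")
print(key_a == key_b)
```

A key like this only catches the easy cases; user-generated titles with typos would still need fuzzy matching or manual review.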

Once the data was cleaned, the next step was feature engineering. Two new features were created:

One column labeled artists as either a “major label” (1) or “new and emerging” (0), based on when the artist’s debut song was released. The threshold was 2017 based on Spotify’s findings that major label market share had been declining since then.
The second feature was an aggregation of Spotify’s 1,000+ unique genre tags of each artist. Spotify does not tag artists with traditional genre categories; instead, it clusters music by listening patterns. It was evident that online streaming platforms made it possible for new genres to emerge, often fusions of the globally accessible music.
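The two engineered features above could be sketched roughly as follows. The keyword mapping is entirely hypothetical (the actual grouping of Spotify tags is not shown here), and treating a 2017 debut as “new and emerging” is an assumption about how the threshold was applied.

```python
DEBUT_YEAR_THRESHOLD = 2017  # from the post; whether 2017 itself is inclusive is assumed

# Hypothetical keyword -> broad genre mapping, for illustration only.
GENRE_KEYWORDS = {
    "hiphop": ("hip hop", "rap", "trap"),
    "screen": ("soundtrack", "show tunes", "glee"),
    "pop": ("pop",),
}


def label_artist(debut_year: int) -> int:
    """Return 1 for 'major label' (debut before the threshold), else 0."""
    return 1 if debut_year < DEBUT_YEAR_THRESHOLD else 0


def broad_genres(tags):
    """Collapse fine-grained Spotify genre tags into broad buckets."""
    found = set()
    for tag in tags:
        tag = tag.lower()
        for broad, keywords in GENRE_KEYWORDS.items():
            if any(keyword in tag for keyword in keywords):
                found.add(broad)
    return sorted(found)


print(label_artist(2010), label_artist(2019))
print(broad_genres(["dance pop", "atl hip hop"]))
```

Substring matching is crude (a tag can land in several buckets), but it keeps the aggregation easy to audit.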

The figure on the left shows a snippet of Spotify’s 5,000+ genre classification labels. This work was developed by Glenn McDonald, Spotify’s Data Alchemist.

The final products were two separate datasets:

The audio dataset has 13,807 rows and 44 columns, of which 6 columns have a string data type and the remaining 38 are numeric. The numeric data consists of both continuous and binary measurements, which meant the features needed to be scaled with a Standard Scaler, a MinMax Scaler, or a combination of both, whichever generated the best results. This dataset covered information such as Billboard statistics, song data, audio features, and feature-engineered columns.
The second dataset has 14,024 rows and 8 columns in which 5 columns have string data type and 3 columns are numeric. This dataset had song lyrics for NLP analysis.
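For the audio dataset, mixing scalers per column can be done with sklearn’s ColumnTransformer. This is only a sketch on synthetic numbers; the column meanings in the comments are illustrative assumptions, and which scaler suits which column was decided empirically in the project.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for three numeric columns (not the real data).
X = np.column_stack([
    rng.normal(200_000, 40_000, size=50),  # e.g. a duration in ms (continuous)
    rng.uniform(0, 1, size=50),            # e.g. an audio feature such as energy
    rng.integers(0, 2, size=50),           # e.g. a binary flag, left unscaled
])

scaler = ColumnTransformer(
    [
        ("standard", StandardScaler(), [0]),
        ("minmax", MinMaxScaler(), [1]),
    ],
    remainder="passthrough",  # binary columns pass through unchanged
)
X_scaled = scaler.fit_transform(X)
print(X_scaled.shape)
```

In a real pipeline the transformer would be fit on the training split only, then applied to the test split.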

While it was ideal to merge these datasets and create a holistic model, they were kept separate in this project for interpretability purposes.

Exploratory Data Analysis

Audio Dataset

There were two metrics to measure time - the year that the song was released (release_year) and the year that the song was listed on the Billboard (billboard_year). When looking at only class 1 (Billboard), release_year did not have a perfectly linear relationship with billboard_year.

When the data points that fell below the trend line were extracted, results showed that these were primarily holiday songs from the 1960s to the 1980s. Moreover, almost all of these were in the January charts, meaning the song activity and plays were carried over from December.
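One simple way to pull out such points is to flag songs whose chart year is much later than their release year. The rows, the chart_month field, and the two-year cutoff below are hypothetical placeholders rather than the project’s data or exact rule.

```python
# Hypothetical rows standing in for class 1 (Billboard) songs.
rows = [
    {"title": "Hypothetical Holiday Song", "release_year": 1964, "billboard_year": 2019, "chart_month": 1},
    {"title": "Hypothetical New Single", "release_year": 2019, "billboard_year": 2019, "chart_month": 6},
    {"title": "Hypothetical Late Bloomer", "release_year": 2017, "billboard_year": 2018, "chart_month": 3},
]

MAX_GAP_YEARS = 2  # assumed cutoff for "well below the trend line"


def below_trend(row, max_gap=MAX_GAP_YEARS):
    """True when a song charted much later than it was released."""
    return row["billboard_year"] - row["release_year"] > max_gap


outliers = [row for row in rows if below_trend(row)]
january_share = sum(row["chart_month"] == 1 for row in outliers) / len(outliers)
print([row["title"] for row in outliers], january_share)
```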

More importantly, this chart implies an important consumer behavior: listeners like to reminisce on the good ol' classic Christmas music during the holidays. That led to a couple of assumptions:

While producing Christmas music in November/December may present itself as an opportunity, the competition is more saturated if the goal is to reach the Billboard chart.
Listeners are not necessarily always looking for something "new and exciting".

The two figures on the right show a Word Cloud of artists based on the number of songs that made it to the Billboard Hot100 chart.

When looking specifically at the years of 2008 and 2009, artists like Glee, Adam Lambert, and David Cook produced many of the Billboard song hits during those years.

What these artists have in common is that they mainly sang covers of much older songs, many of which did not reach the Billboard charts when they were originally released! This supports the previous claim that reminiscence is a key factor for listeners.

Moving along, the data scraped from billboard.com had an important metric to analyze: weeks_on_chart. When visualized as a box plot by year (not shown), the yearly median trended downward, while songs with long chart runs increasingly showed up as outliers.

Does this mean that listeners get tired of songs more quickly? Or does it mean listeners have access to many more talented artists? Perhaps both. The streaming era opened up doors for listeners to enjoy not only what the radio stations play for them but now also music from different countries and sources, creating fusions of genres and personalization for each individual listener.

The animated Plotly Express bar plot (on the right) was created with the new feature that was engineered during the Data Wrangling process. The genres are listed in descending order from 2008 year-end results, indicating that any change seen throughout the animation is a deviation from the original trend. In fact during the animation, the “hiphop” genre is noticeably getting more popular through the later years while “pop”, once the most popular, slowly loses share to other genres.

Lyrics Dataset

The preprocessing and vectorization of song lyrics was one of the more challenging parts of this project. Because songs tend to be repetitive, the TF-IDF vectorizer from sklearn was used (as opposed to Bag of Words, which would be biased towards high-frequency tokens). Other transformations included “formalizing” slang and colloquial spelling, removing non-ASCII characters, and extending the stop words to include vocalise (e.g. “ahh”, “ooh”) and song composition terms (e.g. “intro”, “chorus”). Parameters like min_df and max_df were also adjusted to narrow the vocabulary down to meaningful tokens.

Because the process above produced over 2,000 features (or unique words), an unsupervised algorithm called Latent Dirichlet Allocation (LDA) was run to cluster meaningful words into common topics. This time, the spaCy and gensim packages were used instead of sklearn.

See my LinkedIn post which briefly explains how to read and understand the model.

This chart indicated that:

Songs about romance and heartbreak were always popular, but they have become markedly more popular in later years.
On the other hand, hardcore music and songs about life are seeing a decline.
Explicit and Latin music have seen an uptick recently and this could be due to how music is becoming more inclusive and globalized.

Model Performance

Audio Dataset

The models that were run on the audio dataset were Logistic Regression and various tree models. The Logistic Regression achieved an accuracy score of 81.2%, while the Random Forest model held the best score of 87.6%, followed by AdaBoost with 83.9%. The columns with string data were dropped, so the models were fit on a total of 37 features.
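The comparison can be reproduced in outline as below. The data is synthetic (generated with 37 features only to mirror the feature count), and the hyperparameters are defaults or guesses, so the printed scores will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the audio dataset.
X, y = make_classification(n_samples=600, n_features=37, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "ada_boost": AdaBoostClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```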

The strongest coefficients for the Logistic Regression model included track popularity, genre screen, and artist popularity for top positive and duration, release year, and energy for top negative. (Note: genre screen was part of the new features created by aggregating the Spotify genres. This particular genre included songs that appeared “on screen”, including movie soundtracks or song covers from shows like Glee or American Idol). The top features for the tree models by permutation importance included track popularity, artist popularity, artist followers, release year, and duration.
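For readers unfamiliar with permutation importance, here is a small sketch on synthetic data. The feature names are borrowed from the list above purely as labels; the values are random, and the target is constructed so that the first column dominates by design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = [
    "track_popularity", "artist_popularity", "artist_followers",
    "release_year", "duration_ms",
]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
# Synthetic target: driven mostly by column 0, slightly by column 1.
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time and measure how much the test score drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

A feature that dominates the ranking this strongly is worth a second look, since it may simply be a proxy for the target.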

The following conclusions were made after evaluating these top features:

The higher the track popularity score, the more likely a song is to be classified as a Billboard hit song. In hindsight, this feature should have been dropped from the model because it is most likely collinear with Billboard Hot100 status (i.e., if it’s popular here, it’s probably popular there).
Songs with a longer duration (ms) are less likely to be classified as a Billboard song.
It matters who sings the songs. Artist popularity and artist followers are significant features.
Songs with a high level of energy (described as “fast, loud, and noisy”) are less likely to reach the Billboard charts.
The earlier a song was released, the more likely it was to end up on Billboard Hot100. While this sounds counterintuitive, the EDA on release year vs. billboard year showed that listeners still like to reminisce on older songs, especially during the holidays.

Lyrics Dataset

The lyrics dataset was evaluated through the Logistic Regression and Support Vector Machine models. After preprocessing the lyrics, the dataset ended up having over 2,000 tokens (or words). To prevent overfitting, a method called Principal Component Analysis (PCA) was used for dimensionality reduction. However, all models, with and without PCA, performed similarly at 64% accuracy.
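The with/without PCA comparison can be sketched like this. The corpus is randomly generated from small word lists (so it is far easier than real lyrics), and the number of components is an arbitrary assumption.

```python
import random

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic stand-in corpus, not the project's lyrics.
random.seed(0)
HIT_WORDS = ["love", "dance", "party", "baby", "whiskey"]
MISS_WORDS = ["dark", "cold", "alone", "rain", "pain"]
SHARED_WORDS = ["night", "heart", "time", "road"]


def fake_lyric(words):
    return " ".join(random.choices(words + SHARED_WORDS, k=20))


lyrics = [fake_lyric(HIT_WORDS) for _ in range(40)] + [fake_lyric(MISS_WORDS) for _ in range(40)]
labels = [1] * 40 + [0] * 40


def to_dense(matrix):
    """PCA here is fed a dense array, so convert the sparse TF-IDF output."""
    return matrix.toarray()


def build(with_pca):
    steps = [TfidfVectorizer()]
    if with_pca:
        steps += [
            FunctionTransformer(to_dense, accept_sparse=True),
            PCA(n_components=5, random_state=0),
        ]
    steps.append(LogisticRegression(max_iter=1000))
    return make_pipeline(*steps)


results = {}
for with_pca in (False, True):
    cv_scores = cross_val_score(build(with_pca), lyrics, labels, cv=5, scoring="accuracy")
    results[with_pca] = cv_scores.mean()
    print(f"PCA={with_pca}: mean accuracy {results[with_pca]:.3f}")
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned on each training fold only, which avoids leaking information from the validation fold.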

When comparing the top positive and negative coefficients, there appeared to be a noticeable difference between the two. Overall, the positive coefficients evoked a fun-spirited feeling while the negative coefficients had a dark/gloomy context. It was also interesting to see that the choice of words made a big difference (e.g. dawg vs. dude) or that the topic of alcohol (e.g. whiskey, drink, beer) was more popular than that of drugs (e.g. “weed”).

Final Thoughts

For future model enhancement, the following would be considered:

Use GridSearch to find the optimal max_df and min_df parameters
Adjust the pre-processing method for text noise removal
Try lemmatization instead of stemming
Add new features that measure or identify: rhyme, repetition, seasonality, artist pitch, artist gender, solo vs. group, etc.
Remove track popularity as a feature since it is collinear with Billboard Hot 100
Address bias in the data samples from Kaggle and data.world
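As an example of the first item, a grid search over min_df and max_df might look like the sketch below. The corpus is synthetic and the candidate values are arbitrary starting points, not recommendations.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in corpus, not the project's lyrics.
random.seed(0)
HIT_WORDS = ["love", "dance", "party", "baby", "whiskey"]
MISS_WORDS = ["dark", "cold", "alone", "rain", "pain"]
SHARED_WORDS = ["night", "heart", "time", "road"]


def fake_lyric(words):
    return " ".join(random.choices(words + SHARED_WORDS, k=20))


lyrics = [fake_lyric(HIT_WORDS) for _ in range(40)] + [fake_lyric(MISS_WORDS) for _ in range(40)]
labels = [1] * 40 + [0] * 40

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step names prefix the parameter names: "<step>__<parameter>".
param_grid = {
    "tfidf__min_df": [1, 2, 5],
    "tfidf__max_df": [0.6, 0.8, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(lyrics, labels)
print(search.best_params_, round(search.best_score_, 3))
```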

These models not only validated our initial claim of evolving music trends but also provided actionable insights for music labels and artists.

There are no hard boundaries in music genre. Listeners are ready to embrace something new and more personalized.
Songs are spending less time on the Billboard charts, indicating a high turnover rate. While this may seem like listeners get tired of songs easily, it also opens up doors for other artists and songs to enter the chart.
Listeners still like to reminisce on older songs, especially during the holidays. Producing covers of older songs may seem risky and unoriginal, but data has shown that these cover songs can still reach the Billboard chart.
Songs about romance/heartbreak or songs with fun-spirited word choices are more likely to reach the Billboard chart. On the contrary, loud/noisy songs with lyrics that evoke negative feelings have the opposite effect.
Artist popularity was an important feature in all the models. For new artists, it may help to feature or collaborate with a recognizable artist.


Hailey Lee https://www.ejhailey.com