Author: John Cambefort

Date: 11/15/2021

This report is a polished excerpt of a report I wrote for my Statistical Learning course at Middlebury.

The first half of this Observable notebook I created contains slightly more elegant versions of some of the graphs below.

We are working with a dataset containing a list of songs that were on Spotify’s Top 200 Charts at some point between January 1st, 2020 and August 16th, 2021; the dataset was uploaded by Kaggle user Sashank Pillai. I’m interested in reverse-engineering the “Popularity” metric that Spotify attributes to songs. Specifically, I want to know which variables are weighted most heavily when determining song Popularity (a score ranging from 0, least popular, to 100, most popular).

I’ll begin with a few exploratory graphs of the data, proceed to filter down the data (for reasons explained below), and then build an optimized random forest model that best predicts a song’s popularity given the training data. We will then look at the random forest’s Variable Importance table to determine which variables are more heavily considered by Spotify’s algorithm for determining a song’s popularity ranking!

Our dataset was last updated by the Kaggle user on August 16th 2021 (“Version 2”).

library(tidyverse)
library(caret)
library(doParallel)
library(scales)
library(foreach)
library(lubridate)

# Local set up & register parallel backend
numCores <- detectCores() - 1
cl <- makePSOCKcluster(numCores)
registerDoParallel(cl)

spotify <- read_csv("spotify_dataset.csv")
## Rows: 1556 Columns: 23
## -- Column specification ------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (8): Week of Highest Charting, Song Name, Artist, Song ID, Genre, Relea...
## dbl (14): Index, Highest Charting Position, Number of Times Charted, Artist ...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
options(scipen = 10000)

Let’s start by briefly exploring our dataset.

spotify %>%
  ggplot() +
  geom_point(aes(y = Streams,
                 x = Popularity)) +
  ggtitle("Song Stream Count and Popularity") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) + # display in millions
  theme_bw()

We notice that, to make it on the Top 200 Spotify charts, songs tend to have a Stream count of at least ~5 million. That being said, it does look like a song’s stream count is not very closely correlated with Popularity: there are many songs with stream counts of roughly 5 to 10 million which all have wildly different Popularity values (ranging from 25 to 85). Beyond Popularity = 85, Popularity increases (somewhat) linearly as Stream count increases exponentially. But this accounts for few songs in our dataset.

Let’s try looking at Highest Charting Position (lower is better / higher on the charts).

spotify %>%
  ggplot() +
  geom_point(aes(y = `Highest Charting Position`,
                 x = Popularity)) +
  ggtitle("Song Popularity & Highest Charting Position (lower is better)") +
  theme_bw()

Likewise, we see a huge chunk of songs with similar Highest Charting Positions that have wildly different Popularity scores. Ideally, we’d want to see some sort of exponential or linear relationship between a predictor variable and our response variable (Popularity).

We note that in the above two graphs, there are a number of songs for which Popularity is roughly equal to 0. Further investigation reveals that these are songs which were released very recently, i.e. some time within the last month before the dataset’s creation. These songs are clearly outliers: Spotify’s model presumably hasn’t yet settled on a Popularity value for them (they’re brand new, or unscored for other reasons), so it defaults to 0. To handle this, we’ll filter out songs released after July 21st 2021, a little less than a month before the dataset’s creation.

There are also a few songs that date back to the 20th century and to the early 2000s. To simplify things, we’ll only look at songs released in the last couple of years, i.e. after January 1st, 2020.

Lastly, we’ll filter out any songs that were released after they peaked on the charts (which shouldn’t be possible), as well as songs which charted more weeks than possible given their release date. I won’t deny that this dataset is a little confusing, and I’m not quite sure how those songs ended up in it or how those values were defined.

We’ll be looking at most of the variables provided in the original dataset, except for unique identifier variables (e.g., Song ID, Artist name or Song name), and I’ve also excluded Genre and number of Weeks Charted (quite simply, I didn’t have time to finish converting the genres to a usable format, and pivot_wider() was taking too long to wrestle with given the data).
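
For reference, here’s a rough sketch of the kind of Genre conversion I had in mind; it isn’t used anywhere below, and it assumes each Genre value is stored as a string like "['dance pop', 'pop']".

# Sketch only (not run below): expand Genre into one indicator column per genre.
# Assumes Genre strings look like "['dance pop', 'pop']".
genre_wide <- spotify %>%
  select(`Song ID`, Genre) %>%
  mutate(Genre = str_remove_all(Genre, "\\[|\\]|'")) %>% # strip brackets and quotes
  separate_rows(Genre, sep = ",\\s*") %>%                # one row per (song, genre) pair
  filter(Genre != "") %>%
  distinct() %>%
  mutate(present = 1) %>%
  pivot_wider(names_from = Genre, values_from = present, values_fill = 0)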

# Clean up the data
clean_spotify <- spotify %>%
  mutate(`Release Date` = as.Date(`Release Date`)) %>%
  select(-`Song Name`,
         -`Artist`,
         -`Song ID`, # drop unique identifiers (not useful as predictors)
         -`Weeks Charted`,
         -`Genre`, # Couldn't figure out an easy way to format Genre in time for the deadline.
         -`Index`) %>%
  na.omit()

# Remove songs that charted more weeks than physically possible, given their release date.
# A song cannot have charted more times than the number of weeks between its release date and August 16th 2021 (the date the dataset was last updated).
clean_spotify <- clean_spotify %>%
  filter(round(as.numeric(difftime(as.Date("2021-08-16"), as.Date(`Release Date`), units="weeks"))) >= `Number of Times Charted`)

# Preserve the start of the week during which a song peaked on the charts. Call this HighStart.
# Later, we also add a column (called timeToHigh) giving the number of days between a song's
# release date and when it peaked on the charts.
highest_chart_week_start <- str_split_fixed(clean_spotify$`Week of Highest Charting`, "--", n=2) %>%
  data.frame() %>%
  mutate(HighStart = as.Date(X1)) %>% # convert to Date object
  select(HighStart)

# Add HighStart back on to the dataset
clean_spotify <- cbind(clean_spotify, highest_chart_week_start) %>%
  select(-`Week of Highest Charting`)

# We will only look at songs released in the last two years that are not too recent, so filter
# out songs that don't correspond to the criteria (also filter out songs that were released after
# they peaked on the charts, because this isn't possible).

final_spotify <- clean_spotify %>%
  mutate(`Release Date` = as.Date(`Release Date`)) %>%
  filter(`Release Date` < as.Date("2021-07-21")) %>% # drop very recent songs (the Popularity ~ 0 outliers)
  filter(`Release Date` > as.Date("2020-01-01")) %>% # keep only songs from the last ~2 years
  mutate(timeToHigh = as.numeric(difftime(`HighStart`, `Release Date`, units="days"))) %>%
  filter(`Release Date` <= `HighStart`) # there's no reason a song should hit the charts before it's released

This leaves us with 980 songs, which is enough to work with.
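
A quick sanity check on the row count:

nrow(final_spotify) # should report 980 rows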

We’ll now build our random forest model to detect variable importance. The code below identifies a few optimal hyperparameters to use for our model. We’ll be using the ranger package, as it is generally faster than the original randomForest package.

best_results <- NULL

# Try every min.node.size from 1 to 100; for each value, caret picks the best mtry from the grid below.
system.time({
  for(i in 1:100) {

    rf <- caret::train(Popularity ~ .,
                data = data.frame(final_spotify),
                method = "ranger",
                tuneGrid = data.frame(mtry = c(2, 5, 10, 12, 13, 14, 15, 16, 18, 20),
                                      splitrule = "variance",
                                      min.node.size = i),
                importance = "impurity")
    
    best_mtry <- rf$bestTune$mtry
    best_min_node_size <- rf$bestTune$min.node.size
    
    best_results <- rbind(best_results, rf$results %>% filter(mtry == best_mtry))
  }
})
##     user   system  elapsed 
##  192.968    4.954 1096.351
index <- which(best_results$RMSE == min(best_results$RMSE))
best_results[index, ]
##    mtry splitrule min.node.size     RMSE  Rsquared      MAE    RMSESD
## 26   16  variance            26 4.263817 0.7939816 3.221774 0.2215335
##    RsquaredSD     MAESD
## 26 0.01893885 0.1442048

We’ve identified our optimal training parameters for the random forest, and the model yields some surprisingly good results. The root mean squared error (RMSE) is as low as ~4.264, which means our predictions are off by roughly that many Popularity points on average. Likewise, 79.40% of the variation in the original data’s Popularity values can be explained by our variables, meaning that the model fits the data quite well. In our scenario, we won’t be testing the model on new data or trying to predict the Popularity of songs it hasn’t been trained on; rather, we’re mostly interested in seeing which variables the model relies on most to explain Popularity, so ‘overfitting’ the data isn’t a major concern here.

After also testing the extratrees option for splitrule, we observed that variance was optimal.
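
Out of curiosity, here’s an optional plot (not part of the original write-up) of how the best cross-validated RMSE varied across the min.node.size values tried in the loop above:

# Each row of best_results holds the best mtry for one min.node.size value.
best_results %>%
  ggplot(aes(x = min.node.size, y = RMSE)) +
  geom_point() +
  geom_line() +
  ggtitle("Best cross-validated RMSE by min.node.size") +
  theme_bw()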

Let’s now build the model using the optimal parameters, and then examine its Variable Importance table!

optimal_rf <- caret::train(Popularity ~ .,
                data = data.frame(final_spotify),
                method = "ranger",
                tuneGrid = data.frame(mtry = best_results[index, ]$mtry,
                                      splitrule = "variance",
                                      min.node.size = best_results[index, ]$min.node.size),
                importance = "impurity")

varImp(optimal_rf)
## ranger variable importance
## 
##   only 20 most important variables shown (out of 27)
## 
##                            Overall
## Number.of.Times.Charted   100.0000
## HighStart                  82.0593
## Release.Date               23.5023
## Streams                    21.0299
## Highest.Charting.Position  17.9335
## timeToHigh                  7.8641
## Artist.Followers            5.0995
## Speechiness                 4.1047
## Loudness                    3.6248
## Duration..ms.               3.1701
## Energy                      3.0525
## Liveness                    2.9178
## Acousticness                2.8180
## Danceability                2.5113
## Valence                     2.2298
## Tempo                       1.7286
## ChordD#/Eb                  0.9223
## ChordC#/Db                  0.2777
## ChordF#/Gb                  0.1343
## ChordA#/Bb                  0.1303

The model claims that Number of Times Charted is one of the most important variables in predicting Popularity. Let’s check out the correlation between these two variables:

cor(final_spotify$Popularity, as.numeric(final_spotify$`Number of Times Charted`))
## [1] 0.4706765

The value above is the correlation coefficient between Popularity and Number of Times Charted (a value closer to 1 implies a stronger positive correlation between the two variables). Let’s see how it compares to the correlation coefficient between Popularity and HighStart:

cor(final_spotify$Popularity, as.numeric(final_spotify$HighStart))
## [1] 0.6178207

It seems that HighStart, i.e. the date at which a song first peaked on the charts, has a considerably higher correlation coefficient with Popularity than Number of Times Charted does. Why, then, does the random forest rank Number of Times Charted as more important than HighStart?

Here is the relationship between HighStart and Popularity:

final_spotify %>%
  ggplot(aes(x = `HighStart`,
                 y = Popularity)) +
  geom_point() +
  ggtitle("Song Popularity & HighStart") +
  geom_smooth() +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This graph suggests that songs generally lose Popularity over time. Indeed, we see that songs that peaked a while ago (i.e., that have been around for a while) have lower Popularity scores at the time this dataset was generated. In fact, it’s reasonable to say that HighStart is quite similar in nature to Release Date.
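
As a quick check on that last claim (not something shown in the original output), we could compute how strongly HighStart tracks Release Date:

# How closely does HighStart track Release Date? A coefficient near 1 would mean
# the two variables carry largely the same information.
cor(as.numeric(final_spotify$HighStart), as.numeric(final_spotify$`Release Date`))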

Let’s quickly look at the correlation coefficient between Popularity and Release Date.

cor(final_spotify$Popularity, as.numeric(final_spotify$`Release Date`))
## [1] 0.5452958
final_spotify %>%
  ggplot(aes(x = `Release Date`,
                 y = Popularity)) +
  geom_point() +
  ggtitle("Song Popularity & Release Date") +
  geom_smooth() +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The above two graphs look essentially the same, and they tell us that the older the song, the lower its Popularity score at the time the dataset was captured; the newer the song, the higher its Popularity. That generally makes sense: newer songs that are (or recently were) on the Top 200 Charts are considered ‘hot’ and are currently (or recently were) popular.

This also tells us that Spotify might build some sort of decay into its determination of Popularity, whereby a song’s Popularity generally decays over time.
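
As a rough, purely illustrative check of that idea (this wasn’t part of the original analysis, and a single linear fit obviously can’t recover Spotify’s actual formula), one could regress Popularity on each song’s age at the dataset’s snapshot date; a negative slope would at least be consistent with some decay.

# Rough sketch only: does Popularity fall off with a song's age at the snapshot date?
# Uses the dataset's last-updated date (2021-08-16) as the reference point; the
# daysOld column is a hypothetical helper, not part of the original analysis.
decay_check <- final_spotify %>%
  mutate(daysOld = as.numeric(difftime(as.Date("2021-08-16"), `Release Date`, units = "days")))

summary(lm(Popularity ~ daysOld, data = decay_check))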

One question that remains is why our model gives so much importance to the number of times a song has charted. A hypothesis is that this variable is actually related to a song’s release date: the longer a song has been around, the more likely it is to have charted many times, so the model can tell that it is older.

We can try graphing these two to have a look:

final_spotify %>%
  ggplot(aes(y = `Number of Times Charted`,
             x = `Release Date`)) +
  geom_point() +
  ggtitle("Release Date & Number of Times Charted") +
  geom_smooth() +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

That graph makes sense: the longer a song has been around, the more time it has had to spend on the charts, so the greater the number of times it has charted.

But hang on – let’s plot out Popularity against Number of Times Charted, and color by Release Date.

final_spotify %>%
  ggplot(aes(x = `Number of Times Charted`,
             y = `Popularity`,
         color = `Release Date`)) +
  geom_point() +
  ggtitle("Popularity, Number of Times Charted & Release Date") +
  theme_bw()

This is interesting. Even though we previously found that older songs have (naturally) charted more, and that newer songs are more popular, here we see that songs that have charted many times (roughly 10 or more) also tend to be more popular.

What this last graph tells us is that newer songs haven’t charted much but have high Popularity scores, while some older songs (certainly not the majority) have charted a lot and, consequently, have high Popularity scores. That’s reasonable: “oldies but goodies” are still popular. An example is Dua Lipa’s “Break My Heart”, with a Popularity score of 80 and a Release Date of March 2020.
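
We can pull that example straight out of the original (uncleaned) data, assuming the title and artist are spelled this way in the CSV:

# Look up the example song in the original data (exact spelling of the title/artist is assumed).
spotify %>%
  filter(str_detect(`Song Name`, "Break My Heart"),
         str_detect(Artist, "Dua Lipa")) %>%
  select(`Song Name`, Artist, Popularity, `Release Date`, `Number of Times Charted`)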

These “oldie but goodie” songs are still listened to, to the point that Spotify chooses not to decay their Popularity too much, since they stay on the charts so much of the time.

As a result, it’s fair to say that all of the most important variables in our model are related to time: either how recent a song is (new hits smash the charts), or how persistently a song has stuck on the charts. Our tentative finding is that Spotify’s Popularity metric seems to decay over time once a song drops off the charts.