Rapcast

Forecasting Rappers' Hometowns Based On Their Lyric Corpus


Abstract

While many musicians like to pay homage to where they’re from, rappers do so as if it’s a cardinal rule. They represent their hometowns constantly in their music. Some even go so far as to tattoo symbols of their cities onto themselves (see: Drake’s tattoo of Toronto’s area code, and Lil Wayne’s tattoo of the fleur-de-lis, a symbol from the New Orleans city flag). While rappers often mention their hometowns by name in their music, they also display their regional identity in more subtle ways. By using slang specific to where they’re from, rappers naturally root their work in the vocabularies of their neighborhoods.

Our project predicts where rappers are from based on their lyric corpus. The project highlights the linguistic diversity of the rap genre and the regional specificity of rappers' vocabularies. As rap continues to grow in popularity, this project offers evidence that rappers continue to represent their hometowns and, at least in the lyrical sense, stay true to their roots.

In order to complete our project, we trained a Linear SVM learner on the lyric corpuses of 300 of the world's most popular rap artists. Rappers' lyrics were scraped from Genius and processed in a bag-of-words format. We found that rappers' lyrics can be predictive of where they're from: our Linear SVM model achieved 63% accuracy on average in predicting rappers' regions, and 56% accuracy on average in predicting rappers' cities. The Linear SVM model consistently outperformed Random Forest, Multinomial Naive Bayes, and Logistic Regression classifiers.

As we changed the granularity of our target class from region (ex: Northeast, West) to city (ex: New York City, Los Angeles) to sub-city (ex: Brooklyn, Queens, Compton, Long Beach), the accuracies of our classifiers generally dipped. We found that while lyrics can be extremely predictive of a rapper's region or metropolitan city, pinning down a rapper's specific town of origin was difficult with the amount of data we had available.

Data

We started our work by creating a dataset of rap lyrics. We picked out 300 of rap’s most popular artists from the 1980s to present day, and gathered lyric corpuses for each artist.

To do this, we first queried the Genius API for song URLs. For each artist, we collected the URLs of up to 500 of their most popular songs, with each URL linking to the song's annotated lyrics on the Genius site. This step yielded over 78,000 song URLs. We then scraped each URL to extract its lyrics, creating a lyric corpus for each of our 300 artists, as well as for thousands of other artists featured on their songs.
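Our full scraper isn't reproduced here, but a minimal sketch of the URL-collection step against the public Genius API might look like the following (the access token is a placeholder, and error handling is omitted):

```python
# Sketch of collecting up to 500 popular song URLs per artist via the Genius API.
import requests

API = "https://api.genius.com"
HEADERS = {"Authorization": "Bearer YOUR_GENIUS_TOKEN"}  # placeholder token

def popular_song_urls(artist_id, max_songs=500):
    """Collect up to max_songs Genius song URLs for one artist, by popularity."""
    urls, page = [], 1
    while len(urls) < max_songs and page is not None:
        resp = requests.get(
            f"{API}/artists/{artist_id}/songs",
            headers=HEADERS,
            params={"sort": "popularity", "per_page": 50, "page": page},
        ).json()["response"]
        urls += [song["url"] for song in resp["songs"]]
        page = resp["next_page"]  # None once the last page is reached
    return urls[:max_songs]
```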

With our lyric data finally scraped, we manually searched through Wikipedia to determine the hometowns of our 300 artists. We assigned artists a region (ex: Northeast), city (ex: New York), and sub-city (ex: Queens), so that we could later adjust the granularity of our target class in training and testing.

In addition, we removed stop words from our lyric corpuses and processed the corpuses into a bag-of-words format. We used logarithmic term frequencies for our word vectors and normalized each vector to sum to one. We also weighted words with tf-idf in order to reflect the importance of each word to its respective lyric corpus.
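In scikit-learn, this preprocessing can be expressed with a single vectorizer. The sketch below assumes artist_corpuses is a hypothetical list holding one concatenated lyric document per artist:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# artist_corpuses: hypothetical list, one concatenated lyric string per artist
vectorizer = TfidfVectorizer(
    stop_words="english",  # remove stop words
    sublinear_tf=True,     # logarithmic term frequency: 1 + log(tf)
    norm="l1",             # normalize each vector so its entries sum to one
)
X = vectorizer.fit_transform(artist_corpuses)  # sparse tf-idf matrix
```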

[Interactive map: hover and click on the dots to see our selected artists' hometowns.]



Playing With Our Data

Excluding stop words, below are the 20 most common words in English according to Google's Trillion Word Corpus, as well as the 20 most common words in rap lyrics as determined by our data of over 78,000 songs.

English

  • new
  • home
  • page
  • search
  • free
  • information
  • time
  • site
  • may
  • news
  • use
  • see
  • contact
  • business
  • web
  • also
  • help
  • get
  • pm
  • view

Rap lyrics

  • got
  • get
  • ni**a
  • know
  • ni**as
  • sh*t
  • f*ck
  • b**ch
  • go
  • back
  • see
  • money
  • make
  • man
  • ya
  • love
  • never
  • say
  • want
  • wanna



Before training on our data, we generated the most correlated unigrams and bigrams for each region, city, and sub-city in our dataset, as well as the 20 most common words (excluding stop words) of each artist. [Interactive drop-down: search for an artist to view their most common words, along with the most correlated unigrams and bigrams of their city.]

We highly encourage reading up on any unigrams or bigrams that catch your eye. While not all of the most correlated words relate directly to their cities, many of them are interesting to look into further. We used Urban Dictionary to find much of this information.

  • Chicago - "3hunna": slang for the set of the Black Disciples gang located in the south side of Chicago.
  • Chicago - "lamron": slang for Normal, the street that runs through the neighborhoods of many Chicago rappers (notice that lamron is just normal spelled backwards).
  • Philadelphia - "state prop": refers to State Property, a division of Jay-Z's Rocafella label whose members are all originally from Philly.
  • New York - "flatbush" and "bronx": both neighborhoods in New York City.
  • Detroit - "313": the city's area code.
  • Houston - "swishahouse": a music label created based on the popularity of "chopped and screwed" music, which originated in the south side of Houston.
  • Los Angeles - "parental discretion": a reference to N.W.A, a highly influential rap group from L.A.

Methods

We tested four different learning methods over datasets with varying numbers of discrete classes, using Python's scikit-learn package to train all of our classifiers. Because our data was processed into a bag-of-words format, we decided to experiment with Linear Support Vector Machine, Multinomial Naive Bayes, Logistic Regression, and Random Forest classifiers for our multi-class text classification.

  • Linear Support Vector Machines: Linear classifiers that maximize the margin between classes. They work well with high-dimensional data like our bag-of-words vectors, and the learned margins give a useful measure of separation between classes.
  • Multinomial Naive Bayes: Applies Bayes' rule with strong conditional independence assumptions between features. Easy to build and fast on large datasets, and despite its simplicity it is often competitive with more sophisticated classification methods on text.
  • Logistic Regression: Assigns data points to discrete classes by modeling class probabilities directly, making it a discriminative counterpart to Naive Bayes. Widely recommended as a strong baseline for text classification.
  • Random Forest: An ensemble learning method that trains many decision trees on random subsets of the data and aggregates their votes. Helpful for identifying the most relevant features for a classification.

Before training, we ranked which words (features) were most correlated with each location. It was important for us to understand the most correlated unigrams and bigrams for each region, city, and sub-city in our data because it gave valuable insight into which words were most informative, insight that would prove harder to back out of our trained models.
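We don't reproduce our ranking code here, but a one-vs-rest chi-squared test over tf-idf features is a standard way to compute such correlations in scikit-learn. In this sketch, docs and labels are hypothetical stand-ins for our per-artist corpuses and city labels:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# docs / labels: hypothetical per-artist lyric corpuses and city labels
vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", sublinear_tf=True)
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())
y = np.array(labels)

for city in sorted(set(labels)):
    scores, _ = chi2(X, y == city)         # one-vs-rest chi-squared scores
    top = terms[np.argsort(scores)[-10:]]  # ten highest-scoring n-grams
    print(city, "->", ", ".join(top))
```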

We trained all our models using 10-fold cross validation on our dataset containing the lyric corpuses of all 300 artists and their respective regions (ex: West, Northeast). We then increased the granularity of our experiment by changing our class from an artist’s region to an artist’s city, and then to an artist’s sub-city. In addition, we trained our models on artists from only one particular era (ex: 1980s, 1990s) in order to illustrate the evolving nature of rap slang.
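As a sketch of this evaluation loop (assuming X and y are the tf-idf matrix and region labels from our preprocessing), the comparison can be run with scikit-learn's cross_val_score; the hyperparameters below are illustrative defaults, not necessarily the exact ones we used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# X, y: tf-idf matrix and region labels from the preprocessing step above
models = {
    "Linear SVM": LinearSVC(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV accuracies
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```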

Results

Linear Support Vector Machines (Linear SVMs) proved to be especially effective for our data. Our bag-of-words features already place each artist in a very high-dimensional space, and in that space Linear SVMs work to maximize the margins between classes, creating a clearer line of separability than any other model we trained. Since we wanted to predict an artist's origin, maximizing the space between classes gave us cleaner divisions for classification. Consistent with this, datasets with more artists and fewer location classes yielded the most accurate predictions in our results.

Below are our results using all artists and classifying according to region, city, and sub-city. As described above, our classifiers generally decreased in accuracy as the number of location classes increased, and as can be seen below, Linear SVMs consistently outperform all other classifiers regardless of the granularity of the classification task. Given the difficulty of this task, we think the accuracies achieved by our Linear SVM are fairly good: we could predict artists' origins with greater than 50% accuracy based on the language they use. Using logistic regression as a baseline, we found that the difference in accuracy achieved by our Linear SVM model was statistically significant at the region level (p = 0.0325), the city level (p = 0.0005), and even the sub-city level (p < 0.0001):

All artists, classified by region:

The highest prediction accuracy is reached using Linear SVC; this is also the highest accuracy achieved at any level of granularity. The high accuracy of every model here is due to the large number of artists spread across a small number of possible classifications.

All artists, classified by city:

The accuracy of every model decreases due to the increase in classes. The highest prediction accuracy is still reached using Linear SVC.

All artists, classified by sub-city:

The accuracy of every model decreases once again, again due to the increase in classes. The highest prediction accuracy is still reached using Linear SVC.
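We don't show our exact significance-testing code; one common choice, presented here purely as an assumed reconstruction, is a paired t-test over the matched per-fold accuracies of the two models being compared:

```python
# Hypothetical reconstruction of the baseline comparison: a paired t-test over
# matched 10-fold cross-validation accuracies (not necessarily the exact test
# behind the p-values reported above).
from scipy.stats import ttest_rel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# X, y: tf-idf features and region labels, as in the Methods section
svm_scores = cross_val_score(LinearSVC(), X, y, cv=10)
lr_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
t_stat, p_value = ttest_rel(svm_scores, lr_scores)  # paired across the 10 folds
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```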

Looking Forward

We think this project has room to improve, and in the future we would like to train our models on a larger dataset, which we expect would improve accuracy. This larger dataset would include new artists as well as more songs for each artist.

We also think it would be interesting to experiment with Recurrent Neural Networks to build a language-sampling model for a given artist based on their lyric corpus. Lastly, we think it would be worthwhile to run similar experiments on other genres of music to better contextualize our findings.

Another thing of note: when we trained on data from one particular decade, our accuracies were higher, suggesting that rap slang evolves and that it's easier to predict a rapper's origins when training a model within a single era of music. Looking ahead, a larger dataset would allow us to further test this hypothesis and ensure the higher accuracies aren't a product of small test sets. See our data by decade here.

About Us

This project was made by Michael Cahana, Josh Klein, and Maxine Whitely. While we all contributed to all aspects of the project, Michael oversaw data scraping and processing, Josh oversaw model training, and Maxine oversaw website building. All the code used to create this project can be found on GitHub. This project was completed in EECS 349: Machine Learning at Northwestern University.