I'm a data scientist throwing natural language processing techniques at hip hop lyrics
OBNOXIOUS TOPIC MODEL LABEL FAVORS
Today, I tried to see if a computer could “guess” some of the different things that hip hop is about. This is a shallower question than you might think -- I’m not trying to get my laptop to understand anything about the political or cultural climate in which hip hop exists. I just wanted to see if it could comb through my hip hop corpus and give me a reasonable summary of the ideas hip hop songs tend to touch on.
To do this, I used something called a topic model. Topic models are mostly used to extract recurring ideas out of larger datasets -- say, a massive corpus of Tweets or news articles. They compare the distributions of words in each document to determine words that are disproportionately important, and then are able to group similar disproportionately important words together. The basic idea is that, when you look at the words that the algorithm clusters together, you’ll be able to get a rough idea of what the entire collection of documents is “about”. Importantly, this model acknowledges that one document (or set of documents) can be “about” multiple things (e.g., the way this blog is about both NLP and hip hop).
Topic models are unsupervised learning algorithms, so they output unlabeled clusters of data. As a result, initially reading the output of a topic model is unintuitive: the output won’t say “love is one topic of hip hop songs”, but will instead give you a cluster of words that represent or connect to a topic (e.g., love, oh, girl, come, back, life, need, good, want). Some of the clusters will be repeated and/or nonsense. Others will be insightful enough that you, as a human, can find a common thread between each word in the cluster. That thread is your topic. Whether or not two different people are able to see the same common thread across a cluster is a good indication of how the topic model is doing.
Here are some reasonable clusters that the model spit out:
Life and death: man (0.0076), tech (0.0056), come (0.0051), gonna (0.0047), die (0.0047), god (0.0041), would (0.0040), world (0.0040), back (0.0035), time (0.0033), got (0.0032), said (0.0031), say (0.0030), better (0.0030), stop (0.0029), still (0.0029), live (0.0028), around (0.0028)
Raps about rapping: rap (0.0067), mc (0.0061), shit (0.0060), mic (0.0057), rhyme (0.0054), rhymes (0.0053), em (0.0053), black (0.0051), time (0.0048), check (0.0042), rock (0.0037), kid (0.0035), style (0.0035), hip (0.0034), new (0.0034), hop (0.0032), never (0.0032), back (0.0031)
Love-ish: love (0.0500), oh (0.0192), baby (0.0182), yeah (0.0157), make (0.0130), come (0.0128), feel (0.0124), time (0.0117), say (0.0091), need (0.0085), good (0.0084), gonna (0.0083), boy (0.0065), world (0.0064), take (0.0063), think (0.0062)
Songs you won’t play around your parents: shit (0.0130), fuck (0.0110), nigga (0.0109), ass (0.0104), wit (0.0087), gotta (0.0086), make (0.0082), bitch (0.0072), da (0.0064), man (0.0062), say (0.0060), niggas (0.0057), aint (0.0054), come (0.0051), take (0.0049), never (0.0047), high (0.0044), hoes (0.0043)
These results come from asking the topic model to produce 50 clusters and to run through the dataset 15 times (each pass through improves the results slightly). I told the model to ignore stopwords (like the) and also a set of specialized hip hop stopwords, as determined by the LLR and weight algorithms I posted about earlier.
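If you want to see what that setup looks like in practice, here’s a minimal sketch using gensim’s LDA implementation (the code in my repo is different -- this is just the general shape; song_tokens and hiphop_stopwords are placeholders for the tokenized corpus and the stopword list):

```python
from gensim import corpora, models

# song_tokens: placeholder, one list of lowercase words per song
# hiphop_stopwords: placeholder for the specialized stopword list
docs = [[w for w in song if w not in hiphop_stopwords] for song in song_tokens]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

# 50 clusters (topics), 15 passes through the dataset
lda = models.LdaModel(bow, id2word=dictionary, num_topics=50, passes=15)

# the top words (and their weights) in each cluster
for topic_id, words in lda.show_topics(num_topics=50, num_words=18, formatted=False):
    print(topic_id, [(w, round(p, 4)) for w, p in words])
```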
We can also use topics to compare subsets of the data. For example, we can run the topic modeler on only female artists in the dataset (just so you know how I’m doing here, only about 10% of the artists in my dataset are female :/). Notably missing were the raps about rapping and the songs you won’t play around your parents topics, but there’s a very good chance that’s due to a sampling error (only 10% of the artists in my dataset are female).
Run the world: imma (0.0158), life (0.0140), change (0.0130), shit (0.0111), back (0.0098), make (0.0094), money (0.0079), love (0.0078), time (0.0072), never (0.0072), need (0.0068), work (0.0062), big (0.0061), bitch (0.0059), new (0.0057), world (0.0055)
Who run the world (girls): queen (0.0115), baby (0.0104), come (0.0072), oh (0.0066), girls (0.0062), time (0.0059), make (0.0059), take (0.0046), keep (0.0044), give (0.0043), head (0.0043), never (0.0042), yeah (0.0042), mean (0.0038), say (0.0037)
Great night out: love (0.0220), night (0.0197), right (0.0194), feel (0.0193), yeah (0.0188), away (0.0158), life (0.0152), oh (0.0141), give (0.0117), nobody (0.0116), need (0.0110), good (0.0109), club (0.0108), boy (0.0107), crazy (0.0103), looking (0.0099), tell (0.0098), tonight (0.0098), say (0.0094)
Just straight up sexytimes: nigga (0.0185), ass (0.0146), dick (0.0123), hoes (0.0108), man (0.0100), back (0.0094), niggas (0.0084), pussy (0.0081), tha (0.0074), fuck (0.0070), baddest (0.0070), bitches (0.0067), make (0.0054), bitch (0.0051), say (0.0046), baby (0.0045), tongue (0.0045), damn (0.0045), girl (0.0042)
If you want to play around with the topic model, it’s available on github. (Full disclosure: the code for this is slightly modified from a long-ago course assignment -- it’s great to play with, but there are better, much more sophisticated tools available, like MALLET.)
TRAINING DATA-MAKER, ANTI-HESITATOR
[You thought I forgot about you, didn’t you? It’s a rough time for everyone right now, and so, particularly recently, it’s been easy to let projects like this fall by the wayside. It’s hard to argue that this blog is as important as, say, protesting the illegal detainment of lawful American residents. But, as a friend of mine said: we’re fighting to be able to live the lives we want, and living well is a form of resistance, too. Besides: hip hop often gives a voice to the oppressed. It’s a predominantly Black art form with undeniable Muslim influences that was pioneered by an immigrant. If there was ever a time to celebrate this kind of art, well...it’s now.
I’ll try to post once or twice a month (so: every 3ish weeks) moving forward. In the meantime, here’s a post that I somehow forgot about until now...]
My last post talked about the neural network that I built to generate novel hip hop lyrics. If you want a rough overview of how that model works, read it. This post talks about how the model performed during different stages of the training process, and how I worked to improve it.
Recall that neural networks train through repeated exposure to data. They start out by giving each feature, and each interaction, a random weight. Each time the model makes a prediction, it evaluates whether or not that prediction was “right” (i.e., whether the predicted output matches the actual output) and adjusts the weights, using (something like) gradient descent, so that future predictions will be more accurate. This is why, broadly speaking, models that have been trained on larger datasets, and for longer periods of time, perform more accurately.
I actually wrote and trained several models before settling on the final version, and I want to start by looking at the first draft. This model was trained on a (very small) dataset of 100k characters, because that was the limit of what my computer could nimbly handle. Does 100k characters sound like a lot? It isn’t. I started with a relatively simple model -- metaphorically, it was equivalent to a brain with a smaller number of neurons. Generally: fewer neurons mean a simpler hypothesis, but also faster training times. If you want to know more about how the number of neurons (or units) in a model contributes to the model’s predictive power, The New York Times’ write-up of Google Translate has a really good overview.
Because of the kind of model I chose (an LSTM-RNN) and the library I used to implement it (Keras), my model trained in epochs -- full passes through the dataset. After each epoch (in this case, exposure to all ~100,000 patterns of 101 characters), I established a “checkpoint” that stored the weights associated with the model at that point. Then I ran the model with those weights. I used the outputs of these runs (as well as, you know, actual mathematical accuracy metrics) to determine how well the model was doing after each stage of training.
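In Keras, that checkpointing is just a callback handed to the training call -- something like this sketch, with model, X, and y standing in for the compiled network and the encoded training data (the file name pattern is illustrative, not my actual code):

```python
from keras.callbacks import ModelCheckpoint

# save the weights at the end of every epoch, so each stage of training
# can be reloaded later and sampled from
checkpoint = ModelCheckpoint("weights-epoch-{epoch:02d}-{loss:.4f}.hdf5",
                             monitor="loss", save_best_only=False)

model.fit(X, y, epochs=30, callbacks=[checkpoint])
```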
Here’s how the first model did after 5 training epochs. Again, [seed text] is bolded inside brackets. Model generated text follows. Are you ready?
[similar to saying mama s baby s daddy maybe when we had sex i was in the mercedes a]nd when i cac bucy mh in i m ne bor oht duai se to me mh bol mh oo i oh the loal oh i tou pan a bia l a n a lii io she bar h gon t gan th toe datche ...
Ouch. It’s halfway between English and...Vietnamese? Obviously, we’re not doing very well right now.
Here’s how the same model looked after 15 training epochs:
[i m from the belly of the beast remember i barely used to] lame the toak mo bot t ale the toie aaasnz i make her dance thas sae io toe tooe the toal [...] what happened shas iapp nn thet the saad i mote the bitch i mote the siie [...] i want a whip and a chain i want a whip and a chain
Medium ouch. It almost starts out strong: the first word it produces is actual English, even if the sentence remember I barely used to lame isn’t so promising. Then we get total gibberish. About halfway through, we get a little more actual English (what happened) mixed in with gibberish. At the end, it looks like something promising is happening, but...
...Here’s the model after the full 30 epochs:
[that cr nack yeah i got cr nack started from the trap now i rap] this the shit you play when you sipkin suck it up this the shit you play when you snoke a zip and up this the shit you play when you sippin out a cup [...] this the shit you play when you sippin out a cup this the shit you play when you sippin out a cup this the shit you play when you sippin out a cup [...]
What looked promising after 15 epochs was really a disaster after 30: the model was basically memorizing phrases that were disproportionately common (because, say, they were part of the hook of Bentley Truck) and rewarding itself for always predicting those phrases.
Part of the problem was the dataset -- I trained this practice model on too small a sample. But the way the model was built contributed to this problem, too: always outputting the same result is an underfitting (bias) problem. Basically, this was a symptom that the model was too simple. As a result, the hypothesis that the model held about how to generate language was also too simple to be insightful. Normally, a model that’s “too simple” is a model that doesn’t consider a large enough number of features, or that isn’t looking at enough feature interactions (because, say, it doesn’t have enough neurons to do so). Because my number of features was set (the number of unique characters in my dataset), I dealt with this problem by increasing the number of interactions the model looked at. Basically, I made the model better by giving it more neurons.
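For the Keras-curious, that fix is essentially a one-number change to the recurrent layer (the unit counts below are illustrative, not my exact numbers):

```python
from keras.layers import LSTM

# same layer, different capacity -- more units means more interactions the model can track
smaller = LSTM(128, input_shape=(100, 1))   # the simpler, underfitting first draft
bigger = LSTM(256, input_shape=(100, 1))    # the beefier final draft
```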
Here’s how the final draft (i.e., the more complex version) of the model performed during training. This obviously isn’t a perfect comparison, because the final model was also trained on a larger dataset, but still:
After 5 epochs:
[diamond rings you say i m bad at timing things so what s a man to do when all i hand to you is handed] soees in men mo more to matee ien so take the brain out leave the heart in take the brain out leave the heart in bany mote it paf the bott to the bott to the back in the same io a saaee the same the same i m the one the peelen the cooper so the [...]
It looks like English! Some of the repetition/memorization in the middle (take the brain out leave the heart in take the brain out leave the heart in) looks like the same underfitting problem we saw earlier, but it’s less overwhelming. And this is still the beginning of the training cycle.
After epoch 15:
[motherfucker tis the season to be servin what you doin mob mobbin like a motherfucker] e saaee the siie the saaee the siig she s soal the boas sook doon i meed mo mnee in the middle bou a b t in the way i maae i m gonna find a way to make it without you i m gonna hond in you ra babk to the boat rock the boat change positions new position new position new position stroke it for me stroke it for me stroke it for me [...]
So far, still promising.
And finally, after the last training epoch:
[you should have listened motherfucker when i said don t get it twisted don t get it twisted nigga] het said i m the one to gat mo money more to work to mess with me me yeah yeah yeah yeah yeah yeah yeah [...] let me see you can t say the same i m gonna be alright see the same black out black out black out black out black out black out black out
You can see that the underfitting problem never really went away -- it just stopped being so (glaringly) obvious. In an ideal world, I would have trained an even more complex model, meaning that it would have been able to maintain even more sophisticated hypotheses about language (particularly, I would have trained a model with an additional LSTM layer. This kind of model performed better in some initial tests). But there’s a complexity/training time trade-off to consider. Really, it was a complexity/cost trade-off, because I had to rent a GPU to train this thing. The above is about what I could get for about $10, or a really nice beer.
#underfitting#bias vs variance#lstm-rnn#machine learning#data science#science side of tumblr#nlp#notorious nlp#keras
Artificially Intelligent Rhetoric, Model Unpacked (Like a Tenement)
Some more artificially generated hip hop for you:
1. [forever official like words out the scripture] the scared shit the same the pac she she she she sat to make it shat
i m a coast and diner the same she said i m a la la la la la
la la la la la la la la la la la la la la la la la la la la la
2. [as i m bailin down the block that i come from still gotta pack] to the back to the back i m gonna take you back she s a brand new me she s a brand new you
Ok, the big question: how does the model produce this? It’s a big jump from a corpus of actual hip hop lyrics to a model that can (kind of) generate new additions to that corpus, so bear with me. Let’s start at the beginning:
What do I even mean when I call this thing “a model”? Specifically, my model is a neural network. If that wasn’t helpful, here’s a metaphor (if that was helpful, skip the next, like, 7 paragraphs):
Let’s say that you, a human being, are thinking about renting an apartment. There are a lot of features you might consider: how much the rent is, who the landlord is, what neighborhood it’s in, if it’s close to public transit, etc. For each person, some features will matter more than others; I ride my bike everywhere, so I don’t really care about living on the public transit grid. We’ll call this measure of importance a feature’s weight.
Features matter in isolation, but they can also interact with each other. Let’s say that I, an apartment-renter, care about the features “neighborhood”, “my job” and “rent”. “My job” and “rent” interact because my job determines how much money I can spend. “My job” and “neighborhood” interact because they determine the length of my commute. So when I’m looking at apartments, I weigh features like “neighborhood” and “rent” individually, but I also weigh their interactions. If my job changes, that might also change the neighborhoods and price ranges I’m interested in.
Neural networks are (at the risk of vastly oversimplifying) models that can capture the facts that (a) decisions are motivated by a (sometimes very large) set of features and (b) those features can interact with each other. Importantly, in a neural network, I don’t have to specify that there’s an interaction between “my job” and “neighborhood”. The model will try all possible interactions (of all possible features), determine the most important features and their interactions, weight those features and interactions accordingly, and use those determinations to make predictions.
That’s cool, but I still don’t get it. What do you mean by “predictions”? What kinds of features do hip hop songs have?
Basically, if I wanted a computer to be able to guess what apartments I’d want to rent, I would give it, say, 10,000 examples of apartments I did and did not want. I’d organize these into vectors (lists, basically), like [Back Bay, grad student, $3,000/month] -> [wouldn’t rent it].
A computer can’t understand words, so I’d actually convert each value in each vector into a number (e.g., “Back Bay”=50, “grad student”=10, $3,000=30, no=0, yes=1). Now I have an input vector [50, 10, 30] (representing an apartment in Back Bay for $3,000/month if I’m a grad student) and an output vector [0] (representing the fact that I wouldn’t rent that apartment). Then the program would, basically, try to multiply the input vector by another vector (the weights!) to get to the output vector. Depending on how close it got to the right answer, it would change the weights slightly, so that it could get closer next time.
Eventually, it would be able to predict that I would rent an apartment in Inman Square for $1500 a month if I were a data scientist, but not if I were a grad student.
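If you’d rather see that loop than read about it, here’s a bare-bones sketch of the multiply-compare-adjust cycle with made-up apartment data (a toy single-neuron version, not how a real network is implemented -- all the numbers below just follow the made-up encoding above):

```python
import numpy as np

# toy training data: [neighborhood, job, rent] encoded as numbers -> rent it? (1) or not (0)
X = np.array([[50, 10, 30],    # Back Bay, grad student, $3,000/month     -> no
              [20, 10, 15],    # Inman Square, grad student, $1,500/month -> no
              [20, 40, 15]],   # Inman Square, data scientist, $1,500     -> yes
             dtype=float)
y = np.array([0.0, 0.0, 1.0])

X = X / X.max(axis=0)                  # scale the features so the updates behave nicely
weights = np.random.randn(3) * 0.01    # start with small random weights

for _ in range(5000):
    pred = 1 / (1 + np.exp(-X.dot(weights)))  # multiply inputs by weights, squash to 0..1
    error = pred - y                          # how wrong was each guess?
    weights -= 0.1 * X.T.dot(error)           # nudge the weights to be a little less wrong

print(np.round(1 / (1 + np.exp(-X.dot(weights)))))  # ends up close to [0. 0. 1.]
```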
Let’s leave the real estate metaphor behind and get back into hip hop.
Picking features is actually pretty easy for generative language models like mine: the standard is to use either words or characters as features. I went with characters; I’ll talk about why later. My model was trained on about 5 million input:output pairs.
Input values were sequences of 100 characters from hip hop songs, and output values were the 101st character from each sequence. Again, these values were actually converted to numbers in the model (computers can do more with the integer 15 than they can with the character m). This allowed the model to see some sequence of characters and then make a guess about what character would come next.
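Building those pairs is just a sliding window over the text. A sketch (the file name is a placeholder, and the real preprocessing may differ):

```python
# placeholder file name -- the real corpus comes out of the (unreleased) scraper
corpus = open("lyrics.txt").read().lower()

chars = sorted(set(corpus))
char_to_int = {c: i for i, c in enumerate(chars)}   # e.g. "m" -> some integer

seq_length = 100
inputs, outputs = [], []
for i in range(len(corpus) - seq_length):
    seq = corpus[i:i + seq_length]          # 100 characters of context...
    next_char = corpus[i + seq_length]      # ...and the 101st character to be predicted
    inputs.append([char_to_int[c] for c in seq])
    outputs.append(char_to_int[next_char])
```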
Here’s where interacting features come into play. The model is now able to take random characters, like b and l, and make incredibly accurate guesses about how they might relate to each other. For example, our model will “know” that bl is a possible character sequence in the beginning or middle of a word, but not at the end. It does this entirely with statistics: the model (more or less) sees that “[space]bl” is a common sequence in the input data, and so is “[non-space]bl[non-space]” but “bl[space]” never occurs. From there, when it has to generate a character following “l”, if the previous character was “b”, it will know that “[space]” is a statistically unlikely character to come next, so it won’t generate it. By the same logic, it will also be able to capture some subtleties of grammar at the word and sentence level, like subject-verb agreement (e.g., knowing that it’s they were but not I were). It can do all this just by looking at enough data -- I haven’t specified any rules or structure, here. This is astounding. (This is also why machine learning systems live and die by their data. If a system doesn’t have enough data, it won’t be able to make good statistical judgements, and the output will reflect that).
It’s still not magic, though. The model will be very good with phonology and pretty ok with syntax, but it will start to falter as we deal with higher levels of structure. For example, we’ve already seen some lyrics that don’t make a lot of semantic sense (“she she she she sat to make it shat”). Additionally, there’s a lot of song-level structure to hip hop: intros, outros, choruses, hooks, etc. The model will be able to capture some of this -- it’s able to encode repetition, for example, and can approximate a rhyme scheme. But I haven’t built this model with any information about how long a song should be, or what components a song needs to have, so we shouldn’t expect the model to know any of that.
I mentioned this earlier, but this model generates lyrics character-by-character. It doesn’t generate lyrics word-by-word. I chose a character-by-character model because I hoped it would be more flexible when it came to capturing elements like rhyme, alliteration, etc. My intuition, essentially, is that it’s easier to rhyme a syllable than a word (and alliteration is easier if you’re just focusing on characters), and a character-by-character approach captures that. Again, the model doesn’t actually “understand” the concepts of rhyme or alliteration. It’s just learning the rules of a language -- the version of English used in hip hop -- and is able to extrapolate that this language likes having the same sequences of characters precede (rhyme) or follow (alliteration) the character “[space]”. I also wanted the model to occasionally generate novel words (because hip hop artists often coin new words), and that would be difficult to build into a word-by-word approach. That’s also why the model sometimes spells things wrong.
But how does any of this work, under the hood?
If you’re into this kind of thing: my model is a long short-term memory recurrent neural net (LSTM-RNN) built using Keras (and a Theano backend). I trained it on ~5M characters from my dataset for 30 epochs on an AWS GPU, because my MacBook Air was not designed for this shit. I only trained on a fraction (~6%) of the data because I had to pay for server time, and the entire Keras-ized dataset is huge (~40GB), so training the whole thing would have been expensive. Eventually, I’d like to tweak the model a bit, retrain it on a larger portion of the data, and see where that goes, but that will have to wait until I have more free time, or a salary.
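For the similarly inclined, the skeleton of that kind of setup looks something like this -- a sketch of a fairly standard Keras character-level LSTM, picking up the inputs/outputs lists from the sketch earlier in this post (layer sizes and training parameters are illustrative, not pulled from my repo):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.utils import np_utils

# reshape to (n_sequences, 100, 1) and scale the integer codes to 0..1
X = np.reshape(inputs, (len(inputs), seq_length, 1)) / float(len(chars))
y = np_utils.to_categorical(outputs)       # one-hot vectors over the character set

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))   # the "neurons"
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation="softmax"))   # a probability for every character
model.compile(loss="categorical_crossentropy", optimizer="adam")

model.fit(X, y, epochs=30, batch_size=128)
```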
If that was interesting, feel free to get your hands dirty with the code.
If you’re curious about how neural networks make (statistical) judgements in general, the answer is calculus and linear algebra. A lot of calculus and linear algebra. If you can stomach remembering how to multiply a matrix and take a derivative, this is a rabbit hole worth going down. Seriously. Something you learned in high school is powerful enough to (mostly) mimic human complexity. It’s beautiful.
NEURAL NETWORK-WORK-WORK-WORK
For today’s installment, I built a model (a neural network) that automatically generates novel hip hop lyrics. Here are some samples. Bolded text is seed data:
1. [and she be like oh yeah we turning up oh ye]ah neg the world cause she too good to be that bitch that she said i ya boo you get to see what you say i m alreadey that so got you to belleve it
oo the block i m at gittt to say it bring to the buot to the boot that s what i m saying
2. [cry me a river that leads to your ocean you never see me fall apartin the wor]ld she say i m a cock of the same the way i m all hot is she sane i m a crinck
the way i m all hot what she bout the hood is the hood a coope or the soiree i m all hot is she bout the brother i m a cock of the same i m a cock is she tamed
Obviously, these snippets aren’t perfect. There are some spelling errors (alreadey), and some parts just don’t “feel” right. But other parts are really good. “neg the world cause she too good // to be that bitch that she said i ya boo” is my favorite, though I’m also partial to “i m a cock is she tamed”.
Later I’ll post an in-depth write up of what this model is and how it works. But, for now, it can (almost) give us hip hop snippets! Stay tuned for more snippets as the model spits them out.
#machine learning#hip hop#neg the world#neural networks#natural language processing#linguistics#magic#women in tech#women in stem
Wu-Tang Clan word clouds, raw and weighted by LLR.
Explore the code on github or read about the analysis.
Log likelihood it raw
Here are some Wu-clouds, clouds built from the discographies of individual Wu-Tang members (specifically: Ghostface Killah, Method Man and ODB). See if you can figure out who they belong to (spoiler alert: there’s no twist).
We’ve already talked about why these aren’t great representations of an artist’s vocabulary: they’re confounded by things like language choice (some words get used a lot in English -- that’s why you see got in all the clouds) and genre (some words get used a lot in hip hop -- that’s why you see a lot of shits and fucks). These clouds look pretty similar, and good data visualizations answer more interesting questions. For example, I care much more about how Ghostface and Method Man’s favorite words compare than I do about how many times Method Man raps the word “yo”.
To get at that, we’d need to weigh each word by something like a hip hop score, but specific to Wu-Tang and its members (and which can take into account that all members are hip hop artists). Thankfully, if we’re willing to get our hands dirty with probability theory (oooooh, baby), there’s an easy way to do that: for each word that an artist uses, we determine its log-likelihood ratio (LLR). Bear with me while I outline this.
Basically, we can use LLR as a way to see how words are distributed across different artists’ vocabularies. To get the LLR of Inspectah Deck’s use of the word parole, I need to see how both Inspectah Deck and everyone else use the word parole, to figure out if the word has the same level of “importance” across everyone’s vocabularies. Doing this means figuring out the probability that any random word in an Inspectah Deck song will be parole (it’s about 0.0001%) and comparing that to the probability of any random word in a Method Man song being the word parole (it’s about 0.00002%), and so on. After this comparison, we get a score -- that’s the LLR. There’s a little bit of extra math needed to go from probabilities (chances that something might occur) to distributions (representations of occurrences; there’s a fairly mild formula for that), but that’s the gist.
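If you want the gory details, here’s a sketch of one standard formulation (Dunning’s log-likelihood ratio over a 2x2 table of counts -- I’m assuming that’s roughly the flavor used here; the actual implementation lives on github):

```python
from math import log

def llr(word_count_artist, total_artist, word_count_rest, total_rest):
    """Dunning-style log-likelihood ratio: is this word used surprisingly often
    by this artist, compared to everyone else?"""
    a, b = word_count_artist, word_count_rest    # 'parole' for Inspectah Deck vs. everyone else
    c, d = total_artist - a, total_rest - b      # all the other words on each side
    # (assumes the word shows up at least once on both sides)

    def loglike(k, n, p):                        # log-likelihood of k hits in n words at rate p
        return k * log(p) + (n - k) * log(1 - p)

    p_all = (a + b) / (a + b + c + d)            # one shared rate for everybody...
    p_artist, p_rest = a / (a + c), b / (b + d)  # ...vs. a separate rate for each side
    return 2 * (loglike(a, a + c, p_artist) + loglike(b, b + d, p_rest)
                - loglike(a, a + c, p_all) - loglike(b, b + d, p_all))
```

Words with big LLRs for a given member, relative to the rest of the corpus, are the ones that get blown up in that member’s cloud.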
Here are word clouds from the same artists, with words weighted by their LLRs. Now it’s really obvious who’s who, and there’s nary a yeah or shit in sight.
#log likelihood ratio#nlp#natural language processing#hip hop#wu tang clan#probability#science side of tumblr#science side of the internet#method man#ODB#ghostface killah
Hip hop artists, by vocabulary size and a totally arbitrary measure of how hip hop the words they use are (compared to the words other artists use).
Explore the code or read more about the analysis.
#arbitrary science#scatterplot#graph#matplotlib#matplotlib is terrible#i mean it works#but god is it ugly
Pushin’ Weight
Last time, I gave you a Drake-cloud that showed which words occurred the most frequently in the hip hop corpus. We got some expected values -- name some words more (stereotypically) associated with hip hop than bitches, hoe, and go hard -- but this information wasn’t actually computationally interesting: it was interesting to me because I’m a human who understands the context in which hip hop exists. This means that I can understand why it’s potentially meaningful that the word black occurs frequently, and ignore occurrences of the word can’t. To really get somewhere, we want to teach a computer -- a stupid box of circuits with basically zero knowledge of anything -- to be able to draw the same conclusions a person can.
That data also wasn’t necessarily linguistically interesting. I’m writing a dissertation right now, and I use the words look, even and still (all featured in the Drake-cloud) pretty often there. That’s not because I’ve decided to try to write my thesis in the style of hip hop (I, uh, haven’t); it’s because my dissertation is written in the same language as the hip hop corpus, and some words are just important to English. We’d also like to teach a computer to be able to ignore these words.
To figure out which words are important to hip hop, we need something more subtle. Following Iain Barr’s example (exactly, so that we’ll have comparable data at the end), I assigned each word an arbitrary measure of hip-hop-ness by comparing its distribution in the hip hop corpus to its distribution in a comparison corpus.
In plain terms: I counted how often a given word appeared in the hip hop corpus and in a comparison corpus, then divided to get a raw score. Then I took the log of the raw score to tighten up the output range -- I wanted to deal with hip hop scores ranging from about -10 to 10, not .0001 to 70,000.
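In code, the score works out to something like this (a minimal sketch -- I’m assuming the counts get normalized by corpus size, as in Barr’s write-up, and the log base and smoothing details may differ from what’s actually in the repo):

```python
from math import log

def hiphop_score(word, hiphop_counts, brown_counts):
    # hiphop_counts / brown_counts: word -> count dictionaries (placeholders here)
    p_hiphop = hiphop_counts[word] / sum(hiphop_counts.values())
    p_brown = brown_counts[word] / sum(brown_counts.values())
    # divide to get the raw score, then log to squash the range to roughly -10..10
    # (words missing from one corpus need special handling -- more on that gap below)
    return log(p_hiphop / p_brown)
```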
Picking a comparison corpus is actually a pretty loaded task. In an ideal world, we’d want something that perfectly reflects how (American) English is used by everyone (across, e.g., all dialect lines). I don’t think something like this truly exists. So, because sometimes you’ve just gotta hold your breath and dive in, I picked the Brown Corpus for comparisons.
Without further ado, the least and the most hip hop words in my corpus:
Not a look, still or even in sight! The Brown corpus was actually a better comparison source than I expected. It included some non-standard English words, so this (very simple) model was already able to capture the fact that dropping the final g in a word (particularly a verb) that ends in -ing is a very hip hop thing to do. The Brown corpus wasn’t perfect. It didn’t, for example, include about 140k unique words from the hip hop corpus, and that’s a big number. But the hip hop corpus has over 8 million unique words in it, so this is only about a 1.75% gap. There are a lot of important words in that gap, but there are also a lot of misspellings (unlukcly) and transcription errors (causedthe). We could do a lot worse.
I also gave a hip hop score to artists overall, based on their vocabularies. The hypothesis here is that artists who rap primarily about favorable expenditures and the onset of nationalism in Laos will have lower hip hop scores than artists who rap about a lil bitch tryin’ to get dope weed and shit. I was curious how this would compare with the vocabulary metrics from the last post (which coarsely distinguished between indie and mainstream artists), so I graphed the two against each other. See a bigger version here:
Here’s how that graph works: hip hop score is the x-axis, vocab size is the y. Each dot represents one of the 300+ artists; about 45 of those artists are labeled. There’s a weak correlation here between a larger vocabulary size and a lower hip hop score, which makes sense (there can only be so many words with high hip hop scores). We have some expected outliers, too: Pitbull, who straddles the genre line between rap and reggaeton, has a hip hop score on the lower end. Wu-Tang also continues to be mathematically interesting: its hip hop score is close to an average of all of its members’ (the graph only shows Raekwon, Ghostface and GZA, but the rest fit the pattern too).
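For the matplotlib-inclined, the graph is just a scatter plot with a handful of annotated points -- roughly this, with artist_scores standing in for the per-artist numbers computed above:

```python
import matplotlib.pyplot as plt

# artist_scores is a placeholder: {artist name: (hip hop score, vocab ratio)}
names = list(artist_scores)
xs = [artist_scores[n][0] for n in names]   # hip hop score
ys = [artist_scores[n][1] for n in names]   # unique words per song, averaged

plt.scatter(xs, ys, alpha=0.6)
for name in ("Pitbull", "GZA", "Ghostface Killah", "Raekwon"):   # label a handful of dots
    plt.annotate(name, artist_scores[name], fontsize=8)

plt.xlabel("hip hop score")
plt.ylabel("vocabulary size (unique words per song, averaged)")
plt.show()
```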
#notorious-nlp#wu-tang#data science#natural language processing#science side of tumblr#hip hop#linguistics
Meet the corpus
What does natural language processing have to say about a corpus of hip hop songs?
First, the corpus itself: I wrote a web scraper that downloaded lyrics from about 27,000 hip hop songs (juuuust under 13,500,000 words), or the discographies of about 300 artists. I deferred to experts (“Best of” lists, etc.) or algorithms (thanks, Spotify) when choosing artists for inclusion. It’s worth noting that this isn’t a big corpus, by NLP/machine learning standards (as a point of reference, Google recommends corpora with word or character counts in the billions for training their word2vec model), and I may have to expand it later, but let’s see how far these 27k songs get us now.
First they get us two word clouds, one in the shape of a generic clip art break-dancer, the other in the shape of Drake:
These are representations of pure word frequency (excluding very common stopwords like “the” and “in”) across all songs and artists, and it’s up to you to decide whether or not they can get at what hip hop is “about”. This data is also fairly unprocessed: I didn’t stem anything (so “hoe” and “hoes” are treated as two unique words) or collapse standard and non-standard spellings (so “dog”, “dogg” and “dawg” are all understood as distinct words). This is a pretty shallow overview; it just counts the data.
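The counting itself is nothing fancy -- a sketch (all_words is a placeholder for the tokenized corpus, and the actual drawing of Drake-shaped clouds is handled by a word cloud library):

```python
from collections import Counter

# all_words: placeholder for every token in the corpus, lowercased, unstemmed
stopwords = {"the", "in", "a", "and", "to", "of", "i", "you"}   # truncated list
counts = Counter(w for w in all_words if w not in stopwords)

print(counts.most_common(20))   # the words that end up biggest in the clouds
```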
Counting can still get us interesting places, though. Inspired by Polygraph, I looked at measures of vocabulary size across artists. Immediately: I don’t expect that Polygraph and I will have identical results, because we aren’t working with identical data sets or methodologies. The data in the Polygraph viz only looks at prolific artists, and regularizes itself by only looking at an artist’s first 35,000 lyrics. My corpus isn’t structured like that: it includes data from recent artists with fewer albums (Desiigner, gnash) as well as pop culture mainstays like Jay-Z and Kanye, who claim massive song lists. Counting up unique words across an artist’s collected works in my corpus would unfairly disadvantage newer artists and unfairly bolster more prolific ones.
To deal with this, I looked at the ratio of unique words in a song to total song length, averaged across songs. This carries the (potentially flawed) assumption that every song will be equally representative of an artist’s preferred vocabulary size (and that the number of unique words per song is a good mask of vocabulary size), but it also allows me to compare artists who have released 400 songs with artists who have only released 10 (this method will, in fact, slightly privilege newer and less prolific artists, since it’s easier to skew averages across a smaller dataset). If I were being very careful, I could have also normalized the data by song length (seconds) and speed (an artist’s average words per minute), because we’d expect an 8-minute song to have more unique words than a 3-minute song (and, likewise, a 3-minute song with 1,000 words to have more than a 3-minute song with 100 words), but I didn’t build that kind of metadata into the corpus.
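In code, the metric is about three lines (a sketch; songs_by_artist is a placeholder mapping each artist to their tokenized songs):

```python
def vocab_ratio(songs):
    """Average, across songs, of unique words per song divided by song length."""
    ratios = [len(set(song)) / len(song) for song in songs if song]
    return sum(ratios) / len(ratios)

# songs_by_artist is a placeholder: {"Madvillain": [["all", "caps", ...], ...], ...}
scores = {artist: vocab_ratio(songs) for artist, songs in songs_by_artist.items()}
```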
Here’s a chart sampling the ranking this metric produced. We have three artists from the bottom (smaller vocabularies), five from the middle (average vocabularies) and three from the top (larger vocabularies). What this chart says, basically, is that if we take any random 100-word sample from a Madvillain song, we should expect 59 of those words to be unique. With a similar sample from a Destiny’s Child song, however, we’d only expect 29 unique words.
This metric more-or-less replicates the Polygraph findings, at least at the high end. Aesop Rock (.562) is still in the top 3% of artists, as are Cunninlynguists (.556), keeping the likes of Madvillain (.586) and Earl Sweatshirt (.559) company. Notably, Polygraph’s Wu-Tang cluster still dominates the top: GZA (.553) and Ghostface Killah (.536) are in the 94th and 93rd percentiles (vocabularies larger than 94 and 93% of the sample population), respectively, while Raekwon (.527) and Wu-Tang (.521) are in the 91st and 90th. Method Man brings up the rear in the 85th percentile (.5), and that still ain’t nothing to fuck with.
It’s different at the other end. Artists with lower vocabulary scores include wildly popular names like Destiny’s Child (1st percentile, .291) and Beyonce (3rd percentile, .340), Rihanna (1st percentile, .316), and Aaliyah (4th percentile, 0.346). Seven out of 10 of the lowest scoring artists are female, but we see a lot of popular men down there, too. Shout out to Cee-lo (0.348), The Weeknd (0.351), and Fetty Wap (0.351), who are all in the 5th percentile, right next door to Usher (6th percentile, 0.352).
My gut says that this trend tells us more about differences between sub-genres than anything else. Artists writing for radio play are working with song-length constraints (mentioned earlier as a potential bias), and repeating lines make for catchy hooks. There’s also the possibility that commercial labels like Columbia give artists less creative freedom (or push their artists towards a pre-defined sound or song structure) compared to indie labels like Stones Throw or Tommy Boy. If this effect is real, it could be reflected here, but that’s all speculation. I’d be interested to see if this trend holds up across other music genres, though.
About
This is a blog about hip hop and NLP. Each post will look at a corpus of hip hop lyrics through the lens of some data science or machine learning technique.
I should clarify this up front: my expertise is on the tech side of this*. I’m not a member of any hip hop scene, by any stretch of the imagination. But it’s a genre that I love and want to know more about. Plus, if this kind of (semi-)serious scientific inquiry can be applied to Shakespeare, it can sure be applied to NWA. Plus plus, a lot of similar research is already going on. For a sense of what’s already happening, see everything Actual Fact is doing, Polygraph’s (gorgeous) visualization of rap vocabularies (among other thoughtful essays) and Degenerate State’s Natural Language Processing and Metal project.
All (well, almost all) the code used here will be released on github, with the exception of the web crawler I used to build the corpus (these things are fragile, guys). If you have recommendations or want to see a particular analysis done, message me! If you’re working on a similar project, get in touch!
*If you care about who I am (you shouldn’t): I’m a final-year PhD student in cognitive science. I’ve also had a number of industry jobs, most recently as a research scientist on a large-scale machine learning pipeline.