
data-monkey · 6 years ago


AO3 stats project: correlations

Okay. Now for some really fun stuff: correlations between different categories of metadata (like, correlations between ratings and tags, character tags and genre tags, etc). And as with the previous post in this series, some of the content I discuss may not be strictly work-appropriate.

Thanks to @eloiserummaging for beta reading these posts; any remaining errors are my own. A Python notebook showing the code I used to make these plots can be found here.

All right. We've got a list of top tags now. How do those tags relate to each other? That is, if I have a work labeled "Fluff", does that change how likely it is that that work will also be labeled "Angst"? I'm plotting, here, a matrix that answers that question directly. I can compute how often "Fluff" and "Angst" would appear together if they were just randomly assigned to all 4.3 million works that I collected metadata for. The blocks I'm showing below are colored by whether the actual number of times "Fluff" and "Angst" appear together is greater or less than that expectation. (We do it that way, instead of just counting the raw number of connections, because otherwise it looks like everything is correlated with "Fluff", just because there are a lot of works labeled "Fluff" in the data set.)

If I pick two tags--say, "Fluff" on the bottom, and "Angst" on the left--I can follow them to where they intersect, and the color of that little block tells me whether they're correlated. Things that are pink are less likely than you'd expect to appear together, while things that are green are more likely. The diagonal line is always that pale green-grey color because, by definition, "Romance" appears with "Romance" exactly as often as you'd expect, and the graph is symmetric across that diagonal line because it doesn't matter if we take "Fluff" then "Angst" or "Angst" then "Fluff."
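To make the "compare to expectation" idea concrete, here is a minimal sketch in Python. It is not the code from the notebook: the function name, the toy works, and the baseline itself (expected = count of A × count of B / number of works, i.e. what you'd get if tags were assigned independently) are my assumptions about what "randomly assigned" means here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(works):
    """Return {(tag_a, tag_b): observed / expected} for tag pairs that co-occur.

    `works` is a list of tag sets. The expected count assumes tags are
    assigned independently: count(a) * count(b) / number_of_works.
    Pairs that never appear together are simply absent from the result.
    """
    n_works = len(works)
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in works:
        unique = sorted(set(tags))  # sorted so each pair has one canonical key
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    ratios = {}
    for (a, b), observed in pair_counts.items():
        expected = tag_counts[a] * tag_counts[b] / n_works
        ratios[(a, b)] = observed / expected
    return ratios


# Toy data, invented for illustration (not the AO3 data set).
toy_works = [
    {"Fluff", "Humor"},
    {"Fluff", "Humor"},
    {"Fluff", "Angst"},
    {"Angst", "Hurt/Comfort"},
    {"Angst", "Hurt/Comfort"},
    {"Fluff"},
    {"Humor"},
    {"Angst"},
]
toy_ratios = cooccurrence_ratios(toy_works)
```

A ratio above 1 would be a green block (together more often than expected) and a ratio below 1 a pink one. In the toy data, "Angst" and "Fluff" each appear on 4 of 8 works, so you'd expect 2 shared works; there is only 1, so the ratio is 0.5.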

So one really interesting thing is that this plot is mostly green (so things are correlated). Not only do these tags appear a lot, but they appear together more than you’d expect, and even when they’re anti-correlated--that is, when being labeled one makes you less likely to be labeled another--it’s not by very much. The strongest correlation is between “Romance” and “Humor”, so I guess the rom com is alive and well! Angst and hurt/comfort also appear together a lot, which I suppose makes sense.

We can make correlation plots like this for other things. How about pairing types?

Per the AO3, “Multi” means “more than one kind of relationship, or a relationship with multiple partners”, and “Other” means “everything not covered by the other labels”. The structure of this is kind of interesting, too. Remember, pink means things are anti-correlated (appear together less often than expected) and green means they’re correlated (appear together MORE often than expected), so M/M is the Lone Ranger of pairing types, making all other pairing types less likely if it's included. Even more than Gen, which you’d expect to exclude other categories!

Now for the REALLY fun stuff: how do all of these things correlate with other things? Here's an obvious one: ratings vs tags. No more symmetry, because we're showing different things on the two axes.

Not too surprised by this: “Smut” has a really strong relationship with rating, because most works of erotica deserve the higher ratings. The other tags are much less correlated with rating; Fluff and Humor incline to lower ratings, Established Relationship to higher ratings, and the others are kind of in the middle.

Do ratings and tags correlate with pairing type?

Hmm. Interesting. Romance is way more likely to be F/M than you’d expect. (Do we not write as many M/M romances, or do we just call them something else? A couple of friends also pointed out to me that this might mean “romance” as in “the publication genre of romance” not as in “romantic plotlines generally”, which makes sense and would make them more F/M-heavy given publication trends.) Established Relationship is very skewed to M/M. Gen anti-correlates with most of the tags you’d expect. Apparently only single-pairing romantic relationships can be fluffy. F/F is neither as funny nor as angsty as chance would indicate.

For ratings versus pairing type, I think the pattern can be explained this way: Gen is way more likely to be a low rating, and the trend you see for most other things is just that we’re comparing to the average--if Gen is way more likely to be rated General Audiences than is typical, then the other pairing types have to be slightly less General Audiences than you’d expect to make up for it. (This argument doesn’t necessarily apply to the other correlations I was showing, because in those plots there are a bunch of tags I’m not showing and because non-ratings tags can appear together.)
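Here's a small worked example of that "make up for it" argument, as a sketch with invented numbers. It assumes every work has exactly one pairing type and exactly one rating, which is a simplification (real works can carry more than one pairing type), and the function and variable names are mine, not the notebook's.

```python
from collections import Counter

# Toy (pairing type, rating) labels, one pair per work; invented numbers.
toy_labels = (
    [("Gen", "General")] * 6 + [("Gen", "Explicit")] * 1
    + [("M/M", "General")] * 3 + [("M/M", "Explicit")] * 5
    + [("F/M", "General")] * 2 + [("F/M", "Explicit")] * 3
)


def deviations(labels):
    """Return {(pairing, rating): observed - expected} for one-label-per-work data.

    Expected counts assume pairing type and rating are independent:
    row_total * column_total / number_of_works.
    """
    n = len(labels)
    row_totals = Counter(pairing for pairing, _ in labels)
    col_totals = Counter(rating for _, rating in labels)
    observed = Counter(labels)
    return {
        (pairing, rating): observed[(pairing, rating)]
        - row_totals[pairing] * col_totals[rating] / n
        for pairing in row_totals
        for rating in col_totals
    }


toy_dev = deviations(toy_labels)

# Within one rating, the surpluses and deficits cancel out.
general_total = sum(d for (_, rating), d in toy_dev.items() if rating == "General")
```

In this toy table Gen has a surplus of General-rated works (6 observed against 3.85 expected), so M/M and F/M both come out below expectation for that rating, and the three deviations add up to zero. That cancellation is forced by the arithmetic only when the categories are mutually exclusive, which is why the argument doesn't carry over to freeform tags.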

How about the top relationship tags--do they correlate with anything?

Huh. Well, most of these ships have preferentially high ratings--I think that’s the same effect as in the rating and pairing correlation: things without a romantic/sexual relationship have lower ratings, so on average the works containing relationships will have a higher rating. That’s not universal--look at Magnus/Alec, for example--but it’s common. The other two obvious things here are 1) Dean/Sam really skews to high ratings, 2) apparently Harry/Louis fans reject the rating system.

Okay, that’s...less interesting than I was expecting. Lots of Harry/Louis smut, lots of Keith/Lance modern AUs. The most likely established relationship is Derek/Stiles. Magnus/Alec and Yuuri/Victor are the fluffiest. Not much romance, except for Draco/Harry, and that pairing also has an unusual amount of humor.

What about character tags?

Hmm. Looks like there are a lot of teen-rated Marvel works, and Supernatural leans towards the higher ratings (which we already knew).

This basically just repeats stuff we already noticed in the relationships plot, I think. One thing I didn’t notice up there is that John and Sherlock are not that likely to be tagged in Established Relationship works, which is kind of interesting as they’re long-term partners (not necessarily romantic partners) in most versions of the canon.

How about characters and pairing type...are there characters that appear more often in one kind of pairing than you'd expect based on randomness? (Note that all these characters appear most in M/M stories, because those are by far the most common--this is just asking a relative question about how much they appear in other kinds of stories.)
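To show the difference between "appears most in" and "appears more than you'd expect in", here is a tiny sketch with made-up counts for a hypothetical character. The function name and the numbers are assumptions for illustration only, and it treats each work as having a single pairing type.

```python
def relative_share(char_counts, overall_counts):
    """Return {pairing: character's share / overall share} per pairing type.

    A value above 1 means the character shows up in that pairing type more
    often than the data set as a whole does; below 1 means less often.
    """
    char_total = sum(char_counts.values())
    overall_total = sum(overall_counts.values())
    return {
        pairing: (char_counts.get(pairing, 0) / char_total)
        / (overall_counts[pairing] / overall_total)
        for pairing in overall_counts
    }


# Invented counts; not taken from the AO3 data.
overall = {"M/M": 500, "F/M": 300, "Gen": 200}
character = {"M/M": 45, "F/M": 40, "Gen": 15}
lift = relative_share(character, overall)
```

This hypothetical character has more M/M works than anything else (45 of 100), but M/M is half of the whole toy data set, so the relative value for M/M is 0.9. F/M is where the character is over-represented, at about 1.33. This is the same observed-over-expected ratio as before, just written as a ratio of shares.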

Mostly not super interesting, I have to say. Steve, Tony, Natasha, and Harry Potter are all more likely than usual to appear in poly relationships or in F/M stories, apparently, and Stiles and Castiel are less likely than usual to appear in gen works.

Finally, a really fun thing (that I have to link to an external site to do, because Tumblr doesn’t like JavaScript in posts). Instead of just looking at the top 10 tags, here are the top 100 tags portrayed as dots, arranged in a connected graph: the dots represent tags (with the popularity of the tag represented by its size), and they’re connected by lines whose thickness indicates how often the two things appear together, relative to chance. You can also use this kind of setup to work out sets of interconnected tags that are more closely tied to each other than to the other tags. I colored those sets with different colors, so you can identify them. Hovering your mouse pointer over a dot should tell you which tag it is.
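The post doesn't name the grouping algorithm, so the following is only a crude stand-in to show the general idea: treat tags as nodes, keep an edge when two tags appear together well above chance, and call each connected clump a group. The threshold of 1.5, the function name, and the toy ratios are all assumptions, and a real community-detection method would do better on a dense graph.

```python
def tag_groups(ratios, threshold=1.5):
    """Group tags joined by edges whose observed/expected ratio exceeds threshold.

    `ratios` maps (tag_a, tag_b) to an observed/expected co-occurrence ratio.
    Uses a small union-find structure; returns sorted lists of tag names.
    """
    parent = {}

    def find(tag):
        parent.setdefault(tag, tag)
        while parent[tag] != tag:
            parent[tag] = parent[parent[tag]]  # path halving keeps trees shallow
            tag = parent[tag]
        return tag

    for (a, b), ratio in ratios.items():
        root_a, root_b = find(a), find(b)  # also registers both tags as nodes
        if ratio > threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for tag in parent:
        groups.setdefault(find(tag), set()).add(tag)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Invented ratios for illustration; not measured from the AO3 data.
toy_edge_ratios = {
    ("Fluff", "Humor"): 2.0,
    ("Fluff", "Romance"): 1.8,
    ("Angst", "Hurt/Comfort"): 2.5,
    ("Angst", "Fluff"): 0.6,
    ("Angst", "Character Death"): 1.9,
}
toy_groups = tag_groups(toy_edge_ratios)
```

With these toy numbers the weak "Angst"/"Fluff" link gets dropped, leaving one fluffy group and one angsty group. The actual groups in this post came from a different, unnamed algorithm run on the real top-100 tag graph.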

Here are the blocks of tags that the algorithm found:

Alternate Universe, Alternate Universe - College/University, Alternate Universe - High School, Alternate Universe - Human, Alternate Universe - Modern Setting, Alternate Universe - Soulmates, Christmas, Crack, Cute, Domestic Fluff, Fluff, Fluff And Humor, Humor, Light Angst, One Shot, Romance, Tooth-Rotting Fluff

Angst, Blood, Canon-Typical Violence, Canonical Character Death, Character Death, Dark, Death, Depression, Emotional Hurt/Comfort, Grief/Mourning, Hurt/Comfort, I'm Sorry, Magic, Minor Character Death, Nightmares, Post-Traumatic Stress Disorder - PTSD, Sad, Self-Harm, Suicidal Thoughts, Torture, Violence

Action/Adventure, Alternate Universe - Canon Divergence, Canon Compliant, Character Study, Crossover, Drabble, Drama, Family, Friendship, Future Fic, Post-Canon, Pre-Slash, Spoilers

Alcohol, Alpha/Beta/Omega Dynamics, Established Relationship, Explicit Language, Explicit Sexual Content, First Time, Fluff And Smut, Jealousy, Kissing, Mpreg, Polyamory, Sex, Sexual Content, Slash, Smut

Anal Fingering, Anal Sex, Bdsm, Blow Jobs, Bondage, Dirty Talk, Dom/Sub, Dubious Consent, Hand Jobs, Masturbation, Oral Sex, Plot What Plot/Porn Without Plot, Rimming, Rough Sex, Spanking

Angst With A Happy Ending, Developing Relationship, Eventual Smut, Falling In Love, First Kiss, Fluff And Angst, Friends To Lovers, Friendship/Love, Happy Ending, Implied Sexual Content, Love, Love Confessions, Mutual Pining, Other Additional Tags To Be Added, Pining, Slow Build, Slow Burn, Swearing, Unrequited Love

I love this! To me, those sets look like: fluffy plot-based tags, violence and disturbing content tags, more action-oriented plot-based tags, less explicit vanilla-ish erotica tags, more explicit or kinky erotica tags, and romance. That’s so cool. (Not everything makes sense--why is Drabble where it is?--but still, cool.)

Or in other words: some numerical routines correctly identified the porn. :)

Up next: what gets kudos?

#fan stats #ao3 #fanstats
