ao3datafan - Tumblr blog

ao3datafan · 6 years ago

Text

Aggregate Data and Meaningful Conclusions: A Response to Fansplaining’s Fandom Study

Before I get into this, I just want to give a quick update on a few things.

1) I am still looking for a new home for the AO3 DataFan project. Pillowfort may be viable in a few years, but it’s not currently viable as it lacks a lot of the key infrastructure needed to make these posts (and it’s chunky and buggy as hell but that’s a different problem). If you have suggestions, shoot them my way!

2) WOW Look at all these new followers! Where the heck did you all come from? Oh my gosh! When I started, I thought I’d maybe get around 50 or so followers, most of whom would be acafans themselves or adjacent.

3) I’m having all sorts of “fun” (read: horrible) times with my current analysis on the taxonomy of Big Name Fans. I knew it would be a tricky question to answer when I started (but honestly, why not attempt it anyway?) but I didn’t quite anticipate HOW complicated this would be. Haha! I’m actually working with a non-fandom-involved coworker on how to do some *insert technical data science talk here* with the data to see if we can’t get a conclusion from the data we have. However, we’ve both come to a very water-is-wet conclusion: BNFs are not defined solely by the popularity of the work they create. More on that when I finish the post.

Okay, with the announcements out of the way, let’s talk about data collection and fandom analysis, and why Fansplaining’s fandom study has left me feeling a little let down.

So I could have sworn I’d talked about univariate analysis before, but since I can’t find where, I’ll recap it for you here. Univariate analysis is analysis conducted on a single variable. It’s relatively straightforward (and boring) and it usually lends itself to making pie charts and bar graphs. This has been the prevalent trend in fandom analytics for a few years now. Blogs like fandometrics and projects like FandomStats use univariate analysis to reveal information about fandom on the broad scale. Most often, univariate analysis is performed on what’s called aggregate data.

The process of aggregating data and performing analysis on that aggregation does have its uses, but it’s largely the problem with fandom analytics that I started this project to address. Aggregation and univariate analysis can tell us the “WHAT” of the data, but it can’t tell us the “WHY/HOW.” Take, for example, FandomStats queries.

Above is a query on the “Fluff” tag on AO3. It’s… nice. You can see how authors chose to rate their fluffy works and if you read further down you can even see which fandoms are the “fluffiest” in terms of works. On the surface, there’s nothing wrong with this. It’s a very solid “WHAT” answer to the data. “What are people writing?”

Compare this to my analysis of “Fluff” versus “Angst” tagged works.

In this post, I used two variables to make a comparison. I tracked an individual work’s hit count compared to its use of either the “Fluff” or “Angst” tag. Strictly speaking, that makes it bivariate analysis (two variables) rather than univariate, but the more important difference is that it’s performed on non-aggregated data. In this case, you can compare how the tags influence the hit count of a story. It answers a “HOW” question. “How does a work’s tagging affect its hit count?”

For an example of multivariate analysis, see also my post on the relationship between the length of a fic and its hit count which attempts to answer “How does the chapter length affect hit count?”

Okay, so what does all this have to do with Fansplaining’s Fandom Study?

As I showed earlier, aggregate data can only be used in univariate analysis. It makes good pie charts and bar charts, but not much else. Aggregate data can tell you “WHAT” something is, but it can’t tell you “HOW” or “WHY” something is. In order to get to the heart of those, you need to know how data points interplay with one another. To do that, you need individual data points. Like an individual work’s metrics or an individual’s responses to a survey.
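Here’s a minimal sketch of that difference, using made-up survey answers (the two tropes, the answer labels, and every count below are hypothetical, not Fansplaining’s data). Two very different groups of respondents produce exactly the same aggregate totals; only the row-level answers show how the responses go together.

```python
from collections import Counter

# Hypothetical row-level answers: one (trope_a, trope_b) pair per respondent.
# Survey 1: everyone who likes trope A also likes trope B.
survey_1 = [("yay", "yay")] * 50 + [("nay", "nay")] * 50
# Survey 2: liking trope A tells you nothing about trope B.
survey_2 = ([("yay", "yay")] * 25 + [("yay", "nay")] * 25
            + [("nay", "yay")] * 25 + [("nay", "nay")] * 25)


def aggregate(rows):
    """Per-question totals: all that an aggregated release gives you."""
    trope_a = Counter(a for a, _ in rows)
    trope_b = Counter(b for _, b in rows)
    return trope_a, trope_b


def cross_tab(rows):
    """Joint counts of (answer to A, answer to B): needs individual rows."""
    return Counter(rows)


# The aggregates are identical...
print(aggregate(survey_1) == aggregate(survey_2))
# ...but the relationship between the two answers is completely different.
print(cross_tab(survey_1))
print(cross_tab(survey_2))
```

Once only the per-question totals are published, there is no way to tell these two surveys apart.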

Which is why I’m slightly annoyed with Fansplaining. Dear Flourish and Elizabeth – you conducted one of the largest surveys of fanfiction readers’ habits with 7,500 individual data points of users ranking their fanfiction reading preferences. I would literally KILL to be able to do that! That is a GOLD MINE of data.

So why the hell did you decide to aggregate it all and only release the aggregated results?

Now I want to be clear that I am not bashing Fansplaining’s study or their thoughtful and well written article explaining their results. They did a decent job even if it is frustratingly banal to someone like me who wants to understand the interplay between data points. It’s especially frustrating that the article itself even asks the kinds of questions that multivariate analysis can answer.

And while mpreg is widely disliked, pregnancy in general is met with a ¯\_(ツ)_/¯—a highly suggestive difference. We’ve got a lot of theories on why, but they’ll need to wait; it deserves a lot more space than we can give it here.

You know what would be a good start to answering this question? Knowing the demographics of your participants. And by demographics, I don’t mean whether they’re male, female, non-binary, black, white, Asian, young, old, etc. (although I would kill to know that stuff too). What I mean is knowing how they answered other questions in the survey. For example, is there a high correlation between people who answered “Yay!” for both mpreg and pregnancy? Did people who enjoy mpreg also tend to enjoy some mpreg-adjacent tropes such as omegaverse? Is pregnancy met with a warmer reception by people who prefer relationships involving a female character (het or F/F for example)? You COULD read any freeform text comments on the survey and attempt to get some answers from that, or you COULD do multivariate analysis. Or, even better, you could use that fancy data science technique called NLP and do both!

¯\_(ツ)_/¯

There’s also some data I would have killed to analyze on a purely selfish level. As a Domme in the BDSM community, I’m keenly interested in the interplay of power and relationships – who has power, how did they get it, and how do they maintain it? – so when Fansplaining reported that “slavery” was an almost universally reviled trope, I really wanted to know more about the psychology of why that is. Again, multivariate analysis could help us identify the relationship between how people feel about “slavery” tags and how they feel about other tags and tropes. For example, people who hate slavery might feel strongly negative towards fics about racism, which can be an indication that they dislike the implications of chattel slavery or that the trope hits closer to home than they want to deal with when they’re enjoying their leisure time. On the other hand, people who like slavery might come to it for much the same reason I do – because it’s an interesting study of how people negotiate power and relationships in an inherently unbalanced system. In that case, they may also enjoy omegaverse, prostitution (wherein the power is in the exchange of sex and money), or even teacher/student fics.

(I also really want to know what’s going on with centaurification. I’ve been in the BDSM community for 8 years and the online fandom community for 18 years. I thought I had seen it all, and yet I am completely stumped about what centaurification is.)

Alas, I may never have the answers to these questions. In my professional life, I’ve had clients hand me datasets upon datasets of aggregated data and then ask me to use sophisticated machine learning/artificial intelligence to glean insights for them. I’ve always managed not to laugh in their faces even if my eyebrow is developing a bit of a twinge. Instead, I patiently explain to them what I just explained to you guys, my wonderful dear followers. Maybe if I explain it often enough, someone will gift me with raw, unaggregated data of one of these surveys.

A girl can only hope.

(But seriously, I really would commit murder for a copy of the raw survey results.)

#data science #ao3datafan #fansplaining

94 notes

ao3datafan · 7 years ago

Text

Breathe, We’ve Been Here Before (An Open Letter to Fandom)

Take a deep breath, we’ve been here before. I joined fandom online in 2001 and since then have been an active engager with fandom communities as a fanfiction writer. In my seventeen years as a member of the online fandom community, I’ve been through two other similar purges – the Lemon War and Strikethrough. In 2003, Fanfiction.net decided to purge all NC-17 content from its website due to pressure from advertisers. In response, the websites Mediaminer.org and Adultfanfiction.net were created to fill the void. AFF itself was a direct response to the 2003 purge known as the Lemon War. In 2007, again due to pressure from advertisers, Livejournal removed blogs and communities that were tagged with “problematic” tags such as rape, abuse, and incest. In the process of deleting MUCH of then-fandom’s existence in this unannounced purge, Livejournal also deleted survivors’ groups and a community of online support for sexuality and trauma recovery. Dreamwidth was created in response to Strikethrough, while the Organization for Transformative Works was founded in 2007 during fandom’s “Never Again” moment with the goal of offering safety and haven for displaced fanworks.

You may know OTW by its flagship project, Archive of Our Own, which this blog studies for fun, but OTW is also instrumental in the protection and engagement of fandom communities in the larger world. OTW’s legal team regularly files briefs with courts of law defending fanworks as fair use under copyright law, and defends fandom communities from poorly written and inept laws that restrict fandom’s ability to participate in transformative culture. Your memes, your fanfiction, and your fanart are all examples of things that OTW serves to protect. When you donate to OTW, you not only keep AO3 alive and free from advertisements and commercialization, you also provide support to OTW’s legal services and their myriad other projects such as Fanlore and Transformative Works and Cultures (disclaimer – the paper I’m writing currently is for TWC).

Breathe; we’ve been here before. Several times in fact. It’s been 11 years since the last major fandom purge happened so I know for many younger members of fandom it can feel scary and overwhelming, but fandom will survive; fandom will recover. However, Strikethrough and the Lemon War left scars on fandom, scars that Tumblr’s purge will also leave. Whole communities will be lost and scattered to the winds. Friendships will change, and the online nature of fandom will be irrevocably shaped by the fall of yet another community. However, just as this is not the first time, this is also not the last time. As long as commercialization of fandom exists, we will continually be chased from our homes on the internet and need to find new homes. But we have persisted in the past and we will persist again.

Like many of you I am disappointed by Tumblr’s decision to wipe out “NSFW” content via an auto filter that seems to think my alternative fashion posts are porn (really???), but I am unsurprised. I knew this day would come, and I’ve wondered for a long time when fandom would leave and where it would go. The answer to the first question has come, but I’m not sure where we will go. However, wherever fandom makes its new home, I will follow, because fandom is my family and my home.

Thank you to all my followers and my apologies to you – in light of Tumblr’s decision to ban NSFW content and in solidarity with the affected blogs and members of fandom, AO3DataFan will no longer update on Tumblr. My current plan is to move to Pillowfort for the time being after the start of the new year. However, wherever the majority of fandom lands, I will follow and AO3DataFan will return.

Over the next few weeks, I will be archiving posts, answering asks, and preparing for the purge. You will see answered asks for DataFan questions simply saying “Acknowledged” to let you know that I have seen your ask and am planning to answer it in the future on the new website.

In the meantime, if you need to contact me or want to ask a question, I can be reached via email ([email protected]) or on Discord (lockea#2638).

Breathe; I know this is an uncertain time for many of you, but take it from a fandom elder – we have been here before, we will be here again, and we will always survive. Somehow, some way.

All my love,

Lockea Stone

#admin #goodbye tumblr

321 notes

ao3datafan · 7 years ago

Text

Not Dead, Just Tired

Hi all data fans! Welcome to all the new followers and my apologies for no data posts in the last month. I moved across the country so it’s been a very, very busy time for me. I have officially become a Californian. And it’s... well... it’s a trip.

That said, I’m still tracking to have the Markov Chain “Perfect Fanfiction” experiment finished by Christmas, so expect to see some very silly fanfiction generated in the next month or so. The LSTM is also coming along nicely, but I won’t have anything to show for it until March or so.

I’m working on two notebooks right now -- Genomic Signatures (what defines a BNF?) and Best Times to Post. I plan to get back to those as soon as I get my tower back up and running here in the next couple weeks.

In the meantime, remember that if you have any questions for Data Fan, please feel free to ask. Or, if you simply want the notebook repository, shoot me a signed ask and I’ll reply with the link.

Thank you all for your patience! :D

#admin

19 notes

ao3datafan · 7 years ago

Note

Aaah finally some big data! Your blog looks great i cant wait to read the rest of your posts! Out of interest what program do you use for your stats? Do you have a background in data analysis or are you just interested? 😂

Hey nonny! Thanks for the ask!

I use Python in Jupyter Notebook for my analysis, and the graphics are made with Plotly. The statistics are usually done with either SciPy Stats or pandas. I have a GitHub repo where I’m trying to remember to post notebooks as I complete them, but because it’s under my real name I’m only sharing the link with logged in users via private messages/private ask replies.

My background is actually in robotics and autonomous vehicle systems, so big data but in a very different way. Recently I did a career swap to data science so that’s my job title. I work for a Fortune 500 company as a data scientist focusing on sensor processing and computer vision -- so stuff I used to do when I was a roboticist. It’s actually really cool what the overlap between data science and AVS is. I’ve done everything from helping find which bridges and overpasses are most likely to collapse so that preventative maintenance can be performed on them by the Department of Transportation, to applying machine learning to an American Sign Language translation app, to helping government and humanitarian organizations identify critical disaster zones using satellite imagery so they can triage NGO support after a disaster.

Do I have any formal data science training? No, not really. A lot of big companies are going through workforce modernization efforts and training their staff as data scientists, or hiring people whom the company can train as data scientists, and I got picked up in the latter group by my current company. I actually just tested out of mandatory data science training at work (in part thanks to the “self study” I’d done for DataFan) so I guess I’m not a hobbyist, but I wouldn’t say I have a background in it either. XD

11 notes

ao3datafan · 7 years ago

Note

I'm not sure if this question kind of counts but I was kind of wondering if there might be a correlation between how popular a Fic is and how long the individual chapters are. Like whenever they update the chapters are like thousands of words long rather than tiny short ones.

Hey there! This is absolutely a question that “counts”, albeit one I’ve already answered. Check out the post: Do longer chapters mean more hits?

The answer is actually pretty complicated. It looks like SHORTER chapters have a higher probability of generating more hits per chapter than LONGER chapters do. However, the correlation metric is pretty low (0.34) so it’s more likely other factors contribute MORE to a work’s hit count than just the length of your chapter. Completion, how long ago it was posted, the type of romance, etc are all correlated with hit count in statistically significant ways.

Does that mean that’s the be all, end all answer? Well, no, actually. Simple regression analysis can’t tell us how much chapter length matters once all those other factors are in play, but machine learning might! It’s on the back burner for the time being, but I’ve been working off and on training a Random Forest algorithm to see if we can determine factors that contribute to hit counts.

(And for the Big Data nerds who are following my blog -- I tried SVR but it was taking upwards of 5 hours to train so I’m swapping to a categorical classifier instead. Haven’t touched it since I gave up on SVR though, so we’ll see what the outcome is at some point in the future.)

Thank you for the ask! <3

9 notes

ao3datafan · 7 years ago

Text

Fluff Versus Angst: Which is More Popular?

Hi Everyone! I just took a look at my asks and I’m so excited to see not only all the new followers but the new questions! You guys are fantastic! The next ask on the list is actually fandom specific, so it’s taking me a little while to query and clean the data, but that one will probably come up soon. I’m out of town on business and pleasure these next two weeks. If you’re in the Washington DC area, Lockea from DataFan is presenting at Anime USA this weekend! Check out “How to Write the Perfect Fanfiction!” Otherwise, unfortunately it could be a little while before I write my next article.

And now, the question I’ve been dying to know the answer to; which is more popular; angsty fics or fluffy fics?

I love that there’s clearly a top 10% of fandom that just dominates everything. There is no reading the statistical distribution of popularity because the number of outliers kills the KDE (the kernel density estimate, i.e. the smoothed distribution curve). So to do this test, I took all data that fell beneath the 90th percentile and analyzed it: about 27,700 fanfics which tagged themselves as fluff and 17,000 fanfics which tagged themselves as angst.

So let’s look at the bottom 90% of the population.

Ahh, much better. We see that fluff has a more even distribution of hits across the fics while angst has a spike. Likewise, fluff fics had a higher average number of hits than angst fics (821 hits to 730 hits). So on average it looks like writing fluff is not only more popular among writers (again, noted by the larger representation in the sample size) but also more popular among readers. But is it significant?

In this case I conducted a Student’s t-test for independent samples. The null hypothesis is that fluff and angst have the same mean hit count. A large p value means we cannot reject that (meaning we have to assume either we don’t have enough data or that angst and fluff are not statistically different from one another in terms of hit counts generated). So what was the p value in this case?

p = 8.096400668874957e-41.

I’d say the t-test conclusively says that Fluff is more popular than Angst.
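For anyone who wants to try this kind of comparison at home, here’s a minimal, dependency-free sketch of the two steps described above: trim each group below its 90th percentile, then compare the means. It makes several assumptions worth flagging: the hit counts are synthetic, the test is Welch’s variant of the t-test (which doesn’t assume equal variances) rather than the classic Student’s version, and the p value uses a normal approximation that is only reasonable for large samples. In a real notebook, `scipy.stats.ttest_ind` would be the usual choice.

```python
import random
from statistics import NormalDist, mean, quantiles, variance


def below_percentile(values, pct=90):
    """Keep only values under the given percentile (drops the top outliers)."""
    cutoff = quantiles(values, n=100)[pct - 1]
    return [v for v in values if v < cutoff]


def welch_t_test(a, b):
    """Welch's t statistic with an approximate two-sided p value.

    The p value comes from a normal approximation, so it is only
    trustworthy when both samples are large.
    """
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    t = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Synthetic, heavy-tailed hit counts (NOT the real AO3 data).
rng = random.Random(42)
fluff_hits = [rng.expovariate(1 / 900) for _ in range(5000)]
angst_hits = [rng.expovariate(1 / 600) for _ in range(5000)]

t, p = welch_t_test(below_percentile(fluff_hits), below_percentile(angst_hits))
print(f"t = {t:.2f}, p = {p:.3g}")
```

A positive t here means the first group (fluff) has the higher mean. With tens of thousands of works, even a modest gap in means produces a tiny p value, so it is worth reporting the means alongside it.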

I guess this angst-loving bunny had better find a hole to hide in. I’m in the minority here. T_T Anyway, if you want a quick boost to your hit count, consider writing fluff next time. Just remember to tag it.

#ao3datafan #archive of our own #viewership #ao3 tags

59 notes

ao3datafan · 7 years ago

Photo

Just for fun -- here’s a word cloud of all the tags in my data set. Also, it’s a stealth sneak preview for the article I’m working on right now.

Shhhh... natural language processing (NLP) is absolutely a part of data science!

#data science #archive of our own #just for fun

36 notes

ao3datafan · 7 years ago

Text

Are Bookmarks an Important Measure of Work Success?

This is one of those questions where the answer is in the eye of the beholder.

When it comes to wanting to know how your fic measures up to others, the stats tend to be everyone’s go-to choice for a performance metric. If you haven’t noticed from my previous posts, I tend to use the hit count as my primary way of measuring works’ popularity against each other. Still, let’s take a moment and look at how the metrics correlate with one another.

Here is the correlation matrix for how Hits, Kudos, Comments, and Bookmarks relate to each other.

While all metrics are positively correlated, it’s interesting to note that comments are the least correlated to the other metrics, with bookmarks being a close second. This means that depending on how you measure success, comments and bookmarks will not necessarily correlate. For example, if I measure success as the number of Bookmarks I receive, then Kudos and Hits are strong indicators as well of my success. However, if I measure comments as an indicator of success, Kudos, Bookmarks, and Hits are weaker indicators of success.
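A correlation matrix like this one is easy to sketch. The version below is a dependency-free stand-in (in a pandas notebook, `DataFrame.corr()` does the same job), and the six works and their stats are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Map every (metric, metric) pair to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Made-up stats for six works (illustrative only, not real AO3 numbers).
works = {
    "hits":      [120, 450, 900, 1500, 3000, 8000],
    "kudos":     [10, 40, 85, 130, 290, 700],
    "comments":  [2, 1, 12, 4, 30, 15],
    "bookmarks": [1, 5, 9, 14, 35, 60],
}

matrix = correlation_matrix(works)
for metric in works:
    print(metric, [round(matrix[(metric, other)], 2) for other in works])
```

Each printed row is one row of the matrix, in the same column order as the `works` dictionary.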

This is visually represented by probably one of my most abused plots -- the heat map.

The warmer the color, the stronger the indicator. Also, rainbows, because I like rainbows.

(I also want you all to know that searching for “rainbow” in tumblr’s gif search is a wild trip.)

All right, enough of that. Back on track. So the correlation plot/matrix can tell us how strongly correlated the metrics are, but they can’t show us what the data looks like. So for that I plotted the regression line for the relationship between hits and bookmarks.

I was somewhat surprised to notice that no clear trends emerge for whether a work is complete or incomplete. But when you think about it, the linearity of the relationship means we’re tracking a ratio that is independent of a work’s completion status, unlike in my last post, where hit counts were dependent on completion status.

Okay, so even though we can see a vaguely linear trend and the correlation score TELLS us the linear relationship is pretty strong (the regression line works out to roughly 1 Bookmark for every 100 hits), how does this distribution play out? Once again, it should come as no surprise that BNFs play by their own rules.

The distribution plot above tells us how much of the data falls into a certain range. To get the distribution, I simply divided the number of hits a work had by the number of bookmarks and plotted it. The closer the number is to one, the more likely someone is to bookmark a fic when they click on it.

Unsurprisingly, BNF works tend not to fall close to the 1 range. In fact, they have worse Hit to Bookmark scores than the average fic. The top 10% of fics tended to average more than 400 hits per bookmark. There’s plenty of reasons for this of course, such as reader re-engagement, but that would take more analysis than I feel like conducting at the moment.

Still, we want to get a sense of the distribution of the data for the bottom 90% of fanfiction -- the average fanwork creator’s domain. So I took works that fell below the 90th percentile and visualized the hit to bookmarks ratio.

Oh good, this looks much better. The distribution is fairly normal with a skew towards the upper end. The stats are pretty good too: For the bottom 90% of fics, the mean is 142 hits/bookmark with a standard deviation of 87 hits/bookmark. That puts one standard deviation either side of the mean at roughly 55 - 230 hits per bookmark, which covers about two thirds of works if the distribution is roughly normal. Remember, the more popular your fic is, generally, the greater the ratio of hits to bookmarks, meaning the FEWER bookmarks you’ll have per hit.
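The hits-per-bookmark calculation can be sketched in a few lines. This version assumes the 90th percentile cutoff is taken on hit counts and that works with zero bookmarks are skipped (both are guesses about the original notebook), and the works listed are hypothetical.

```python
from statistics import mean, quantiles, stdev


def bottom_90_by_hits(works):
    """Drop works at or above the 90th percentile of hit counts."""
    cutoff = quantiles([hits for hits, _ in works], n=10)[-1]
    return [(hits, bookmarks) for hits, bookmarks in works if hits < cutoff]


def hits_per_bookmark(works):
    """Hits divided by bookmarks, skipping works with no bookmarks at all."""
    return [hits / bookmarks for hits, bookmarks in works if bookmarks > 0]


# Hypothetical (hits, bookmarks) pairs; the last one is a BNF-style outlier.
works = [(100, 2), (150, 1), (480, 4), (900, 6), (1200, 10), (1300, 9),
         (2000, 16), (2400, 15), (3100, 20), (40, 0), (250000, 500)]

ratios = hits_per_bookmark(bottom_90_by_hits(works))
avg, spread = mean(ratios), stdev(ratios)
print(f"{avg:.0f} hits per bookmark, standard deviation {spread:.0f}")
```

Leaving out the zero-bookmark works avoids a division by zero, but it also nudges the average down a little, so it is a choice worth stating whenever numbers like these are reported.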

So really when it comes down to it, the answer is in the eye of the user. Are bookmarks really a useful metric for calculating fic stats?

#archive of our own #viewership #data science

36 notes

ao3datafan · 7 years ago

Text

Do Longer Chapters Mean More Hits?

Thanks to several users on Discord for this question, especially Anonymous Lawyer who inspired the question by asking “Who writes the longest fics?”

This is a relatively simple question without a simple answer. To answer it, I plotted the average chapter length versus the hits per chapter. This helped control for multi-chapter stories, which by their very nature tend to get more hits, as readers click on the story to keep reading.

Having already looked at work rating in an earlier post, I decided to control for work completion this time and got some surprising results.

So it looks like there are diminishing returns on chapter length: fics with longer chapters gradually see fewer hits per chapter than fics with shorter chapters. Likewise, there’s a clear trend for fics that are incomplete -- the boundary is lower.

One of the things I love about doing DataFan is I’m actually running into and solving some very unique real world data science challenges, such as how to handle data where there seems to be a clear threshold. That is, we can say that the following generally holds true for the trend:

Hits per chapter is less than or equal to some function of words per chapter.
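One simple way to trace a ceiling like that is to bin works by chapter length and record the best-performing work in each bin. The sketch below does just that; the bin width, the function name, and the data points are all hypothetical.

```python
from collections import defaultdict


def upper_envelope(points, bin_width=1000):
    """Largest hits-per-chapter seen in each words-per-chapter bin.

    points is a list of (words_per_chapter, hits_per_chapter) pairs.
    Returns {bin_start: max_hits_per_chapter}, ordered by bin.
    """
    ceiling = defaultdict(float)
    for words, hits in points:
        bin_start = int(words // bin_width) * bin_width
        ceiling[bin_start] = max(ceiling[bin_start], hits)
    return dict(sorted(ceiling.items()))


# Hypothetical works: (average words per chapter, hits per chapter).
points = [(800, 5200), (950, 300), (1500, 4100), (1900, 120),
          (2600, 2500), (2700, 900), (5400, 700), (5900, 60)]

for bin_start, top in upper_envelope(points).items():
    print(f"from {bin_start} words: at most {top} hits per chapter")
```

A plain maximum is sensitive to a single runaway fic, so on real data a high percentile per bin (say the 95th) would likely give a steadier boundary.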

So what we see is that while there’s some sort of relationship between chapter length and the hits per chapter, there are also other variables clearly in play. If Hit Count could be considered solely dependent on Chapter Length, then we’d see a clear regression curve.

What about the number of chapters a fic has? Does that have an impact?

There are two ways to look at this. The first is to create a bubble plot, where the size of the bubble approximates the number of chapters a fic has. Again, the color represents whether a fic is complete or not.

That’s... really not helpful except to show that incomplete works tend to have more chapters.

Okay, so the other way we can visualize the data is to make it 3 dimensional. For that, I added a third dimension -- number of chapters posted -- to the graph. Unfortunately, 3 dimensional data is not easy to display statically, but you can interact with it if you download the notebook. However, I’ve got a gif below of the data.

Again, there’s no clear relationship between chapter length, hits per chapter, and the number of chapters posted. In the gif above, the green points are incomplete works and the blue points are complete works.

So next I simplified the problem by filtering the dataset down to only completed one-shots, which holds completion status and chapter count constant. That is, I removed all works that were incomplete or had more than one chapter, leaving me with a sample size of about 150,000 works still.

Again, we see that there’s a clear threshold for when chapter lengths don’t seem to generate hit counts, but again it’s not single variable.

It would be disingenuous to say that hit counts are independent of chapter length, but it’s also not true to say that writing short chapters gets you more hits. The data just doesn’t support that.

Rather, other factors contribute to the hit count of a work. So what are those other factors? Well... one thing we can do is calculate correlation. Correlation does not equal causation, as hilarious charts like the one below show:

Usually, you need some domain expertise to decide when a correlation is related to causation. That’s why data scientists are not JUST statisticians and programmers, but also people who have interdisciplinary understanding of the subject matter. If you’re reading this tumblr, chances are YOU are a domain expert in fanfiction, or at least enough of one to help discern correlation and causation from each other.

The graphic above shows the data for one column of the correlation matrix, which helps visualize how different data is correlated to each other. For reference, here is the correlation plot of a limited segment of the data:

Areas that are red show high positive correlation while areas that are blue show high negative correlations. In other words, there’s a positive correlation between the number of chapters posted and the fic’s word count. For our purposes, we want to only look at the first row of the correlation plot -- the hits.

To determine whether something is correlated, and by how much, we need to define a threshold. We’ll pick a threshold of +/- 0.01. So from our list, which data points do we keep? (ID and Work ID are the same thing -- they’re the unique identifier I use for each work)

Completion, ID, Work ID, Gen, F/M, Num Authors, Num Fandoms, Multi, M/M, Average Chapter Length, Num Freeforms, Posted Chapters, Words, Bookmarks and Bookmarks/chapter, Kudos and Kudos per Chapters, Comments and Comments per chapter, Hits per chapter.
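The thresholding step itself is short enough to sketch. The column names below echo the list above, but the values are invented, and `correlated_with` is a hypothetical helper rather than anything from the original notebook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_with(target, columns, threshold=0.01):
    """Columns whose absolute correlation with the target meets the threshold."""
    kept = {}
    for name, values in columns.items():
        r = pearson(target, values)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


# Invented data: hit counts plus three 0/1 pairing flags for six works.
hits = [100, 200, 300, 400, 500, 600]
features = {
    "M/M": [0, 0, 1, 0, 1, 1],
    "Gen": [1, 1, 0, 1, 0, 0],
    "F/F": [1, 0, 0, 0, 0, 1],  # built to have essentially zero correlation
}

for name, r in correlated_with(hits, features).items():
    print(f"{name}: r = {r:+.2f}")
```

A cutoff of 0.01 is very permissive: it keeps almost anything that isn’t flat, so the size of each correlation still matters when deciding what to read into it.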

As domain experts on AO3, the next step is to determine which of those metrics that have a high correlation are actually causation metrics. For example, Hits per chapter is a correlation, but it’s not the cause of higher hit counts. Same with Kudos, Bookmarks, and Comments, which are not factors an author has direct control over.

At first glance, it might be tempting to throw out the ID/Work ID (which are the same thing) but actually the ID can be useful. I use the same IDs that AO3 assigns to a work when it’s posted, meaning that LOWER ID numbers are works that have been on the archive much longer than works with HIGHER ID numbers. So we’ll keep that.

Gen, F/M, Multi, M/M, and F/F are pairing designators. High correlations mean that a pairing designator is associated with a change in hit count, while low correlations mean there is little effect on hit count. We removed F/F as an indicator because it had a low correlation, meaning that people click on a fic whether or not it’s got WLW in it.

On the other hand, people were less likely to click on a fic if it had Gen (no pairings) or F/M pairings in it. M/M fics on the other hand have the highest positive correlation at almost 10x our threshold of significance, meaning it looks like people really do like their M/M.

Average Chapter Length and overall word counts were also significantly correlated with the number of hits. This goes back to what was shown earlier: there is a relationship, but it’s not single factor.

So to answer the question: The average length of your chapter is one of many factors that contribute to your hit count. The others include: completion status, how long ago you posted, pairing types, total word count, and how many chapters you have posted.

#data science #ao3datafan #archive of our own #viewership

251 notes

ao3datafan · 7 years ago

Text

Addendum: Violin Plots that Don’t Look Like Violins

I talked in my last post about Work Rating and Work Viewership and if there was a correlation, and then promptly my sleep deprived brain remembered there were two things I forgot to do: perform a hypothesis test to get the p value of my findings and fix my notebook so I could actually plot the distributions of the data sets.

“Lockea, what the heck am I looking at?” You might be asking. That is a violin plot. It’s like if a box plot and a distribution plot had a baby. Where the shape is wider (here, at the bottom), more of the data falls into that range.

Also, get your mind out of the gutter. I know what you’re thinking.

So basically, this shows in graphical form what the tables I posted on my last post show -- the distribution of the data. A normal violin plot looks more like this:

Then again, AO3 might be the first time I’ve really worked with exponentially distributed data in my life. Geez, the number of outliers is significant. Anyway, you can see there’s a box in the middle showing where approximately 50% of the data falls, with lines showing the 90% range. Above and below that range are outliers.

Our data has a LOT of outliers and they are very, very weirdly distributed.

Which brings me to my next point -- testing for statistical significance (i.e., testing the hypothesis that a work’s rating has no impact on viewership). If you want to know more, feel free to look up “Hypothesis Testing in Statistics” but be warned -- it made MY head hurt and I’m a data scientist.

Anyway, the longest of stories short: the p value is the probability of seeing differences at least as big as the ones in your data if that “no impact” hypothesis were true. The smaller the p value, the harder it is to write the differences off as a fluke. Generally, a p value of less than 0.05 means you reject the “no impact” hypothesis and call the difference statistically significant.

Okay, so remember how we performed the oh so accurate eyeball test on the data last time? If you forgot, here’s the raw data again for comments.

The eyeball test said that it looked like Explicit and Mature rated fics performed better than the other three ratings. I mentioned last time that this was not an accurate test, so let’s check it properly.

So what were the results of the hypothesis test? Well, the Kruskal-Wallis test was performed on all five ratings and returned a p value so close to 0 that it just printed to my screen as 0.0.

Okay, but what about T rated fics versus E rated fics? For that I performed the Mann-Whitney test and got a p value slightly larger than the Kruskal-Wallis test -- 8.550951916627126e-18. That’s 17 zeroes after the decimal point before we even get to a number.

Yeah. Considering we only needed to get below 0.05 to call it statistically significant, I’d say the difference in viewership between Explicit and Teen rated fics is real and not a sampling fluke. One caveat: with a sample this size, even a small difference will come out significant, so the p value tells us that a difference exists, not how big it is.

So with that, my answer from my last post stands -- the hypothesis test backs up what the eyeball test suggested. Work rating does make a difference to viewership on AO3.
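For anyone who wants to try the two tests named above, here is a minimal sketch using scipy. The hit counts are synthetic (log-normal with made-up centers, an assumption chosen only so there is something to detect), and the rating labels are placeholders rather than the real sample. In both tests the null hypothesis is that the groups come from the same distribution, so a small p value is evidence that they differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic hit counts per rating (hypothetical, NOT the real AO3 sample).
# Explicit is given a higher center on purpose so the tests have something to find.
hits = {
    "Not Rated": rng.lognormal(5.9, 1.2, 2000),
    "General": rng.lognormal(6.0, 1.2, 2000),
    "Teen": rng.lognormal(6.0, 1.2, 2000),
    "Mature": rng.lognormal(6.2, 1.2, 2000),
    "Explicit": rng.lognormal(6.5, 1.2, 2000),
}

# Kruskal-Wallis: are all five groups drawn from one distribution?
kruskal = stats.kruskal(*hits.values())

# Mann-Whitney U: the same question for exactly two groups.
mwu = stats.mannwhitneyu(hits["Teen"], hits["Explicit"], alternative="two-sided")

alpha = 0.05
for name, result in [("Kruskal-Wallis", kruskal), ("Teen vs Explicit", mwu)]:
    verdict = "groups differ" if result.pvalue < alpha else "no detectable difference"
    print(f"{name}: p={result.pvalue:.3g} -> {verdict}")
```

Both tests work on ranks rather than raw values, which makes them a sensible fit for data with this many outliers. Neither one says how large the difference is, so it is worth reading the p value next to the medians.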

#data science #ao3datafan #archive of our own #viewership

ao3datafan · 7 years ago

How Does Work Rating Affect Work Reception?

Special thanks to Anonymoss Lawyer on Discord for the question!

Let’s start with something simple. What role does the rating of your work play in the reception of your work?

This is a somewhat open ended question so we’ll do some exploration to answer it. First, let’s talk about the approximate breakdown of fics by rating on AO3. Instead of querying the over three million works on AO3, I queried a sample of 20,000 works, mostly in the anime fandom, for this analysis.

Of the 20,000 works queried, roughly 30% were rated for general audiences and 35% were rated for teen and up. So how did the ratings perform against each other?

In terms of the percentage of bookmarks, explicit, mature, and teen rated works all received a higher percentage of the total bookmarks, meaning that a work with one of those ratings was likely to have more bookmarks than a work with a lower rating. You can read the other three pie charts in similar ways.

But wait! Does that number actually mean anything or is it a sampling error? Hmm... that’s a good point. After all, this pie chart just shows that of all the total bookmarks in the sample, 23.2% of the total went to fics with an explicit rating.
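To make the limitation concrete, here is a small sketch with made-up numbers (the ratings and counts are hypothetical, not the real sample). It compares each rating's share of all bookmarks, which is what a pie chart shows, with its share of all works.

```python
import pandas as pd

# Hypothetical toy sample: 10 works across three ratings.
works = pd.DataFrame({
    "rating": ["General"] * 3 + ["Teen"] * 4 + ["Explicit"] * 3,
    "bookmarks": [2, 3, 5, 1, 4, 6, 9, 10, 20, 40],
})

# Total bookmarks and number of works per rating.
by_rating = works.groupby("rating")["bookmarks"].agg(total="sum", n_works="count")

# The pie chart view: what fraction of ALL bookmarks went to each rating.
by_rating["share_of_bookmarks"] = by_rating["total"] / by_rating["total"].sum()

# The missing context: what fraction of all WORKS carries each rating.
by_rating["share_of_works"] = by_rating["n_works"] / by_rating["n_works"].sum()

print(by_rating)
```

A slice of the pie only says something about a rating when it is read against that rating's share of the works. It also cannot show whether the bookmarks are spread evenly or piled onto a handful of runaway hits.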

One way we can look at the relationship between fic stats and work rating is by looking at the quantiles, mean, and standard deviation. Thankfully, the pandas package does this automatically.

(No, I didn’t do the calculation in Excel, I just copied the dataframe into Excel to make it prettier.)
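For the curious, a summary table like this can be produced with pandas' `describe()`. This is a minimal sketch on a made-up miniature sample. The column names `rating` and `bookmarks` are assumptions for illustration, not the names used in the DataFan notebooks.

```python
import pandas as pd

# Hypothetical miniature sample; the real analysis used about 20,000 works.
works = pd.DataFrame({
    "rating": ["General", "General", "General", "General",
               "Explicit", "Explicit", "Explicit", "Explicit"],
    "bookmarks": [1, 2, 4, 120, 3, 8, 15, 400],
})

# describe() returns count, mean, std, min, the 25%/50%/75% quantiles, and max
# for each rating, all in one table.
summary = works.groupby("rating")["bookmarks"].describe()
print(summary)
```

Even in this toy sample, a single big value per group drags the mean far above the 50% column.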

This could also be represented as a box plot or violin plot, but we’ll be real. I’m lazy and it’s late, so maybe I’ll do an addendum later. In the meantime, let’s talk about what these numbers mean.

You may be familiar with Mean and STD, which describe a normal distribution of the data. That is, the data is very likely to fall near the middle, give or take the standard deviation (STD). So if the data WERE actually normally distributed, then a General Audience work would have about a 68% chance of landing within 32 bookmarks of the mean of 9-10. That works out to a range of roughly -22 to 42 bookmarks. Okay, so that’s a HUGE margin (and nobody gets negative bookmarks)... and also not actually an accurate model of the data. This would be apparent with a box or violin plot but there’s also a way to tell by looking at the quantiles.

The quantiles are the values in the 25%, 50%, and 75% columns up at the top. Though they may look confusing, they aren’t too hard to read either. Basically, for a General Audiences work, 25% of the works had 2 or fewer bookmarks, 50% had 3 or fewer, and 75% had 8 or fewer.

Now wait a second, you may be thinking, The average number of bookmarks per work is clearly 9.7, so shouldn’t the 50% quantile be close to that value? The answer would be yes... if the data were actually following a normal distribution, but look at the max values on the far right. Those are some clear outliers!

So in this case the quantiles actually give us more information by telling us where the median of the data is (the 50% quantile).
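Here is a tiny sketch of that effect using Python's built-in `statistics` module. The ten bookmark counts are invented for illustration, picked so the mean comes out at 9.7 and the median at 3.

```python
import statistics

# Hypothetical bookmark counts for ten works; one runaway hit plays the outlier.
bookmarks = [0, 1, 2, 2, 3, 3, 5, 8, 12, 61]

mean = statistics.mean(bookmarks)      # pulled upward by the outlier
median = statistics.median(bookmarks)  # the 50% quantile

print(f"with outlier:    mean={mean}, median={median}")

# Drop the single outlier: the mean changes a lot, the median stays put.
trimmed = bookmarks[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(f"without outlier: mean={trimmed_mean}, median={trimmed_median}")
```

One work out of ten is enough to more than double the mean, while the median does not move at all.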

Another way we can think of the data presented is as a probability. If you, spiders georg, were to post an explicit work on AO3, you would have approximately a 50% chance of getting at least 8 bookmarks over the life of the story.

Okay, so what does this mean in terms of analyzing if there’s a correlation between fic ratings and bookmarks? Well, we can do a quick compare of the data in a test that’s very accurate indeed*, the eyeball test.

*It’s not.

So by just looking at the data, we see that there’s a skew in the explicit data showing that explicit rated works are more likely to get bookmarks than the other types of fic. This actually matches up pretty well with the pie chart shown up above, but how about Kudos, Comments, and Hits?

So, actually, across the four different metrics, it looks like explicit rated fic tends to perform better.

Huh, what do you know. It looks like the old adage that “sex sells” holds true on Archive of Our Own.

#data science #archive of our own #ao3 #ao3datafan

ao3datafan · 7 years ago

About DataFan

Hi everyone! My name’s Lockea and you’ve found my blog. Congrats! I’m a data scientist and acafan (academic fan, a member of the fandom who also studies fandom). My main acafan project right now is a comprehensive data analysis of what people read and write on AO3.

So what is DataFan? DataFan is a project where I answer questions about fanfiction using data analytic techniques. I answer everything from the easy (want to know what story rating gets the most hits?) to the not so easy (Is there a relationship between the average length of a chapter and how many comments the chapter gets?). If you’ve got a question about AO3, I’ll do my best to answer it.

What about Fandom Stats? Fandom Stats is a great project and I looked at their source code when I was first starting out with DataFan (both projects are written in the same programming language!) but Fandom Stats can only answer simple questions about fandom and can’t perform any deeper statistical analysis. This isn’t a bash on Fandom Stats! It’s a great tool but DataFan was written with the goal of eventually performing complex artificial intelligence and predictive analytics on fandom trends.

Can I ask DataFan a question? Yes! Absolutely! Simply send me an ask or submit a post for questions that exceed ask lengths.

Is DataFan Open Source? Yep! It’s written in Python using all open source tools. Send me an ask or DM and I’ll send you a link to the GitHub repo.

Can I use DataFan for myself? You can use the code I use to pull the data from AO3 for your own use. You can use the Jupyter Notebooks written for DataFan analysis to construct your own queries. You CANNOT use my research directly without citations. If you want to use my research, please contact me so we can get the right citations and licenses figured out (don’t worry, it doesn’t cost money). All that said, unless you’re planning to study or practice data science yourself, you’re better off just asking me a question and I’ll do the querying.

Wait, can you clarify that? If you ask DataFan a question, then I perform the analysis and write up a short article discussing the results of the analysis. This resulting article and data can be cited. For example, you might be writing a paper for school on fandom and want to know something DataFan can answer.

The only time you may want to perform your own analysis and NOT ask me a question is if you are studying data science or practice data science yourself. In that case, you may use the AO3 scraper and write your own queries. Please do not directly copy and paste any work I’ve done (including copying my Jupyter Notebooks) and call it your own, however, as that’s just plain not cool. The exception is if you wish to rewrite a Notebook as a learning experience.

(Basically, don’t be a dick and give credit where credit is due)

What else does DataFan do? Currently, I have a panel at various conventions called “How to Write the Perfect Fanfiction” which is half about my 20+ years of experience as a fanfiction writer and half data-backed silliness about fandom using DataFan’s backend. Want me to come to a convention near you? Let me know! I’m based out of CA but I travel all over the USA. I’m also in the process of writing several peer reviewed articles on data science and fandom studies.
