not_unwise: ([peter] i'm batman)
Sophie ([personal profile] not_unwise) wrote 2010-11-29 03:07 pm

Omnomnom stats

Yuletide 2009 Statistical Analysis of Fics' Lengths
Or, Sophie Got Bored


Sometimes I look at my Maths/Stats books and end up wondering why the hell I am studying this again. When this happens, I remind myself that I do love stats and maths by making fun, simple stats out of fun things.

This time, I decided to make a statistical analysis of the length of Yuletide fanfictions in 2009.

(Note: I am French and am studying in French, so the vocabulary specific to mathematics may sometimes be wrong, because I may use the French words by accident.)

There were 2732 fics submitted for Yuletide 2009. Computing statistics on 2732 values would take a long time, and I am a lazy person. My solution to this problem was to analyse two sets of data: first, the length of the fics by one-thousand-word intervals, and second, the length of a random sample of the 2009 fics.


1) Intervals

Intervals are mostly useful in this situation because you can make pretty graphs out of the data collected. We're in a deeply asymmetric situation with more than one outlier. Two people wrote over 30,000 words.

The detailed data of the number of fics in each thousand-word interval can be found here, in the first spreadsheet.

And since intervals are useful for the sake of pretty graphs, I made pretty graphs (except that they're not pretty, because, once again, I am a lazy person).



Every bar represents the number of fics of a certain length. The first bar on the left represents fics between 0k and 1k. The first bar on the right represents the number of fics between 43k and 44k. The highest bar is, predictably, the "between 1k and 2k" one, as more than half the fanfictions are in that interval.

Random fact: only 3% of the fanfictions were over 10,000 words.

The average calculated using the midpoint (the average) of each interval is 2.86k. This average is being pulled up by the outliers, and this is made even more obvious if we make a box plot out of the data.
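For anyone curious how an average gets computed from intervals rather than raw values, here is a minimal sketch. It assumes every fic in an interval is counted at the interval's midpoint (500 words for 0k to 1k, 1500 for 1k to 2k, and so on). The function name `grouped_mean` and the toy counts are made up for illustration; they are not the real spreadsheet data.

```python
def grouped_mean(counts, width=1000):
    """Estimate a mean from interval counts.

    counts[i] is the number of fics whose length falls in
    [i * width, (i + 1) * width). Each fic is treated as if its
    length were exactly the midpoint of its interval.
    """
    total = sum(counts)
    weighted = sum(n * (i * width + width / 2) for i, n in enumerate(counts))
    return weighted / total


# Made-up counts for the 0-1k, 1k-2k and 2k-3k intervals (NOT the real data).
toy_counts = [1, 2, 1]
toy_mean = grouped_mean(toy_counts)
print(toy_mean)
```

Because the midpoint stands in for every value in an interval, the result is only an approximation of the true average.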

So, now, have a box plot (can you tell it was drawn in a minute?):


(Yes, it includes the outliers. Have I mentioned I am lazy?)

The information for this box plot (in number of words):
Min: 308
1st quartile: 1294
2nd quartile (median): 1870
3rd quartile: 3175
Max: 43,027

This means that:
-Half the fics were under 1870 words.
-Half the fics were between 1294 and 3175 words.
-A fourth of the fics were under 1294 words and a fourth over 3175.
-The ridiculously far outliers are pulling the average with them.
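To put a number on "ridiculously far", here is a small sketch that assumes the common box plot convention of flagging anything more than 1.5 times the interquartile range beyond a quartile. The post doesn't say which rule its box plot uses, so the rule is an assumption; the quartiles are the ones listed above.

```python
# Quartiles reported in the post (in words).
q1, q3 = 1294, 3175

# Interquartile range: the width of the middle half of the data.
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from a quartile are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr, lower_fence, upper_fence)
```

Under that assumption the upper fence is 5996.5 words, which lines up with the 5997-word cutoff quoted for the box plot in the comments. The lower fence is negative, so no fic can be an outlier on the short side.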


2) Random sample

I wanted a random sample of 100 fic lengths, so I decided to take the first 100 fics in alphabetical order of the writers. It is not a perfect selection method or anything, but as long as a writer's name has nothing to do with how much they write, it's still an unbiased sample, whereas choosing 100 fics after ordering them by date wouldn't have been, since the data wouldn't have been independent.

This data is collected here too, for the interested, in the second spreadsheet.

The average of the sample is x=2901.75.
The standard deviation calculated considering this is a sample (s and not σ -- so with a division by [n-1]) is s=2686.6759
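The difference between s and σ is just the divisor, and a quick sketch shows it. This uses Python's standard `statistics` module, where `stdev` divides by n-1 and `pstdev` divides by n. The four lengths in `toy_lengths` are invented for illustration and are not taken from the sample.

```python
import statistics

# Invented word counts, NOT the real sample of 100 fics.
toy_lengths = [1000, 2000, 3000, 6000]

x_bar = statistics.mean(toy_lengths)        # sample average
s = statistics.stdev(toy_lengths)           # sample std dev, divides by n - 1
sigma_like = statistics.pstdev(toy_lengths) # population formula, divides by n

print(x_bar, round(s, 2), round(sigma_like, 2))
```

Dividing by n-1 always gives a slightly larger value than dividing by n, which compensates for estimating the spread from a sample instead of the whole population.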

Knowing these two things means it's now possible to estimate the average length of the Yuletide fics with a confidence interval (by approximating the sample average as normally distributed, N(2901.75; 268.67), where 268.67 = s/√100 is the standard error of the average).

The average length of the 2009 Yuletide fics was 2901.75 ± 526.59 words, with 95% confidence.
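The arithmetic behind that ± can be reproduced in a few lines. This is a sketch using the sample figures above and z = 1.96, the usual rounded multiplier for a 95% normal interval; the 526.59 is consistent with that value, though the post doesn't state it explicitly.

```python
import math

x_bar = 2901.75   # sample average from the post
s = 2686.6759     # sample standard deviation from the post
n = 100           # sample size
z = 1.96          # assumed multiplier for a 95% normal interval

standard_error = s / math.sqrt(n)   # spread of the sample average
margin = z * standard_error         # half-width of the interval
low, high = x_bar - margin, x_bar + margin

print(round(standard_error, 2), round(margin, 2), round(low), round(high))
```

The standard error comes out at about 268.67 and the margin at about 526.59, giving an interval of roughly 2375 to 3428 words.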

This means that my sample allows me to say that I am 95% sure that the real average of the fics' length is between 2375 and 3428 words. To get the margin of error down to ±400 words (still at 95%), I would need a sample of 174 fics. I am not going to do this. We will have to be satisfied with an average that vague.
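The 174 comes from turning the margin formula around and solving for n. A minimal sketch, assuming the target is a margin of ±400 words, z = 1.96, and that the standard deviation of a bigger sample would stay close to the s measured here:

```python
import math

s = 2686.6759        # sample standard deviation from the post
z = 1.96             # assumed multiplier for a 95% normal interval
target_margin = 400  # desired half-width of the interval, in words

# margin = z * s / sqrt(n)  =>  n = (z * s / margin) ** 2, rounded up
n_needed = math.ceil((z * s / target_margin) ** 2)

print(n_needed)
```

The formula gives a little over 173, and since a sample can't contain a fraction of a fic, it rounds up to 174.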

Edit: That was a lie! [personal profile] astolat went and found the actual average. µ = 2766.8364. My sample was apparently good enough!


I hope at least one stats-geek read through this and was mildly interested.

(Edited on December 6th because of a mistake in the way I calculated the confidence interval. Maybe the fact I caught it shows I'll pass my next stats exam.)
oxoniensis: text: godwottery (in gold on a plain plum coloured background) (words: it's a good one)

[personal profile] oxoniensis 2010-11-29 08:17 pm (UTC)(link)
At least one stats-geek read it, and as well as enjoying the stats, was rather chuffed to find her stories above average length!
shetiger: (WC Peter smiles)

[personal profile] shetiger 2010-11-29 08:25 pm (UTC)(link)
...this actually makes me feel a lot more comfortable about the thought that the story I'm writing may very well fall between the 1-2k range. Thank you!

[identity profile] myr-soleil.livejournal.com 2010-11-29 08:31 pm (UTC)(link)
This is awesome! I always love posts like this. And I feel way, way better about my two 6k fics! For some reason I thought all yuletide fics were much longer than 1k.
mscongeniality: (Default)

[personal profile] mscongeniality 2010-11-29 08:34 pm (UTC)(link)
Wow. Interesting and I feel guilty about not having done my Statistics assignment. A dual purpose entry!
mscongeniality: (Default)

[personal profile] mscongeniality 2010-11-30 04:00 am (UTC)(link)
Ordinarily, I'd agree with you. Unfortunately, if I don't pass Statistics this time, I'm pretty much never going to get my degree. :(
sophia_sol: Text saying "fascinating" with the Star Trek logo beneath it (ST: fascinating)

[personal profile] sophia_sol 2010-11-29 08:39 pm (UTC)(link)
Ooh, this is fascinating! I never liked stats class, but statistics themselves can be really interesting -- as this is!

This is Alley from lj

(Anonymous) 2010-11-29 08:47 pm (UTC)(link)
Agh this makes me miss stats so much! Awesome post <3
hardboiledbaby: (hardboiledbaby)

[personal profile] hardboiledbaby 2010-11-29 08:48 pm (UTC)(link)
the real average of the fics' length is between 2132 and 3671 words

Yay, I iz average! (as opposed to normal LOL)
This is kewl, thanks for sharing!
rebecca2525: Abby Sciuto from NCIS with the word "geek" (Default)

[personal profile] rebecca2525 2010-11-29 08:50 pm (UTC)(link)
WHEEEE! I LOVE YOU! *noms stats*
lds: my tattoo (Default)

[personal profile] lds 2010-11-29 08:54 pm (UTC)(link)
Stats, you gotta love them.
wneleh: by Mirnell (Default)

[personal profile] wneleh 2010-11-29 08:59 pm (UTC)(link)
Holy Poisson distribution, Batman!
ancarett: (PhD Tricks)

[personal profile] ancarett 2010-11-29 09:01 pm (UTC)(link)
Fun with numbers! I was always wondering how the yuletide treats would skew the distribution but the possibility of writing shorter than 1K fics doesn't appear to have had a major impact.

[personal profile] jozpierce 2010-11-30 08:18 pm (UTC)(link)
To be technical, everyone wrote 1000+ words for Yuletide. You can't upload if it's under 1000 words.

However, 632 fics were under 1000 words for Yuletide Madness. These were not included in these stats.

The 18 that were under 1000 words are technically not under 1000 for Yuletide. I would imagine most of those were edited down to that amount. If they had to be over 1000 words, there's no other explanation that I can think of.

On the other hand, there were only 11 Yuletide Madness fics that were over 1000 words.

So, everyone wrote at least 1000 words for Yuletide. But only a few people wrote at least 1000 words for Madness.

Geek me.

[personal profile] jozpierce 2010-12-01 03:50 am (UTC)(link)
Oh, the mystery that is AO3... Can we file this under "Yuletide Magic"?

franzeska: (Default)

[personal profile] franzeska 2010-11-29 09:03 pm (UTC)(link)
Very interesting! I wonder if there's any relationship between length and recs, comments, hits, hard to measure factors of appreciation, etc.
isis: (geeky)

[personal profile] isis 2010-11-29 09:29 pm (UTC)(link)
As the margins of this post are not sufficient for me to calculate this, it is left as an exercise for the reader.

:-)
james: (rodney)

[personal profile] james 2010-11-29 09:36 pm (UTC)(link)
This is really interesting and cool! Just last week I was angsting over how long my story should be as compared to everyone else's story and this helps me realise that most stories are about the same length as what I've written. So I don't feel badly about 'shorting' my recipient.

Also, yay maths!
heeroluva: (Default)

[personal profile] heeroluva 2010-11-29 09:36 pm (UTC)(link)
This makes me want to write something long now, just to be different. ;)
ankaret: Picture of woman with a cat (alot)

[personal profile] ankaret 2010-11-29 10:08 pm (UTC)(link)
Wow, thank you for doing this. Fascinating!

[personal profile] scissorphishe 2010-11-29 10:32 pm (UTC)(link)
I must confess I had a terrible time with my stats class in high school, but it was worth it to be able to understand this post! :)

(Anonymous) 2010-11-29 10:41 pm (UTC)(link)
This is absolutely wonderful and just happy-making to read. I love it. And I failed stats the first time through.

Alianne
robespierre: (Default)

[personal profile] robespierre 2010-11-29 11:02 pm (UTC)(link)
Sophiiiiie. These are the best stats. I am so pleased.

Also, as penitence for asking for this:



I made a box and whiskers plot with outliers!
robespierre: (Default)

derp derp derp

[personal profile] robespierre 2010-11-29 11:04 pm (UTC)(link)
Notes on the features of this plot:

The left edge of the gray area is zero; the right side is 45,000 words. All fics over 5997 words were outliers. This graph is scaled to 5,000 words:1 inch.
robespierre: (Default)

Re: derp derp derp

[personal profile] robespierre 2010-11-30 12:20 am (UTC)(link)
In box and whisker plots, I think, a datum counts as an outlier if it is more than 1.5 times the interquartile range away from the upper or lower quartile.

Yes.
robespierre: (Default)

Re: derp derp derp

[personal profile] robespierre 2010-11-30 12:26 am (UTC)(link)
They're my favorite because I am a social scientist ish at heart. They are not hardcore maths at all. Also, they are so pretty!

Love.
robespierre: (Default)

[personal profile] robespierre 2010-11-30 12:36 am (UTC)(link)
I spent an hour collecting data and drawing that. HUMANITIIIIIIIES.
tagalongcookies: (Default)

[personal profile] tagalongcookies 2010-11-30 12:37 am (UTC)(link)
I wrote like 9k last year...apparently I was one of the few! Still, as a fellow geek this was fascinating, ty bb \o/
astolat: lady of shalott weaving in black and white (Default)

[personal profile] astolat 2010-11-30 01:09 am (UTC)(link)
Heeee, this made me curious, so I poked the db for the yuletide2009 collection and got:

mysql> select avg(word_count) from works inner join collection_items on collection_items.item_id = works.id and collection_items.item_type = 'Work' inner join collections on collection_items.collection_id = collections.id where collections.name = 'yuletide2009';
+-----------------+
| avg(word_count) |
+-----------------+
|       2766.8364 |
+-----------------+

So you were right about where the average would fall! (Although this is just the stories in Yuletide 2009, not Yuletide Madness.) :D
quillori: dice (theme: chance)

[personal profile] quillori 2010-11-30 08:48 am (UTC)(link)
This is cool! Thanks for doing this, it's very interesting (and quite comforting - like a number of other commenters I was convinced everyone else was writing much longer stuff than me).
blueyeti: A grinning blue Yeti with sneaky claws popping up. (Default)

[personal profile] blueyeti 2010-11-30 11:11 am (UTC)(link)
Hee. This is awesome. I love when maths and stats are applied to fandom. It proves that fandom is bigger than creative writing degrees.

You PWN my pretentious university in matters of awesome.
imnotcaroline: (Default)

[personal profile] imnotcaroline 2010-12-06 02:36 am (UTC)(link)
hahaha, you dork :3










and stop messing with me with your word count :
imnotcaroline: (Default)

[personal profile] imnotcaroline 2010-12-06 02:41 am (UTC)(link)
did you bust 3k yet? :O