Sophie (not_unwise) wrote 2010-11-29 03:07 pm
Omnomnom stats
Or, Sophie Got Bored
Sometimes, I look at my Maths/Stats books and end up wondering why the hell I am studying this again. When this happens, I remind myself that I do love stats and maths by making fun, simple stats out of fun things.
This time, I decided to make a statistical analysis of the length of Yuletide fanfictions in 2009.
(Note: I am French and am studying in French, so the vocabulary specific to mathematics may be wrong, sometimes, because I'll be using the French words by accident.)
There were 2732 fics submitted for Yuletide 2009. Making statistics about 2732 values would be long and I am a lazy person. My solution to this problem was to analyse two sets of data. First, the length of the fics by one-thousand-word intervals, and second, the length of a random sample of the 2009 fics.
1) Intervals
Intervals are mostly useful in this situation because you can make pretty graphs out of the data collected. We're in a deeply asymmetric situation with more than one outlier. Two people wrote over 30,000 words.
The detailed data of the number of fics in each thousand-word interval can be found here, in the first spreadsheet.
And since intervals are useful for the sake of pretty graphs, I made pretty graphs (except that they're not pretty, because, once again, I am a lazy person).

Every bar represents the number of fics of a certain length. The first bar on the left represents fics between 0k and 1k. The last bar on the right represents fics between 43k and 44k. The highest bar is, predictably, the "between 1k and 2k" one, as more than half the fanfictions are in that interval.
Random fact: only 3% of the fanfictions were over 10,000 words.
The average calculated from the midpoint of each interval is 2.86k. This average is being pulled up by the outliers, and this is made even more obvious if we make a box plot out of the data.
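For the curious, here is a minimal sketch of how an average can be estimated from interval counts, assuming each fic sits at the midpoint of its thousand-word interval. The function name `binned_mean` and the counts in `toy_counts` are made up for illustration; they are not the Yuletide data from the spreadsheet.

```python
def binned_mean(counts, width=1000):
    """Estimate a mean from per-interval counts.

    counts[i] is the number of values in [i * width, (i + 1) * width).
    Every value is assumed to sit at the midpoint of its interval.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no values to average")
    weighted = sum(n * (i + 0.5) * width for i, n in enumerate(counts))
    return weighted / total


# Hypothetical toy counts for the intervals 0k-1k, 1k-2k, 2k-3k, 3k-4k.
toy_counts = [1, 5, 3, 1]
toy_mean = binned_mean(toy_counts)
print(toy_mean)
```

Because every fic is moved to the middle of its interval, the result is only an approximation of the true average.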
So, now, have a box plot (can you tell it was drawn in a minute?):

(Yes, it includes the outliers. Have I mentioned I am lazy?)
The information for this box plot (in number of words):
Min: 308
1st quartile: 1294
2nd quartile (median): 1870
3rd quartile: 3175
Max: 43,027
This means that:
-Half the fics were under 1870 words.
-Half the fics were between 1294 and 3175 words.
-A fourth of the fics were under 1294 words and a fourth over 3175.
-The ridiculously far outliers are pulling the average with them.
2) Random sample
I wanted a random sample with 100 fic lengths, so I decided to take the first 100 fics in alphabetical order of the writers. It is not a perfect selection method or anything, but it's still an unbiased random sample, whereas choosing 100 fics after ordering them by date wouldn't have been, since the data wouldn't have been independent.
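For comparison, a simple random sample could also be drawn with a random number generator. This is not the method used here, only a sketch of the alternative. The helper `simple_random_sample` and the stand-in list `all_lengths` are hypothetical; the real word counts would go in its place.

```python
import random


def simple_random_sample(values, k=100, seed=None):
    """Draw k values without replacement, each equally likely to be picked."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(values, k)


# Hypothetical stand-in for the 2732 word counts (NOT the real data).
all_lengths = list(range(300, 300 + 2732))
sample = simple_random_sample(all_lengths, k=100, seed=2009)
print(len(sample))
```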
This data is collected here too, for the interested, in the second spreadsheet.
The average of the sample is x=2901.75.
The standard deviation calculated considering this is a sample (s and not σ -- so with a division by [n-1]) is s=2686.6759
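To make the s versus σ distinction concrete, here is a small sketch using Python's `statistics` module. The list `toy_lengths` is a made-up example, not the sample from the spreadsheet.

```python
import statistics

# Hypothetical toy values, only to show the two formulas side by side.
toy_lengths = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = statistics.mean(toy_lengths)     # sample average
s = statistics.stdev(toy_lengths)        # sample version: divides by n - 1
sigma = statistics.pstdev(toy_lengths)   # population version: divides by n

print(x_bar, round(s, 4), sigma)
```

The n - 1 version is always a little larger, which is the point: it compensates for the sample average being used in place of the unknown true average.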
Knowing these two things means it's now possible to calculate the average length of the Yuletide fics with a confidence interval (by approximating the sample average with a normal distribution, N(2901.75; 268.67), where 268.67 = s/√n is the standard error for n = 100).
The average length of the 2009 Yuletide fics was 2901.75 ± 526.59 words, 95% of the time.
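The arithmetic behind that ± can be sketched as follows, using the sample figures quoted above and assuming the usual z = 1.96 for a 95% normal interval (the variable names are mine):

```python
import math

n = 100                   # sample size
sample_mean = 2901.75     # sample average from the post
sample_sd = 2686.6759     # s, computed with the n - 1 divisor
z = 1.96                  # assumed 95% quantile of the standard normal

standard_error = sample_sd / math.sqrt(n)
margin = z * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

print(f"{sample_mean:.2f} ± {margin:.2f} -> [{lower:.2f}, {upper:.2f}]")
```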
This means that my sample allows me to say that I am 95% sure that the real average of the fics' length is between 2375 and 3428 words. To have an average with a confidence interval of ± 400 words (95% of the time, still) I would need a sample of 174 fics. I am not going to do this. We will have to satisfy ourselves with an average that vague.
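The 174 comes from solving the margin-of-error formula for n. A minimal sketch, assuming z = 1.96 and that the standard deviation of a bigger sample would stay close to s (the function name is hypothetical):

```python
import math


def required_sample_size(sd, margin, z=1.96):
    """Smallest n for which z * sd / sqrt(n) is at most the wanted margin."""
    return math.ceil((z * sd / margin) ** 2)


needed = required_sample_size(sd=2686.6759, margin=400)
print(needed)
```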
Edit: That was a lie!
I hope at least one stats-geek read through this and was mildly interested.
(Edited on December 6th because of a mistake in the way I calculated the confidence interval. Maybe the fact I caught it shows I'll pass my next stats exam.)
no subject
Conclusion: feeling guilty is silly?
no subject
You should probably study, then :c
no subject
This is Alley from lj
(Anonymous) 2010-11-29 08:47 pm (UTC)
Yay, I iz average! (as opposed to normal LOL)
This is kewl, thanks for sharing!
no subject
However, 632 fics were under 1000 words for Yuletide Madness. These were not included in these stats.
The 18 that were under 1000 words are technically not under 1000 for Yuletide. I would imagine most of those were edited down to that amount. If they had to be over 1000 words, there's no other explanation that I can think of.
On the other hand, there were only 11 Yuletide Madness fics that were over 1000 words.
So, everyone wrote at least 1000 words for Yuletide. But only a few people wrote at least 1000 words for Madness.
Geek me.
no subject
:-)
no subject
Also, yay maths!
no subject
ALSO OMG AN ALOT.
I love alots a lot.
no subject
(Anonymous) 2010-11-29 10:41 pm (UTC)
Alianne
no subject
Also, as penitence for asking for this:
I made a box and whiskers plot with outliers!
derp derp derp
The left edge of the gray area is zero; the right side is 45,000 words. All fics over 5997 words were outliers. This graph is scaled to 5,000 words:1 inch.
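The 5997-word cutoff matches the usual 1.5 × IQR rule applied to the quartiles given in the post. That this rule is the one used for the plot is an assumption; the sketch below just shows that the numbers line up.

```python
# Quartiles reported in the post (in words).
q1 = 1294
q3 = 3175

iqr = q3 - q1                   # interquartile range
upper_fence = q3 + 1.5 * iqr    # values above this count as outliers
lower_fence = q1 - 1.5 * iqr    # negative here, so no fic can be a low outlier

print(iqr, upper_fence, lower_fence)
```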
Re: derp derp derp
Yes.
Re: derp derp derp
Yes.
Re: derp derp derp
Love.
no subject
Also I like making stats, especially when you're the one asking.
no subject
mysql> select avg(word_count)
    -> from works
    -> inner join collection_items
    ->   on collection_items.item_id = works.id
    ->   and collection_items.item_type = 'Work'
    -> inner join collections
    ->   on collection_items.collection_id = collections.id
    -> where collections.name = 'yuletide2009';
+-----------------+
| avg(word_count) |
+-----------------+
| 2766.8364 |
+-----------------+
So you were right about where the average would fall! (Although this is just the stories in Yuletide 2009, not Yuletide Madness.) :D
no subject
Not that I have anything to do with this considering it was a... random... sample...
Maybe it means I'm good at random.
no subject
You PWN my pretentious university in matters of awesome.
no subject
and stop messing with me with your word count :