Modelling the Web
Possible models:
Document sizes: "Heavy tail" distribution
The vast majority of documents are small, but large documents are still numerous enough to matter
Vocabulary: grows sublinearly with collection size (Heaps' law)
Word frequencies: generalized Zipf's law
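
The last two models can be illustrated together with a small simulation. The sketch below is not taken from the slides: it assumes the usual forms of the laws, namely that the word of rank r has frequency proportional to 1 / r**theta (generalized Zipf) and that the vocabulary after n words is roughly V(n) = K * n**beta with 0 < beta < 1 (Heaps). The names theta, beta, K, the vocabulary size and the sample sizes are illustrative choices, and the two-point slope is only a crude estimate of beta.

```python
import math
import random
from collections import Counter
from itertools import accumulate


def sample_words(n_words, vocab_size, theta=1.0, seed=0):
    """Draw word ranks 1..vocab_size with P(rank r) proportional to 1 / r**theta."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    cum_weights = list(accumulate(1.0 / r ** theta for r in range(1, vocab_size + 1)))
    return rng.choices(range(1, vocab_size + 1), cum_weights=cum_weights, k=n_words)


def vocabulary_growth(words):
    """Return V(n), the number of distinct words among the first n, for every n."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


def heaps_exponent(growth, n_small, n_large):
    """Crude two-point estimate of beta: slope of log V(n) against log n."""
    v_small, v_large = growth[n_small - 1], growth[n_large - 1]
    return math.log(v_large / v_small) / math.log(n_large / n_small)


# Assumed, illustrative parameters (not from the slides).
words = sample_words(n_words=50_000, vocab_size=100_000, theta=1.0)
growth = vocabulary_growth(words)
beta = heaps_exponent(growth, 5_000, 50_000)
top_ranks = [rank for rank, _ in Counter(words).most_common(3)]

print(f"distinct words after {len(words)} tokens: {growth[-1]}")
print(f"estimated Heaps exponent beta: {beta:.2f}")
print(f"three most frequent ranks: {top_ranks}")
```

An estimated beta below 1 means the vocabulary keeps growing as more text is read, but more slowly than the text itself. On a real collection, the same two functions could be applied to a tokenized word list instead of the sampled ranks.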