Building Google’s 1-billion-word language modelling benchmark turned out to be much more involved than expected,
and the resulting corpus contains a large number of duplicate sentences. Removing them cuts the word count from 2.9G to 0.8G.
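The deduplication step can be sketched roughly like this — a minimal order-preserving filter, not the benchmark’s own tooling; the helper names are mine:

```python
def dedup_lines(lines):
    """Drop duplicate sentences, keeping the first occurrence in order.

    `lines` is any iterable of sentence strings (hypothetical input;
    the real corpus is a set of sharded text files).
    """
    seen = set()
    out = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out

def word_count(lines):
    """Whitespace-token word count, the crude measure behind the 2.9G/0.8G figures."""
    return sum(len(line.split()) for line in lines)
```

For a corpus this size you would stream shards and keep only sentence hashes in the `seen` set rather than materialising everything in memory, but the logic is the same.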
OTOH, the checksums confirmed the process was working, even though the line-by-line aggregate produced
mid-way seemed to have a stray double-quotation mark at the beginning of each line.
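Before comparing checksums, those stray marks have to go. A hypothetical one-line fix (the helper name is mine, not from the build scripts):

```python
def strip_leading_quote(line):
    # Remove a single stray double-quotation mark at the start of a line,
    # as seen in the mid-way aggregate; other lines pass through untouched.
    return line[1:] if line.startswith('"') else line
```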
With all that done, the version produced ends up identical (i.e. same md5sum) to the version issued by Kaggle
for their 1 Billion Word Imputation competition.
I should really submit a PR adding a note about this to the corpus’s GitHub README…