### Research Takeaway : Author Outreach!

If you are the author of research for which I've written a 'takeaway' and you would like to comment or clarify - please let me know : I'd be very happy to learn how I can put things right.

This write-up contains my first impressions of the paper : Learning word embeddings efficiently with noise-contrastive estimation - (Mnih and Kavukcuoglu 2013)

### Paper Background

Calculating word embeddings can be very computationally expensive, since evaluating a normalized ('true') distribution over the vocabulary takes O(V) time per prediction, unless a tree-structured output is used, in which case O(log(V)) is possible.
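To make the O(V) cost concrete, here is a minimal sketch (sizes and vectors are made-up illustrations, not from the paper) of why a normalized softmax is expensive : producing a single prediction means scoring, exponentiating, and summing over every one of the V vocabulary words.

```python
import numpy as np

# Hypothetical sizes for illustration only.
V, d = 50_000, 100
rng = np.random.default_rng(0)
W = rng.standard_normal((V, d)) * 0.01   # one output vector per vocabulary word
h = rng.standard_normal(d)               # a context representation

scores = W @ h                           # O(V * d) : one score per word
probs = np.exp(scores - scores.max())    # shift for numerical stability
probs /= probs.sum()                     # the O(V) normalization step NCE avoids
```

Every training step pays that full O(V) sum, which is exactly the cost the techniques below sidestep.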

Instead of going for the full distribution, however, a number of simplifying steps can be made :

• Only ratios are required, so the (expensive) normalizing constants can be discarded

• Full distributions can be approximated by taking a small number of samples : and this works remarkably well.
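The two simplifications above combine into the NCE objective : treat the unnormalized score as if it were already normalized, and learn by classifying the observed word against k sampled 'noise' words. Below is a hedged sketch of that per-token loss (my own reading of the setup, not the paper's code; variable names are mine).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(score_data, scores_noise, q_data, q_noise, k):
    """Per-token NCE loss, sketched from the binary-classification view.

    score_data   : unnormalized log-score s(w, c) of the observed word
    scores_noise : array of k scores for words drawn from the noise dist q
    q_data, q_noise : noise-distribution probabilities of those same words
    """
    # The unnormalized score stands in for the normalized log-probability
    # (the 'discard the normalizing constant' trick above).
    delta_data = score_data - np.log(k * q_data)
    delta_noise = scores_noise - np.log(k * q_noise)
    # Binary classification : real word vs. each of the k noise words.
    return -(np.log(sigmoid(delta_data))
             + np.sum(np.log(sigmoid(-delta_noise))))
```

Each training token now costs O(k) score evaluations instead of O(V), independent of vocabulary size.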

### Surprising stuff

• The major surprise here is that the method works so well

• Particularly since the ‘sophisticated structure’ approaches (such as trees) were thrown out in favour of just processing larger quantities of data more efficiently

• The NCE idea had already been explored in a slightly different context in : Quick training of probabilistic neural nets by importance sampling - (Bengio and Senécal 2003).

• Although NCE is claimed to be constant time, there is a twist. The simple act of picking k weighted noise samples (each from a vocabulary of size V) could easily be an O(log(V)) operation. One way around this is to pick words sequentially (or with a stride, etc.) from the corpus itself, since the corpus has the right unigram statistics by construction.
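The corpus-as-sampler trick in that last point can be sketched as follows (the stride value and function name are my own illustrative choices, not from the paper) : stepping through the token stream gives each word a chance proportional to its frequency, at O(1) per sample.

```python
import random

def noise_samples_from_corpus(corpus_tokens, k, stride=7):
    """Draw k noise words by striding through the corpus itself.

    Corpus token frequencies ARE the unigram distribution, so stepping
    through the corpus from a random start yields correctly weighted
    samples in O(1) each - no O(log(V)) weighted lookup required.
    """
    n = len(corpus_tokens)
    start = random.randrange(n)
    return [corpus_tokens[(start + i * stride) % n] for i in range(k)]
```

A weighted draw over the vocabulary (e.g. via a cumulative-frequency table and binary search) would instead pay the O(log(V)) cost per sample mentioned above.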

• Experimental detail : the vocabulary was thinned out by rejecting any word with fewer than 10 occurrences
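That thinning step is a standard frequency cutoff; a minimal sketch (the helper name is mine, the <10 cutoff is the paper's) :

```python
from collections import Counter

def thin_corpus(tokens, min_count=10):
    """Drop every word occurring fewer than min_count times in the corpus."""
    counts = Counter(tokens)
    return [w for w in tokens if counts[w] >= min_count]
```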

### Ideas to Follow-Up

Multiple representations for different senses of words.

The basis of this work is analysed and reconstructed within a more general setting in the GloVe paper : essentially the same thing can be computed more generally, faster, and with OSS code.