The "deep" in deep learning refers to hierarchical layers of representations (note: you can do "deep learning" without neural networks).
Word embeddings produced by skip-gram or CBOW are a shallow method (a single-layer representation). Remarkably, in order to stay interpretable, word embeddings have to be shallow: if you distributed the predictive task (e.g. skip-gram) over several layers, the resulting geometric spaces would be much less interpretable.
So: this is not deep learning, and this not being deep learning is in fact the core feature.
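To make the "single-layer" point concrete, here is a minimal skip-gram sketch in plain numpy (illustrative only, not the word2vec implementation, and using full softmax rather than negative sampling): the "forward pass" is just a row lookup in one embedding matrix, followed by a linear output layer, with no hidden nonlinearity anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8            # vocab size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))   # word embeddings (the output we keep)
W_out = rng.normal(0, 0.1, (D, V))  # context-prediction weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr, window = 0.1, 2
for _ in range(200):
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            c = idx[corpus[j]]
            h = W_in[idx[w]]             # "forward pass" is a single row lookup
            p = softmax(h @ W_out)       # predict the context word
            grad = p.copy()
            grad[c] -= 1.0               # cross-entropy gradient
            W_out -= lr * np.outer(h, grad)
            W_in[idx[w]] -= lr * (W_out @ grad)

# After training, the rows of W_in are the word vectors.
print(W_in[idx["fox"]].shape)
```

Everything between the input word and the prediction is linear, which is exactly why the resulting space supports the interpretable geometric structure described above.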
I'm uncertain whether you mean the algorithm or the output; the question is interesting either way.
The most common method for producing word vectors, skip-gram with negative sampling, has been shown to be equivalent to implicitly factorizing a word-context matrix whose cells are shifted pointwise mutual information values [1]. A related algorithm, GloVe, achieves a similar result by factorizing a word-word co-occurrence matrix directly [2].
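The factorization view from [1] can be made explicit: build the word-context co-occurrence matrix, convert it to shifted positive PMI, and factorize it with SVD. This sketch is an illustration of that equivalence, not the SGNS algorithm itself:

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, window, k = len(vocab), 2, 5     # k plays the role of the negative-sample count

# Word-context co-occurrence counts within a small window.
counts = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[idx[w], idx[corpus[j]]] += 1

total = counts.sum()
pw = counts.sum(axis=1) / total     # word marginals
pc = counts.sum(axis=0) / total     # context marginals
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / np.outer(pw, pc))
sppmi = np.maximum(pmi - np.log(k), 0)   # shifted positive PMI, as in [1]

# Low-rank factorization: rows of word_vecs are the embeddings.
U, S, Vt = np.linalg.svd(sppmi)
d = 4
word_vecs = U[:, :d] * np.sqrt(S[:d])
print(word_vecs.shape)
```

The point is that nothing in the pipeline is "deep": it is one co-occurrence matrix and one linear factorization.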
You can also view the output as an embedding in a high-dimensional space (hence the name word vectors), but more surprisingly you can learn a linear mapping between the vector spaces of two languages, which makes it immediately useful in translation. From [3]: "Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish".
[3]: "Exploiting Similarities among Languages for Machine Translation" - page 2 has an intuitive 2D graphical representation http://arxiv.org/pdf/1309.4168.pdf
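The linear-mapping idea from [3] fits in a few lines: given vectors for a small seed dictionary of translation pairs (x_i in the source language, z_i in the target), learn a matrix W minimizing the sum of ||W x_i - z_i||^2 by ordinary least squares. The data below is synthetic for self-containedness; real use would take word2vec vectors for, say, English/Spanish word pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs = 10, 8, 50

# Synthetic stand-ins for the two embedding spaces.
X = rng.normal(size=(n_pairs, d_src))          # source-language vectors
W_true = rng.normal(size=(d_tgt, d_src))
Z = X @ W_true.T + 0.01 * rng.normal(size=(n_pairs, d_tgt))  # target vectors

# Least-squares fit of the linear map: solves X @ W ~= Z.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)      # W has shape (d_src, d_tgt)

# "Translate" a new source word: map it into the target space,
# then (in real use) look up its nearest neighbour there.
x_new = rng.normal(size=d_src)
z_pred = x_new @ W
print(z_pred.shape)
```

The 2D figure on page 2 of [3] is exactly this picture: the two spaces are roughly rotations/scalings of each other, so a single linear map lines them up.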
I think people see it as "deep" (even though it's not!) due to the representation learning component. Word2vec is often used as PART of a deep neural network for feature engineering.
- using layers of random forests (trained successively rather than end-to-end). Random forests are commonly used for feature engineering in a stack of learners.
- unsupervised deep learning with modular-hierarchical matrix factorization, over matrices of mutual information of the variables in the previous layers (something I've personally worked on; I'd be happy to share more details if you're interested).
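The first idea above (successively trained layers of random forests) can be sketched as plain stacking with sklearn. This is only an illustration of the general pattern, not the specific methods mentioned; a real implementation would use out-of-fold probabilities at each layer to avoid leakage.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

feats_tr, feats_te = X_tr, X_te
for layer in range(3):
    rf = RandomForestClassifier(n_estimators=50, random_state=layer)
    rf.fit(feats_tr, y_tr)                 # each layer trained in turn, not end-to-end
    score = rf.score(feats_te, y_te)
    # The next layer sees the original features plus this layer's
    # predicted class probabilities -- the "hierarchical representation".
    feats_tr = np.hstack([X_tr, rf.predict_proba(feats_tr)])
    feats_te = np.hstack([X_te, rf.predict_proba(feats_te)])

print(round(score, 2))
```

Whether this buys much over a single forest is exactly the question raised in the reply below.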
Not particularly. What you want from neural networks here is the nonlinear transforms you can apply to the data. There are definitely things to try in this direction, though gradient-boosted trees and other attempts to augment random forests are already pretty mainstream.
Nit: gradient boosting isn't an 'augmentation' of random forests - if anything, it's the other way round. AdaBoost is from 1995, the GBM paper was 1999, and Breiman's random forest paper in 2001 explicitly couches it as an enhancement to AdaBoost.