The "deep" in deep learning refers to hierarchical layers of representations (note: you can do "deep learning" without neural networks).
Word embeddings produced by skip-gram or CBOW are a shallow method (a single-layer representation). Remarkably, in order to stay interpretable, word embeddings have to be shallow: if you distributed the predictive task (e.g. skip-gram) over several layers, the resulting geometric spaces would be much less interpretable.
So: this is not deep learning, and this not being deep learning is in fact the core feature.
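To make the "single-layer" point concrete, here is a minimal skip-gram sketch in plain numpy (illustrative only, not the word2vec implementation, and using full softmax rather than negative sampling): the "forward pass" is just a row lookup in one embedding matrix, followed by a linear output layer, with no hidden nonlinearity anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8            # vocab size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))   # word embeddings (the output we keep)
W_out = rng.normal(0, 0.1, (D, V))  # context-prediction weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr, window = 0.1, 2
for _ in range(200):
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            c = idx[corpus[j]]
            h = W_in[idx[w]]             # "forward pass" is a single row lookup
            p = softmax(h @ W_out)       # predict the context word
            grad = p.copy()
            grad[c] -= 1.0               # cross-entropy gradient
            W_out -= lr * np.outer(h, grad)
            W_in[idx[w]] -= lr * (W_out @ grad)

# After training, the rows of W_in are the word vectors.
print(W_in[idx["fox"]].shape)
```

Everything between the input word and the prediction is linear, which is exactly why the resulting space supports the interpretable geometric structure described above.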
I'm uncertain whether you mean the algorithm or the output; the question is interesting either way.
The most common method for producing word vectors, skip-gram with negative sampling, has been shown to be equivalent to implicitly factorizing a word-context matrix whose cells are shifted pointwise mutual information values [1]. A related algorithm, GloVe, achieves a similar result by factorizing a word-word co-occurrence matrix directly [2].
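The factorization view from [1] can be made explicit: build the word-context co-occurrence matrix, convert it to shifted positive PMI, and factorize it with SVD. This sketch is an illustration of that equivalence, not the SGNS algorithm itself:

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, window, k = len(vocab), 2, 5     # k plays the role of the negative-sample count

# Word-context co-occurrence counts within a small window.
counts = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[idx[w], idx[corpus[j]]] += 1

total = counts.sum()
pw = counts.sum(axis=1) / total     # word marginals
pc = counts.sum(axis=0) / total     # context marginals
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / np.outer(pw, pc))
sppmi = np.maximum(pmi - np.log(k), 0)   # shifted positive PMI, as in [1]

# Low-rank factorization: rows of word_vecs are the embeddings.
U, S, Vt = np.linalg.svd(sppmi)
d = 4
word_vecs = U[:, :d] * np.sqrt(S[:d])
print(word_vecs.shape)
```

The point is that nothing in the pipeline is "deep": it is one co-occurrence matrix and one linear factorization.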
You can also view the output as an embedding in a high-dimensional space (hence the name word vectors), but more surprisingly you can learn a linear mapping between the vector spaces of two languages, which makes it immediately useful in translation. From [3]: "Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish".
[3]: "Exploiting Similarities among Languages for Machine Translation" - page 2 has an intuitive 2D graphical representation http://arxiv.org/pdf/1309.4168.pdf
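The linear-mapping idea from [3] fits in a few lines: given vectors for a small seed dictionary of translation pairs (x_i in the source language, z_i in the target), learn a matrix W minimizing the sum of ||W x_i - z_i||^2 by ordinary least squares. The data below is synthetic for self-containedness; real use would take word2vec vectors for, say, English/Spanish word pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs = 10, 8, 50

# Synthetic stand-ins for the two embedding spaces.
X = rng.normal(size=(n_pairs, d_src))          # source-language vectors
W_true = rng.normal(size=(d_tgt, d_src))
Z = X @ W_true.T + 0.01 * rng.normal(size=(n_pairs, d_tgt))  # target vectors

# Least-squares fit of the linear map: solves X @ W ~= Z.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)      # W has shape (d_src, d_tgt)

# "Translate" a new source word: map it into the target space,
# then (in real use) look up its nearest neighbour there.
x_new = rng.normal(size=d_src)
z_pred = x_new @ W
print(z_pred.shape)
```

The 2D figure on page 2 of [3] is exactly this picture: the two spaces are roughly rotations/scalings of each other, so a single linear map lines them up.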
I think people see it as "deep" (even though it's not!) due to the representation learning component. Word2vec is often used as PART of a deep neural network for feature engineering.
- using layers of random forests (trained successively rather than end-to-end). Random forests are commonly used for feature engineering in a stack of learners.
- unsupervised deep learning with modular-hierarchical matrix factorization, over matrices of mutual information of the variables in the previous layers (something I've personally worked on; I'd be happy to share more details if you're interested).
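The first idea above (successively trained layers of random forests) can be sketched as plain stacking with sklearn. This is only an illustration of the general pattern, not the specific methods mentioned; a real implementation would use out-of-fold probabilities at each layer to avoid leakage.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

feats_tr, feats_te = X_tr, X_te
for layer in range(3):
    rf = RandomForestClassifier(n_estimators=50, random_state=layer)
    rf.fit(feats_tr, y_tr)                 # each layer trained in turn, not end-to-end
    score = rf.score(feats_te, y_te)
    # The next layer sees the original features plus this layer's
    # predicted class probabilities -- the "hierarchical representation".
    feats_tr = np.hstack([X_tr, rf.predict_proba(feats_tr)])
    feats_te = np.hstack([X_te, rf.predict_proba(feats_te)])

print(round(score, 2))
```

Whether this buys much over a single forest is exactly the question raised in the reply below.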
Not particularly. What you want from neural networks here is the nonlinear transforms you can apply to the data. There are definitely things to try in this direction, though gradient-boosted trees and other attempts to augment random forests are already pretty mainstream.
Nit: gradient boosting isn't an 'augmentation' of random forests - if anything, it's the other way round. AdaBoost is from 1995, the GBM paper was 1999, and Breiman's random forest paper in 2001 explicitly couches it as an enhancement to AdaBoost.