I got into quant finance 12 years ago with the mistaken idea that I was going to successfully use all these cool machine learning techniques (genetic programming! SVMs! neural networks!) to run great statistical arbitrage books.
Most machine learning techniques focus on problems where the signal is very strong, but the structure is very complex. For instance, take the problem of recognizing whether a picture is a picture of a bird. A human will do well on this task, which shows that there is very little intrinsic noise. However, the correlation of any given pixel with the class of the image is essentially 0. The "noise" is in discovering the unknown relationship between pixels and class, not in the actual output.
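To make that concrete, here is a toy sketch (synthetic data, plain numpy) of "strong signal, complex structure": the label is fully determined by the inputs jointly, yet each input on its own has essentially zero correlation with it, much like individual pixels and the bird label:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Two binary "pixels"; the class is their XOR.
    x1 = rng.integers(0, 2, n)
    x2 = rng.integers(0, 2, n)
    y = x1 ^ x2  # deterministic label: zero intrinsic noise

    # Yet each pixel alone carries essentially no linear signal.
    print(np.corrcoef(x1, y)[0, 1])  # ~0
    print(np.corrcoef(x2, y)[0, 1])  # ~0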
Noise dominates everything you will find in statistical arbitrage. An R^2 of 1% is something to write home about. With this much noise, it's generally hard to do much better than a linear regression. Any model complexity has to come from integrating over latent parameters or manual feature engineering; the rest will overfit.
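For a sense of scale, here is a hedged toy example (synthetic data, made-up coefficient) of what an R^2 of about 1% looks like when fitting a plain linear regression:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 50_000

    x = rng.standard_normal(n)
    # Tiny true coefficient, enormous noise: the signal explains
    # about 1% of the variance of y by construction.
    y = 0.1 * x + rng.standard_normal(n)

    model = LinearRegression().fit(x.reshape(-1, 1), y)
    print(model.score(x.reshape(-1, 1), y))  # roughly 0.01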
I think Geoffrey Hinton said that statistics and machine learning are really the same thing, but since we have two different names for it, we might as well call machine learning everything that focuses on dealing with problems with a complex structure and low noise, and statistics everything that focuses on dealing with problems with a large amount of noise. I like this distinction, and I did end up picking up a lot of statistics working in this field.
I'll regularly get emails from friends who tried some machine learning technique on some dataset and found promising results. As the article points out, these generally don't hold up. Accounting for every source of bias in a backtest is an art. The most common mistake is to assume that you can observe the relative price of two stocks at the close and trade at that price. Many pairs trading strategies appear to work if you make this assumption (which tends to be the case if all you have are daily bars), but they really do not. Others include: assuming transaction costs will be the same on average (they won't; your strategy likely detects opportunities at times when the spread is very large and prices are bad), assuming index memberships don't change (they do, and that creates selection bias), assuming you can short anything (stocks can be hard to short or have high borrowing costs), etc.
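To illustrate the close-price trap specifically, here is a toy simulation (synthetic random walk, invented spread): observed closes bounce between bid and ask, so a naive mean-reversion "strategy" looks profitable if you assume fills at the close, and evaporates once fills are realistic:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 250_000
    half_spread = 0.05

    # "True" mid price: a random walk, so no exploitable signal exists.
    mid = np.cumsum(rng.standard_normal(n) * 0.1) + 100.0
    # Observed closes bounce randomly between bid and ask.
    close = mid + rng.choice([-half_spread, half_spread], size=n)

    # Naive mean reversion: long after a down close, short after an up close.
    ret = np.diff(close)
    position = -np.sign(ret[:-1])

    # Backtest A: pretend we can trade at the close we just observed.
    pnl_at_close = position * np.diff(close)[1:]
    # Backtest B: fills happen at the mid and we pay a stylized half-spread.
    pnl_realistic = position * np.diff(mid)[1:] - half_spread

    print(pnl_at_close.mean())   # spuriously positive
    print(pnl_realistic.mean())  # roughly zero minus costs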
In general, statistical arbitrage isn't machine-learning-bound (1), and it is not a data mining endeavor. Understanding the latent market dynamics you are trying to capitalize on, finding new data feeds that provide valuable information, carefully building out a model to test your hypothesis, and deriving a sound trading strategy from that model is how it works.
(1: this isn't always true. For instance, analyzing news with NLP, or using computer vision to estimate crop outputs from satellite imagery can make use of machine learning techniques to yield useful, tradeable signals. My comment mostly focuses on machine learning applied to price information. )
A few years ago I graduated with a PhD in statistics with lots of ML inspiration. Since then I have always dreamed of applying my knowledge and skills in this domain. However, despite believing I was 'probably' in a decent position to do so, I consistently read about how impossible it was. I have a boring 'normal' person's job, but posts like this are somewhat reassuring that I made a reasonable decision to abandon a life of fruitless data mining and overfitting.
I am keen to second this. With a PhD in probability and loads of experience in data analytics, my experience has told me that we are too ignorant, and sometimes too ambitious, when we try to predict the outcome of a stochastic process (e.g. a financial time series) without realizing that the amount of information required to make a sound prediction is far beyond what we have. Unless there's a very clear dominating signal among thousands of information sources, very often we are trading on noise.
Although I don't necessarily agree with all the points in this article, it just reminds me what Poincaré said:
"You ask me to predict for you the phenomena about to happen. If, unluckily, I knew the laws of these phenomena I could make the prediction only by inextricable calculations and would have to renounce attempting to answer you; but as I have the good fortune not to know them, I will answer you at once. And what is most surprising, my answer will be right."
-- Poincaré, H. (1913) The Foundations of Science. New York, The Science Press. p. 396.
I don't think the message here is "don't do it," but "have domain knowledge." The crux of the paper was scientists applying ML to a bunch of data without really understanding trading.
You can actually have scientists find signals in data they have no domain experience in. In a typical hedge fund the quantitative researchers will be a different group from the quantitative developers and traders. There are fuzzy lines between those depending on culture, but those three groups are broadly the front office. You really need domain experience for execution and risk management, but pure insights can be derived without necessarily needing any domain experience.
That said, quant researchers typically do come to understand how the market works; they are just able to excel quickly even without a background in it.
It's easy to forget that this is a highly competitive field.
You're used to seeing the techniques you work with capture signal because there isn't an army of PhDs in math, physics, and computer science working around the clock to trade any signal out of that data.
In the end, it doesn't even matter if you're the best statistician in the world: whatever signal you detect may simply not be worth the effort you put into detecting it.
> Most machine learning techniques focus on problems where the signal is very strong, but the structure is very complex. For instance, take the problem of recognizing whether a picture is a picture of a bird. A human will do well on this task, which shows that there is very little intrinsic noise. However, the correlation of any given pixel with the class of the image is essentially 0. The "noise" is in discovering the unknown relationship between pixels and class, not in the actual output.
Could it be that by looking only at the price time series we are not looking at the actual information but only at the output of an irreversible function, and that to effectively predict prices we need a model that captures what actually happens in the real world?
It's hard for me to follow exactly what you're getting at here, but (to make an analogy to cryptography) it seems like you're saying it's hard to find a signal because we only have the apparently random output of a function, not the seed itself.
It's fairly true that (at least these days) you're not going to identify a signal just by looking at a timeseries of prices, no matter how granular your dataset is (up to and including tick data). There are pockets of repeating patterns but those are vanishingly small and fleeting; the prices themselves may as well be stochastic.
Essentially all funds are empowered with significant amounts of data, and the prices themselves are just used for backtesting and sanity checking. Price data is the source of truth, but it's not where new insights are identified. The signal comes from other types of data that are far more reversible.
"""
Our historical dataset contains 5 minute mid-prices for 43 CME listed commodity
and FX futures from March 31st 1991 to September 30th, 2014. We use the
most recent fifteen years of data because the previous period is less liquid for
some of the symbols, resulting in long sections of 5 minute candles with no price
movement. Each feature is normalized by subtracting the mean and dividing by
the standard deviation. The training set consists of 25,000 consecutive observations
and the test set consists of the next 12,500 observations.
"""
Could you elaborate on news & NLP with regard to stocks?
We tried sentiment analysis in uni a few years ago and had no good results:
The idea was essentially: the news says 'stock A is great' -> the stock goes up shortly thereafter.
We tested our algorithms by classifying Amazon reviews and tweets by sentiment. Those are filled with sentiment, and it's easy to detect whether something is a 5-star review or a 1-star review. The news articles we parsed all had near-neutral sentiment. We ended up building a classifier that could detect the news category of an article quite easily instead.
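For reference, the kind of bag-of-words baseline we used looks roughly like this (a hedged scikit-learn sketch; the texts and labels are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder corpus: article text paired with a category label.
    texts = ["Oil spill hits coastal waters", "Quarterly earnings beat estimates"]
    labels = ["environment", "finance"]

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    print(clf.predict(["Tanker leaks crude near the gulf"]))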
My initial idea was sparked by the Gulf spill and the subsequent dip in BP; I wanted to detect and capitalise on big events like that, but the news sources we parsed always seemed to lag significantly behind the stock movement, too.
I just did a project on this, but for Bitcoin instead of stocks where I examined news, Reddit, forum (Bitcointalk.org) and IRC sentiment using some simple ML algos. The goal was to determine whether this data has any predictive causality.
I scraped the above sources over a full year (2015) and then had the data annotated on positive, negative and neutral sentiment.
The problem with labeling sentiment data is that there might not be a single 'true' label, due to varying interpretations and ambiguity. So at best you'll get to 80-85% accuracy there. The less formal the source (news > Reddit/forum > IRC), the lower your accuracy, due to the lack of context.
I then matched the annotated sentiment to market data and did some causality analysis. Interestingly, what I found is that you can't just say positive news = price/volume goes up; it is way more fine-grained than that. For example, negative Reddit sentiment leads price movements, but price movements lead positive sentiment. For news it's the reverse.
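A lead/lag check like that can be sketched with a Granger causality test; here's what it would look like with statsmodels, assuming aligned daily series (the file and column names are hypothetical):

    import pandas as pd
    from statsmodels.tsa.stattools import grangercausalitytests

    # Hypothetical aligned daily series.
    df = pd.read_csv("btc_sentiment_daily.csv")

    # Tests whether the second column helps predict the first,
    # i.e. does Reddit sentiment Granger-cause price changes?
    grangercausalitytests(df[["price_change", "reddit_sentiment"]], maxlag=4)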
All in all I didn't incorporate this into any trading strategies, but found it interesting to see the differences between online sentiment channels.
That distinction from Hinton is quite interesting. I often work with "machine learning" models, but for the most part they are regression models and/or tree classifiers that are easier to understand conceptually. I have yet to implement any complex machine learning techniques because I fear that, whether or not they work, I won't be able to interpret them statistically.
What's your view on cases like Renaissance Technologies? There's an interview with James Simons online (https://www.youtube.com/watch?v=QNznD9hMEh0) where he explicitly talks about using math models to detect anomalies, e.g. trends. It's also known that they've used Hidden Markov Models, at least in the early days.
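For what it's worth, that flavor of model is easy to sketch with hmmlearn (my choice of library here, not necessarily theirs): fit a two-regime Gaussian HMM to returns and read off the inferred regime per observation:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(0)
    # Placeholder returns; in practice these would be real price changes.
    returns = rng.standard_normal((1000, 1)) * 0.01

    # Two hidden regimes (e.g. calm vs. volatile) with Gaussian emissions.
    model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
    model.fit(returns)
    regimes = model.predict(returns)  # most likely regime per observation
    print(model.means_.ravel(), np.bincount(regimes))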
Rentec uses machine learning, but more importantly the firm curates massive amounts of high-signal data. The most significant part of their work lies in process automation and the rapid testing of hypotheses, which empowers the optimal use of mind-numbing amounts of data that most other firms simply can't take advantage of. Very early on their success was due (in part) to the willingness of Simons et al to use correlations in disparate datasets which could be proven but which didn't really make sense, and which wouldn't be explained by anything intuitive.
In other words, Rentec is not just pointing machine learning models at data, they're investing in a very robust data processing pipeline. Everything before the analysis is just as important as the analysis at funds like theirs.
Just want to second this comment, their data processes are the key strength of Medallion. Grandparent comment by murbard2 also talks about the importance of this component to quant work (in the last paragraph: "finding new data feeds that provide valuable information")
While Jim Simons is a mathematician and Rentec has clearly hired many brilliant people with PhDs, it's maybe worth mentioning that the actual mathematics used in their work isn't impossibly difficult, high level, or secretive. Many of the PhDs working there are not in math but in fields like physics, so if you are familiar with graduate-level math courses you can understand the math needed for this type of work. Math isn't where their edge comes from. Also, Medallion is 30 years old; their early work in the mid 80s was done on computers with less processing power than your phone. "Machine learning" as the term is used lately, or access to supercomputing hardware no one else knows about, is also not where their edge came from.
Well said. Where most funds have the same problem 'chollida1 describes here[1], Rentec (and other similar firms) moved past that by establishing the right culture and investing in the right technology from the outset.
They need smart people, but hiring the smartest people and having the most sophisticated models won't do you any good if you can't acquire high signal data, can't clean that data properly and can't rapidly backtest. And if you can't do any of that, adding more data is just going to add more noise.
Given your background, I'd be interested in picking your brain a bit for a few projects I'm working on. If you're looking to remain anonymous would you mind sending me an email (in my profile), or throwing an email up in yours?
Is there really that much high-signal data that is not already being used? Is their advantage in finding signals in data that other people overlook, or finding new data sources?
> Is there really that much high-signal data that is not already being used?
Yes.
> Is their advantage in finding signals in data that other people overlook, or finding new data sources?
Yes.
I won't go into any particular detail, but there is a lot of signal in the market for those who are imaginative. The obvious and low hanging fruit is long gone, but there are still many places that offer an edge.
I wish there were some sort of full intro tutorial on finding strategies, i.e. an example of a former signal (now traded away), the thought process, the data sourcing, the statistical analysis, the trading/signal strategy, etc.
The thing is that no one is really motivated to make a complete tutorial on finding strategies because it's economically irrational. You're either giving away specific sources of alpha or you're empowering potential competitors. This is why it's virtually guaranteed that anyone selling courses that teach trading or related skills is a fraud: if their methods actually worked, they'd have every incentive to ramp up their own trading capital instead of selling them.
The industry is also extremely secretive (necessarily so). You'll hardly ever find a good treatise on finding novel signals, but there are tutorials on algo trading in general with examples of production strategies that used to work which have been, as you say, traded away. For that purpose I'd recommend you start here: http://www.decal.org/file/2945
It's public (technically it has to be in order to be strictly legal for use). But for the most part it is unintuitive, unclean (needs to be heavily normalized) and not easily accessible. There are a variety of vendors that source it, clean it and analyze it to make it salable to firms. Quantitative firms also have teams devoted to doing all of that internally.
I like the responses to this already. But I'll add that there's a difference between what I lovingly call throwing poop at the wall, and using machine learning to estimate non-linear functions of structural models or to combine signals that already have alpha.
ML can be very useful if you have some signal or if you have a model.
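A sketch of that saner use case, combining a few signals that individually carry weak alpha with a regularized linear model (data and coefficients here are invented):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n = 20_000

    # Three weak signals, each slightly correlated with next-period returns.
    signals = rng.standard_normal((n, 3))
    future_ret = signals @ np.array([0.02, 0.01, 0.015]) + rng.standard_normal(n)

    # Ridge regularization keeps the weights from chasing noise.
    combiner = Ridge(alpha=10.0).fit(signals, future_ret)
    print(combiner.coef_, combiner.score(signals, future_ret))  # tiny R^2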