
Machine Learning for Equity Price Trend Prediction

Coming from Python and being relatively new to C#, I thought it would be helpful to have an example strategy that utilizes the Accord.NET machine learning library. Since I couldn't find any good examples, I coded one myself.

This strategy trains a linear SVM (Support Vector Machine) on historical returns. Then, at the market open, it attempts to predict whether the market will close UP or DOWN. If the trend is predicted to be UP, we enter a LONG position (if we are already LONG, we do nothing). If the trend is predicted to be DOWN, we exit the market. As mentioned, I am quite new to C#, so there might be bugs and there is certainly room for improvement. It would be nice to discuss possible improvements and ideas here.
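For anyone skimming the thread, the daily decision boils down to something like the sketch below. It is not the exact code from the attached backtest: _svm, _symbol and GetRecentReturnChanges() are assumed names, and it presumes an already-trained Accord SupportVectorMachine.

// Sketch only; assumed names, not the attached algorithm's exact code.
// _svm: a trained Accord.NET SupportVectorMachine, _symbol: the traded equity.
private void TradeOnPrediction()
{
    double[] features = GetRecentReturnChanges();  // hypothetical helper building the +1/-1 lag vector
    bool predictedUp = _svm.Decide(features);      // Accord 3.x returns a boolean class decision

    if (predictedUp && !Portfolio[_symbol].Invested)
    {
        SetHoldings(_symbol, 1.0);   // predicted UP and currently flat -> go LONG
    }
    else if (!predictedUp && Portfolio[_symbol].Invested)
    {
        Liquidate(_symbol);          // predicted DOWN and currently LONG -> exit
    }
    // predicted UP and already LONG -> do nothing
}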

That's awesome Gene, I've never used the Accord framework much. That is some pretty intense for-looping :D Maybe @MichaelH could apply some LINQ magic to it to make it easier to follow. Essentially you're building the input training data by hand.

I wonder if it uses random numbers to initialize the SVM? For me it jitters to a different solution each time. It would be nice if we could standardize those initialization parameters so it's more predictable.


Hey @Gene!

Thanks for sharing this awesome SVM example! Maybe we can work on making the training data sets easier to produce. What kind of functions would you use in Python to perform the same operations? I could (probably) write some C# functions that behave very similarly to the Python functions you're used to. C# has LINQ, which is a very powerful, monadic library of functions for iterable sequences. Very cool stuff!

I went to take a stab at reducing some of those loops into (possibly) more readable LINQ statements, but without knowing the exact intent it was kind of hard (comments!). I surmised that we're producing bit arrays indicating whether we went up or down based on the close values in the rolling window, three results per day.

Maybe what we need is a RollingWindow of RollingWindows. Sounds crazy, but what if, when we added a new sample, it recursively added it to its children with a specified offset? I think it may then be very easy to define the input training data minimally.
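Something like this sketch of the idea (purely illustrative; the decimal element type and the way the snapshot is filled are my assumptions, not anything from the attached code):

// Outer window: one row per day, holding that day's snapshot of the last
// inputSize returns. Oldest rows fall out automatically after trainSize days.
var history = new RollingWindow<RollingWindow<decimal>>(trainSize);

// Each day, after the newest daily return has been added to returnsWindow
// (assumed to be a RollingWindow<decimal> holding at least inputSize returns):
var snapshot = new RollingWindow<decimal>(inputSize);
for (int i = inputSize - 1; i >= 0; i--)
{
    snapshot.Add(returnsWindow[i]);   // add oldest first so snapshot[0] is the newest
}
history.Add(snapshot);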

Awesome stuff, thanks for sharing!


Hi @MichaelH,

I guess I should have been a little more clear about what is going on in the code. I actually do use a rolling window to store daily data. Here is a run-down of what is happening:

The input to the SVM consists of an inputSize x trainSize array. In this simple example, I use the change in the daily return: a +1 indicates a positive change, while a -1 indicates a negative change.

First I build the 2-D input array: every day, I collect inputSize past daily returns (3 in the default case). I do this for a rolling window of trainSize past days (30 by default), so I end up with a 2-D array that is fed into the SVM.

I then build the 1-D output array, starting with today and rolling backward trainSize times.

What I end up with is something like the following:

Input Array:
t-1, t-2, t-3
t-2, t-3, t-4
t-3, t-4, t-5
...

Output:
t
t-1
t-2
...

So the inputs (t-1, t-2, t-3) should be used to train the SVM to predict the return at time t. Finally, when I go to make an actual prediction, I feed in the data (t, t-1, t-2) and try to predict (t+1), tomorrow's change. In Python/Pandas, building the inputs and outputs for training the SVM is very easy; I do something like the following:

import pandas as pd

# df is assumed to hold the daily bars with a "Close Price" column.
# tslag gets the shifted price inputs, one column per lag
tslag = pd.DataFrame(index=df.index)
for i in range(0, inputSize):
    tslag["Lag%s" % str(i + 1)] = df["Close Price"].shift(i + 1)
# now tslag holds the returns (i.e. pct change) of each lagged column
tslag = tslag.pct_change()

# tsout will hold the returns and be the target for the SVM
tsout = df["Close Price"].pct_change()


Unfortunately, my C# skills are not up to snuff, so I'm not sure how to do the above in C# w/o lots of looping.
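For comparison, a LINQ version of building the training arrays might look roughly like the sketch below. It's untested, and it assumes the +1/-1 return changes already live in a list called signs, ordered newest first with signs[0] = today, which is not necessarily how the attached code stores them.

using System.Collections.Generic;
using System.Linq;

// signs: List<int> of +1/-1 daily return changes, newest first (signs[0] = t).
// Needs at least trainSize + inputSize elements.
int inputSize = 3;
int trainSize = 30;

// Row k holds (t-k-1, t-k-2, t-k-3) ...
double[][] inputs = Enumerable.Range(0, trainSize)
    .Select(k => Enumerable.Range(1, inputSize)
                           .Select(lag => (double)signs[k + lag])
                           .ToArray())
    .ToArray();

// ... and the corresponding label is the sign at t-k.
int[] outputs = Enumerable.Range(0, trainSize)
    .Select(k => signs[k])
    .ToArray();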

Hi Gene,

Thank you for sharing your idea. It gave me some of my own. I think it's too much to ask an SVM to predict the next bar's direction from past bars alone. I believe a more profitable approach is to train an SVM to validate setups. Suppose you have an awesome setup, like an MA cross up: the MA crosses up, you buy, stop loss at 1R, take profit at 2R (or use some signal to exit). Obviously, it doesn't always work. So you train the machine to validate it. The setup has indicator data points that make it fire (MA separation, MACD, a few past bars, day of month, month of year, whatever feels right), and after it completes it produces its own data point: success or failure. You train the machine on these data points after each setup completes. It's a bit more work because these setups first need to be simulated, and you only trade once the machine is trained.
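In code, the bookkeeping for that could be as simple as this sketch (illustrative only; the SetupRecord type and the feature list are made up for the example):

using System.Collections.Generic;

// One record per completed setup: the indicator snapshot when it fired,
// plus whether the simulated trade hit 2R before the 1R stop.
public class SetupRecord
{
    public double[] Features;   // e.g. MA separation, MACD, last few bars, day of month...
    public bool Succeeded;
}

private readonly List<SetupRecord> _completedSetups = new List<SetupRecord>();

private void OnSetupCompleted(double[] featuresAtEntry, bool hitProfitTarget)
{
    _completedSetups.Add(new SetupRecord { Features = featuresAtEntry, Succeeded = hitProfitTarget });
    // Periodically retrain the SVM on _completedSetups, and only take real
    // trades once it is trained and agrees with the fired setup.
}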

That's what I'm going to do. Maybe it's worth something.

Dmitri

Hey @Gene,

Thanks for the walk-through on the inputs/outputs of the system. That's in line with what I thought it was doing. Hopefully we'll get some Python support into LEAN sooner rather than later; in the meantime I may try to whip together some functions to make things easier for you guys.

Hey @Dmitri,

Interesting concept! I would love to see what you come up with. Your idea actually inspired another one of my own. This would require a fair amount of processing power, but in short: what if you had a genetic algorithm building the setup and an SVM validating it?


So I realized the stop-loss function was not working as intended in the original code because I was using the Liquidate() function, which apparently closes the position at the end of the day instead of immediately. I've fixed the stop loss, and now the code seems to behave better.


Hey @Gene,

There's a problem with how you store returns. Arrays are passed by reference, so you need to move

double[] returns = new double[inputSize];

inside the outer training loop, i.e. under

for (int i = 0; i < trainSize; i++)

otherwise all your input rows will reference the same array and end up identical.
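A minimal illustration of the reference issue (the variable names and the GetReturn() lookup are placeholders, not Gene's exact code):

// Buggy version: every row of `inputs` points at the SAME array, so the
// last values written win and all training rows look identical.
double[][] inputs = new double[trainSize][];
double[] returns = new double[inputSize];          // allocated once -> shared
for (int i = 0; i < trainSize; i++)
{
    for (int j = 0; j < inputSize; j++)
        returns[j] = GetReturn(i + j + 1);          // hypothetical lookup
    inputs[i] = returns;                            // stores a reference, not a copy
}

// Fixed version: allocate a fresh array for each training row.
for (int i = 0; i < trainSize; i++)
{
    double[] row = new double[inputSize];
    for (int j = 0; j < inputSize; j++)
        row[j] = GetReturn(i + j + 1);
    inputs[i] = row;
}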

I played with SVM predictions, but didn't want to share this project because it's pretty rough. But since I'm now concentrating on more basic stuff like position management, I'm sharing it in the hope it can help those who are walking this path. SVM learning has potential, but I think it requires some preparation to be used profitably.


Thanks for sharing this Gene. Let's say I use Accord.MachineLearning to create an SVM trained with a few years' worth of data through an external program I write. After training, I save this SVM to a file. Is it possible to upload my trained SVM to QuantConnect and use it from within my algorithm? So far, I've only been able to create new code files.

I'm not sure what kind of format the data would have, but you could certainly make a constant string that represents the data and save it in a .cs file:

const string SvmData = @"1,2,1,0,2,-1";

And then reference it from your algorithm and use it to hydrate your SVM instance.
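For example, something along these lines (just a sketch; SvmData is the placeholder constant above, and the layout of the values is whatever your export uses):

using System.Globalization;
using System.Linq;

// Parse the embedded comma-separated values back into doubles at startup...
double[] weights = SvmData
    .Split(',')
    .Select(s => double.Parse(s, CultureInfo.InvariantCulture))
    .ToArray();
// ...and use them to rebuild/hydrate your SupportVectorMachine instance.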


@Robert you can also import data using WebClient, which is a C# class. I've posted a new QC University algorithm, "QCU How Do I Import My Training Data Sets?", which has this snippet:

using (var wc = new WebClient())
{
    // Point the web client to your own data store.
    _data = wc.DownloadString("https://www.google.com");
}


@MichaelH,
Good point. Yes, for a trained SVM I could export the weights and store them in a class file, maybe as an array of doubles. But for other trained models it would be more difficult. Many of Accord's models only support saving/loading to a file or stream. I suppose it would be possible to serialize the stream to a base64 encoding and store that in the class file, but I'd prefer to avoid that.

Jared,
Thanks, yes I can make that work. To be clear, I don't want to load the data and retrain the model on every initialization; I just want to use the trained model in the algorithm. I can post my serialized model on Google Drive or Dropbox, make the file public, load it using the WebClient during Initialize, and finally let Accord load it from the resulting stream.
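Roughly like this sketch (the URL is a placeholder, and the final deserialization call depends on which Accord version and model type you're using):

using System.IO;
using System.Net;

// Download the serialized model bytes once during Initialize()...
byte[] modelBytes;
using (var wc = new WebClient())
{
    modelBytes = wc.DownloadData("https://example.com/my-trained-svm.bin"); // placeholder URL
}

// ...then hand Accord a stream to deserialize from (the exact load call
// depends on the Accord version, e.g. its Serializer/Load helpers).
using (var stream = new MemoryStream(modelBytes))
{
    // var svm = Accord.IO.Serializer.Load<SupportVectorMachine>(stream);  // if available in your version
}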

EDIT: this won't work because Accord uses a binary formatter to serialize. Sorry!
Hi @Robert,

you can convert a string to bytes and write them to a memory stream, or just create a memory stream from the bytes:


const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: ? ??";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);

var stream = new System.IO.MemoryStream(bytes);

I've found that more recent versions of Accord have a Serializer class, which is more flexible:

http://accord-framework.net/docs/html/T_Accord_IO_Serializer.htm

I've been toying with the Accord libraries, so thanks for sharing. One thing I'd like to do is incremental learning, which I'm not sure Accord supports. It appears my only option would be to add new results to the learning set and completely rebuild the SVM. Does anyone know of an alternative approach or library for incremental learning?


I usually just retrain everything with a certain lookback period (so once enough data has accumulated, the old data outside the lookback gets truncated).
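For instance, something like this sketch keeps the training set bounded (the names are illustrative, not from any shared algorithm):

using System.Collections.Generic;

// Keep at most `lookback` training samples; drop the oldest when full.
private readonly Queue<double[]> _trainInputs = new Queue<double[]>();
private readonly Queue<int> _trainOutputs = new Queue<int>();

private void AddSample(double[] features, int label, int lookback)
{
    _trainInputs.Enqueue(features);
    _trainOutputs.Enqueue(label);
    while (_trainInputs.Count > lookback)
    {
        _trainInputs.Dequeue();
        _trainOutputs.Dequeue();
    }
    // retrain the model from _trainInputs/_trainOutputs here (full rebuild)
}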

In some cases you can do incremental fitting in Accord by just running a single (or a few) learning iterations on your model with a new piece of data; logistic regression can do this, for example. However, most of the fitting then ends up on the most recent data, and there's not a lot of control over this.

Can't use Accord in the QC cloud at the moment due to a class load error, but that's of course no problem if you're running locally. It's a shame because Accord is the only whitelisted library with an SVM.


After googling a bit, incremental learning for SVMs is supposedly possible but difficult, and I haven't seen support for it in Accord.


Yes, I've found some obscure options for incremental learning, but I think I would miss the rich features of Accord. Besides, I expect there may be diminishing returns in prediction accuracy as the lookback grows. There may be a threading solution to getting adequate backtest performance with frequent model retraining.


One thing I've considered is to simply have models retrain on a background thread on increasingly extensive data (e.g. an increasing lookback) until a deadline, or until the main thread submits a new data set. That is similar to the iterative deepening concept in game AI. However, I'm typically wary of doing something in a backtest that will work differently when running live (e.g. a backtest with once-a-day retraining would quickly cut off training, whereas the live version would probably reach the maximum lookback).

And yes, the lookback is a hyperparameter that's likely to have a large impact on the model's accuracy in practice on live data, and what's worse, in most cases the best lookback varies over time...


It would probably be possible to do it like this, however: retrain the model with a fixed lookback; the first time, wait for it to finish on the main thread; after that, let the main thread continue with the last model that finished training (and just hand it the next training set). So in a backtest one would be using outdated models, with probably worse performance than the live version, which would have more recently trained models.
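A sketch of that pattern (illustrative only; the Train() call is a placeholder and the model type is just an example):

using System.Threading.Tasks;
using Accord.MachineLearning.VectorMachines;   // type used for illustration only

private volatile SupportVectorMachine _currentModel;   // model currently used for predictions
private Task _trainingTask = Task.CompletedTask;

private void OnNewTrainingSet(double[][] inputs, bool[] outputs)
{
    if (_currentModel == null)
    {
        // First time: train synchronously so there is something to predict with.
        _currentModel = Train(inputs, outputs);          // Train() is a placeholder
        return;
    }

    // Afterwards: if the previous background retrain has finished, start a new
    // one; meanwhile keep predicting with the last finished (older) model.
    if (_trainingTask.IsCompleted)
    {
        _trainingTask = Task.Run(() => _currentModel = Train(inputs, outputs));
    }
}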


Have to agree that a background learning task is a minefield. In theory a less delayed lookback will lead to more accurate predictions, but it might equally be that the more recent signals are the noise of indecision that comes before a significant move. You're only standing on steady ground with backtest behaviour that you're confident will reproduce live. The problem is that, regardless of the trade frequency, relearning a significant set is simply too slow to backtest. Of course caching is an option, but then one tweak here or there and you need to rebuild your model cache.


Somewhat predictably, a simpler normalized Bayesian algorithm performs well enough to allow online re-learning of a set of adequate size. I imagine it's possible to optimize an SVM to achieve better performance, but that is more a matter for the experts than for someone like myself who's just dabbling to satisfy curiosity.

 


This paper has findings relevant to this topic:

http://www.jonathankinlay.com/Articles/ONE%20MILLION%20MODELS.pdf

In the study, they automated backtesting of a wide random selection of technical indicators and concluded that machine learning algorithms perform poorly for trend prediction on out-of-sample data; the study also debunks papers with findings to the contrary as being subject to period fitting.

They do, however, find that out-of-sample performance is superior for algorithms that focus on price, volume and order book data rather than technical indicator inputs.


Correction to previous post: the findings of the study were that volume was not very effective for predictions.


Thank you James, it's a nice and easy-to-read paper. IMO they're not using a good window period (1,000 days) or sample weighting (uniform). The market rarely cares what happened 3 years ago...


I think the length of the training window may be intended to demonstrate that they have not period-fit to a particular regime and that the model should be able to adapt to switches in trend.

I have this dilemma myself: the length of the training window has a dramatic effect on results. My finding, however, is that the model tends to be struck with indecision when the lookback is overly long.

Is there any consensus on methods to determine a valid range for the lookback period? I just rely on maximizing Sharpe.


I'm convinced one actually should fit to the most recent data and regime. They've certainly managed to show that the opposite didn't work for them, at least. My highly informal experience is that, in the context of daily bars, looking more than a few months back is detrimental to most signals. Yeah, there won't be a lot of training data; that is a necessary problem, for if it were not, someone would already be detecting and trading away the pattern.

I'm not aware of any consensus on determining lookback periods; I try to determine it using typical parameter optimization to maximize CAGR/MaxDD. I should add that I'm not live trading any of the algos I've tried, so I'm not an authority on the subject.

 


Awesome discussion!

First, I'd like to focus on the million-backtest paper. I really like the first part, where they test the use of an SVM.

But I have serious doubts about the second part, the one where they run 1,000,000 backtests. 11Ants is ML software specialized in customer loyalty, and the model ranking is based on a black-box "combination of goodness of fit measures". I find the difference between the care and detail of the first part and the bulk, I-have-no-idea-what-happens-here second part kind of disappointing.

As in many papers on this subject, the critical element, the backtesting, is assumed to be correct. In this case there is a subtle detail that suggests some look-ahead bias in the second part's backtests. They standardized the data by "subtracting the in-sample mean and dividing by the standard deviation, also estimated from the in-sample data." So, in the out-of-sample backtest, they would need the whole out-of-sample data set to standardize even the first observation. I'm not sure whether this detail is significant but, in principle, it shows that the experiment design has some flaws. Finally, I think everyone who uses ML techniques knows that in-sample performance is always better than out-of-sample.
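To make the point concrete, the leak-free approach is to reuse the in-sample statistics when standardizing out-of-sample observations, roughly like this sketch (inSample and outOfSample are assumed double[] arrays, not anything from the paper):

using System;
using System.Linq;

// Estimate mean/std on the in-sample window only...
double mean = inSample.Average();
double std = Math.Sqrt(inSample.Select(x => (x - mean) * (x - mean)).Average());

// ...and reuse them to standardize out-of-sample observations, so no future
// (out-of-sample) information leaks into the inputs.
double[] standardized = outOfSample.Select(x => (x - mean) / std).ToArray();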

Now, with respect to the training window, I recall a quote from Ehlers's book Cycle Analytics for Traders where he compares prices to the meandering of a river:

Viewed as an aerial photograph, every river in the world meanders. […] Every meander in a river is independent of other meanders, and thus all are completely random. If we were to look at all the meanders as an ensemble, overlaying one on top of the other like a multiple-exposure photograph, the meander randomness would also become apparent. […] However, if we are in a given meander, we are virtually certain of the general path of the river for a short distance downstream. The result is that the river can be described as having a short-term coherency but is random over the longer span.

River meanders are the kind of cycles we have in the market. We can measure and use these short-term cycles to our advantage if we realize they can come and go in the longer term.

Those are my two cents.



I see what you mean about the two parts of the paper. They methodically demonstrate that most ML studies are not conscientious in their out-of-sample testing, which is a truism to the point of cliché. It may be ill-advised to try to proceed from this kind of common knowledge towards an inductive proof of the wider inadequacy of a whole breed of models; the problem arena is simply too vast to be captured by any set of experimental conditions. I'm curious which aspects of their experimental bias meant that only price data was an effective indicator. It sounds like you're suggesting that the cohort ranking is questionable, and the other thing out of place is the lookback period. It would still be interesting (if possible) to correct the bias of the experimental conditions in order to rank technical indicators by their predictive power.

"having a short-term coherency but is random over the longer span"

This is interesting, as several of the ML papers I've come across conclude (from their in-sample testing) that trend prediction is far more accurate on longer time scales than on short ones. I suppose you could attribute this to the greater ease of curve fitting with aggregated bars over long timescales. This brings us back to random meandering: the effective strategies at short timescales proceed from the acceptance that price is analogous to a stochastic process which is coherent only in retrospect.


James Smith, you nailed it! That's what I meant. But after a second review I realized the paper is from 2011; maybe back then the insight was more informative than it is now.

Sounds like you're suggesting that the cohort ranking is questionable.

In some sense yes, it is a black box. The authors say nothing about how the ranking is defined.

What follows is just bar conversation:

With respect to the short-term coherency, I can't find an article I saw some time ago about the behavior of prices at different time frames. I can't even recall whether the test was unit-root, randomness (something like Kolmogorov-Smirnov) or returns autocorrelation. But the finding was that at lower time frames prices show more coherency (or less randomness), and thus are potentially more exploitable.

Along the same line of reasoning, some days ago Patrick Star said in this post:

I was finally able to test some of my best performer algorithms with Minute and Second and I must say TA works just fine.

Finally,

…the effective strategies at short timescales proceed from the acceptance that price is analogous to a stochastic process which is coherent only in retrospect.

I love it! That's why I think the best ML technique to explore is reinforcement learning; we need to teach an AI to gamble safely ;)



Yes, again as I mentioned there, your biggest enemy is news (in general: news, events, reports, or any external happenings). In daily data there is a long period of time between two bars in which events can easily impact the price drastically while the market is not open. A conflict, a political issue, a diplomatic agreement, an election, a new law passed in another country, a report: things your machine does not know about but that are significant to humans. There is no way for any ML algorithm to predict how impactful any of these events will be on the next day's open price just by looking at the past, especially when even two very experienced, well-aware traders could interpret the same event in two completely different ways!


Hi,

@Gene Wildhart (and others), Can you implement your algo in Python?

Is there any example algo with machine learning in Python here at QC?

Are we able to import the scikit-learn library in Python?

Thanks :) 


