How to unit test, the right way

It’s important to understand not only what unit testing is, but how to implement unit tests correctly. Many developers live their lives going through the motions, scraping by, but not you. No, you’re a cut above the rest, on a quest to write unit tests that are the envy of your colleagues. Sound like you? Keep reading. Not resonating? Keep reading anyway; I’m sick of cleaning up after you. 🙂

Unit testing is the process of breaking up a program into components to test individually. This can save many hours of manual QA before release. With proper unit tests, you can be sure the program is still working as expected, even after sweeping code changes.

At the start, unit tests can feel like a waste of time. It’s true, doing them properly will require you to write more code, but I promise the time you’ll save in the long run far outweighs the time spent writing tests.

Defining Scope

A common dilemma when first learning: properly defining scope. It’s easy to make the tests either too broad or too specific. The guideline I found to be the most useful? Think of the test as documentation. When writing a function, it’s common to describe its inputs and expected outputs. Unit tests verify this input/output relationship. They should not care how the insides of the method work, so long as it yields the expected result.

In general, a unit test should:

  • Test only a single method
  • Provide specific arguments to the method
  • Verify the result

Let’s dive deeper into each of these points.

Testing a single method

This is exactly what it sounds like. A test should target a single method. Take a look at the following pseudo code.

def test_add():
    assertEqual(8, add(5,3))

The snippet above presents a simple test for a function called add. If we want to make the test more robust, we append lines to the existing test function rather than needlessly creating another test, unless there is a clear difference. For instance, specific edge cases are commonly factored into their own testing method. When incorporating bug fixes, it’s also common to add a specialized test case that references the issue to safeguard against repeating past mistakes.

This is good:

def test_add():
    assertEqual(8, add(5, 3))
    assertEqual(-1, add(1, -2))
def test_mult():
    assertEqual(10, mult(5, 2))

In contrast, this is generally frowned upon:

def test_add_simple():
    assertEqual(8, add(5, 3))
def test_add_negative():
    assertEqual(-1, add(1, -2))

Here there is more than one test for a single method. As mentioned above, an exception could be made if the negative instance is addressing a specific issue in a bug tracker. In that case, a comment should be added with a link to the issue or bug ID.
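As a rough sketch of that exception, here is how a regression test with a tracker reference might look in Python’s unittest. The add function is a stand-in so the snippet runs on its own, and the issue ID BUG-123 is a made-up placeholder, not a real ticket.

```python
import unittest


def add(a, b):
    """Stand-in implementation so the example runs on its own."""
    return a + b


class TestAdd(unittest.TestCase):
    def test_add(self):
        # The general-purpose test for add lives in one place.
        self.assertEqual(8, add(5, 3))
        self.assertEqual(-1, add(1, -2))

    def test_add_large_negatives_regression(self):
        # Regression test for hypothetical issue BUG-123 (placeholder ID).
        # Replace this comment with a link to the real tracker entry.
        self.assertEqual(-2000000, add(-1000000, -1000000))
```

The second test earns its own method only because it documents a specific past bug; without the tracker reference it would belong inside test_add.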

The snippet below is also bad: more than one method is tested in a single unit test.

def test_math_functions():
    assertEqual(8, add(5, 3))
    assertEqual(-1, add(1, -2))
    assertEqual(10, mult(5, 2))

Provide specific arguments to the method

Do not generate a unique argument every time the test is run. Hard coding is not only okay, but often encouraged (in this context, don’t get carried away)! It’s imperative that unit tests are deterministic. If method arguments are generated on the fly, one developer may get an error while another passes every test. Most important takeaway: testing arguments should be constant across every instance of the program. Let’s look at an example.

Suppose you have a faster way to implement len called my_len() that you want to test.

This is good:

def test_my_len():
    assertEqual(3, my_len("abc"))
    assertEqual(0, my_len([]))
    assertEqual(3, my_len([1,2,3]))

This is bad:

def test_my_len():
    l = generate_random_list()
    assertEqual(len(l), my_len(l))

Different code paths could be tested depending on the list that’s generated.

Another canon: always use the fewest assert statements possible for full coverage. If my_len treats “abc” and “abcd” exactly the same way internally, there is no reason to write two assert statements; just pick one. On the other hand, if the method has a specific if statement to check for a null argument, then absolutely include that.

This is good:

def test_my_len():
    assertEqual(3, my_len("abc"))
    # Check for defined edge case
    assertEqual(-1, my_len(None))

This is bad:

def test_my_len():
    assertEqual(3, my_len("abc"))
    assertEqual(4, my_len("abcd"))
    assertEqual(5, my_len("abcde"))

There’s no reason to believe “abc” would pass and “abcd” would fail. The same code path is being tested multiple times. This is wasteful.

Verify the result

Not much to say here. Like the arguments above, the result should also be deterministic. One stylistic note is worth mentioning: most unit testing frameworks expect assertEqual(...) to have the expected result as the first argument, and the actual result as the second.

assertEqual(expected, actual)

Of course, the tests will work if this is backwards, but your coworkers may scoff.

Testing for exceptions

Testing for exceptions is equally important. If the method is expected to throw an exception with certain arguments, test it! The implementation varies by language. Typically it looks something like this:

assertRaises(IllegalArgumentException, is_numeric, null)

The test case will only pass if the expected exception is thrown. If a different exception occurs or the method returns normally, the test fails, alerting you to the problem.
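For a concrete version, here is a minimal sketch in Python’s unittest. The is_numeric function is a hypothetical stand-in, and ValueError takes the place of IllegalArgumentException, which Python does not have.

```python
import unittest


def is_numeric(value):
    """Hypothetical function under test: rejects None, otherwise checks for digits."""
    if value is None:
        raise ValueError("value must not be None")
    return str(value).isdigit()


class TestIsNumeric(unittest.TestCase):
    def test_is_numeric_raises(self):
        # Pass the callable and its arguments separately so that
        # assertRaises, not the test body, makes the call.
        self.assertRaises(ValueError, is_numeric, None)

        # Equivalent form using a context manager.
        with self.assertRaises(ValueError):
            is_numeric(None)
```

Either form works; the context manager tends to read better when the call needs several arguments.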

Test Driven Development

I firmly believe all projects should use unit tests. This, however, is not what test driven development is.

Test driven development refers to the practice of writing unit tests before the method implementations. It may seem backwards at first, but there are a few key advantages.

  • Start thinking about the edge cases early
  • Forces the spec to be strictly defined in advance
  • Impossible to “forget” to write those unit tests 🙂

Side note: this is becoming more and more common in education. Students get instant feedback on how they’re progressing on an assignment and practice with industry methodology. Unit tests can easily be included in Jupyter Notebooks for the ultimate teaching tool.

Concrete Examples by Language

These are brief examples meant to serve as a reference.

For demonstration purposes, each illustration assumes you want to test a class called Palindrome. The class has a function called is_palindrome() that takes a single string argument and returns a boolean.

Unit Testing in Python

Python has dozens of unit testing libraries. We’re going to stick to the aptly named unittest because it’s built into Python. If you want an even simpler alternative, pytest may be worth a look.

The Palindrome class in palindrome.py.

class Palindrome():
    @staticmethod
    def is_palindrome(s):
        return s == s[::-1]

To create the unit test, start with a new file test.py.

import unittest
from palindrome import Palindrome as p

class TestPalindrome(unittest.TestCase):
    def test_is_palindrome(self):
        self.assertTrue(p.is_palindrome("racecar"))
        self.assertFalse(p.is_palindrome("thisisfalse"))
   
if __name__ == "__main__":
    unittest.main()

Simply run the test file to see the results.

$ python test.py
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK
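For comparison, the same check in pytest style might look like the sketch below. pytest collects plain functions whose names start with test_ and relies on bare assert statements. The Palindrome class is repeated only so the snippet stands alone, and the file name test_palindrome.py is my own choice.

```python
# test_palindrome.py (hypothetical file name; pytest collects files named test_*.py)


class Palindrome:
    @staticmethod
    def is_palindrome(s):
        return s == s[::-1]


def test_is_palindrome():
    # Plain assert statements replace assertTrue/assertFalse.
    assert Palindrome.is_palindrome("racecar")
    assert not Palindrome.is_palindrome("thisisfalse")
```

With pytest installed, running pytest in the same directory should discover and run the test.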

Unit Testing in Java

JUnit is the de facto standard in Java. First, you will need to add the JUnit dependency to the project. Here’s the quick version for Eclipse:

  1. Right click the Java project, open Project Properties
  2. Build Path > Configure Build Path
  3. Click the Libraries tab
  4. Click Add Library… on the right side
  5. Select JUnit from the list and hit next
  6. Ensure the newest version is selected in the drop down
  7. Hit Finish followed by OK

IntelliJ IDEs offer to include JUnit automatically when you start writing the test. Simply accept.

With the JUnit dependency added, create a new class called PalindromeTest. Note the special @Test annotation.

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
// Statically import the method to be tested
import static com.technohedge.example.Palindrome.isPalindrome;

public class PalindromeTest {
    @Test
    public void isPalindromeTest() {
        assertEquals(true, isPalindrome("racecar"));
        assertEquals(false, isPalindrome("abc"));
    }
}

Altman Z Score – Determining Bankruptcy Probability with QuantConnect

The Altman Z-Score is an indicator used to determine a company’s likelihood of declaring bankruptcy. A total of five ratios are necessary for the calculation. Lucky for us, they are all readily available for public companies.

The Formula

Let:
A = Working Capital / Total Assets
B = Retained Earnings / Total Assets
C = Earnings Before Interest and Taxes (EBIT) / Total Assets
D = Market Value of Equity / Total Liabilities
E = Sales / Total Assets

Then the Altman Z Score can be calculated by:
Z = 1.2A + 1.4B + 3.3C + 0.6D + 1.0E

The relative probability of default is determined by the Z value. Specifically,

Z ≥ 3 → Safe
1.81 ≤ Z < 3 → Warning
Z < 1.81 → Danger
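To make the arithmetic concrete, here is a small sketch of the formula and the cutoffs in plain Python. The function names and the balance sheet figures are made up for illustration; this is not the QuantConnect implementation.

```python
def altman_z_score(working_capital, retained_earnings, ebit,
                   market_value_equity, sales,
                   total_assets, total_liabilities):
    """Original Altman Z Score built from the five ratios A through E."""
    a = working_capital / total_assets
    b = retained_earnings / total_assets
    c = ebit / total_assets  # earnings before interest and taxes
    d = market_value_equity / total_liabilities
    e = sales / total_assets
    return 1.2 * a + 1.4 * b + 3.3 * c + 0.6 * d + 1.0 * e


def classify(z):
    """Map a Z value onto the three zones listed above."""
    if z >= 3:
        return "Safe"
    if z >= 1.81:
        return "Warning"
    return "Danger"


# Made-up figures (in millions), for illustration only.
z = altman_z_score(working_capital=200, retained_earnings=300, ebit=150,
                   market_value_equity=1200, sales=1000,
                   total_assets=1000, total_liabilities=600)
print(round(z, 3), classify(z))
```

With these sample numbers the ratios are 0.2, 0.3, 0.15, 2.0 and 1.0, so Z works out to 0.24 + 0.42 + 0.495 + 1.2 + 1.0 = 3.355, which lands in the safe zone.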

Note that these cutoffs are from the original Altman Z Score. Different intervals have been derived for emerging markets. More information is available on Wikipedia.

Algorithm

This algorithm is heavily based on code from Aaron Gilman. It has been updated to work with new versions of QuantConnect.

It works through universe selection. Universe selection allows us to filter equities based on predefined search criteria. In this case, it selects equities that have 1) all the necessary data available for calculating the ratios and 2) a Z Score greater than 1.81. Next, the results are sorted by EBITDA and capital is equally divided among the top 100 equities. The portfolio is re-balanced on the first trading day each month.
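The selection logic can be sketched without any QuantConnect API at all. The snippet below is plain Python over made-up data. The dictionary keys, the function name, and the choice to sort EBITDA from largest to smallest are my assumptions, not details taken from the algorithm itself.

```python
def select_universe(companies, z_cutoff=1.81, max_positions=100):
    """Return {ticker: weight} for the companies that pass the filter.

    `companies` is a list of dicts with hypothetical keys
    "ticker", "z_score" (None when data is missing) and "ebitda".
    """
    # 1) Keep companies with complete data and a Z Score above the cutoff.
    eligible = [c for c in companies
                if c["z_score"] is not None and c["z_score"] > z_cutoff]
    # 2) Sort by EBITDA (assumed largest first) and keep the top slice.
    eligible.sort(key=lambda c: c["ebitda"], reverse=True)
    chosen = eligible[:max_positions]
    # 3) Divide capital equally among the survivors.
    if not chosen:
        return {}
    weight = 1.0 / len(chosen)
    return {c["ticker"]: weight for c in chosen}


# Made-up sample data, for illustration only.
sample = [
    {"ticker": "AAA", "z_score": 3.4, "ebitda": 500},
    {"ticker": "BBB", "z_score": 1.2, "ebitda": 900},   # fails the Z cutoff
    {"ticker": "CCC", "z_score": None, "ebitda": 700},  # missing data
    {"ticker": "DDD", "z_score": 2.1, "ebitda": 800},
]
weights = select_universe(sample, max_positions=2)
print(weights)
```

In the real algorithm this filtering happens inside QuantConnect’s universe selection callbacks; the sketch only mirrors the filter, sort, and equal-weight steps.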

Historic Accuracy

In Altman’s initial publication, the Altman Z Score was 72% accurate in predicting bankruptcy within two years. False negatives, however, were extremely low at just 6%. This initial accuracy has not only been proven, but actually found to be a conservative estimate. Over the years, Altman’s model was found to be 80-90% accurate — but with a higher false negative rate of around 15%.

Today, Altman’s Z Score is widely accepted. Originally designed for manufacturing companies with over $1 million in assets, it’s now used in a variety of countries and industries, though sometimes with slight modifications.

Caveats

As with most balance sheet models, the Altman Z Score should not be applied to financial companies. The balance sheets of Wall Street companies are notoriously opaque and off-balance sheet items are numerous, making accurate calculations nearly impossible.

Jupyter Notebook: Getting Started and Installation

Jupyter Notebook provides a simple way to run code in Python, R, Scala and more. While it’s mainly used in research related fields, Jupyter can be applied to a wide variety of applications.

Jupyter is especially useful for two groups. Data scientists and machine learning experts benefit from modular execution: they can load a large data set once and try many different experiments and code changes, saving an enormous amount of time in the process. Academics make up the other group. Documenting work as you go could not be easier, and professors can auto-grade students’ work using tools like Vocareum, built to work with Jupyter.

Above is a screenshot directly from a notebook. It runs in a web interface and is quite easy to use. Markdown syntax can be added to create documentation blocks — much more expressive than the brief inline comments we programmers often get used to.

Installing Jupyter Notebook

Some prefer the native installation while others like to keep everything in a self-contained Docker container. I will outline both methods — choose the one that works best for you.

Native

Jupyter Notebook is easy to install with pip. I assume you already have Python and pip installed. If that’s the case, simply run

python3 -m pip install jupyter #for Python3
python -m pip install jupyter  #for Python2

Congratulations, that’s it! To run Jupyter, simply open up a new terminal in the directory you want the notebooks to be saved. Then type:

jupyter notebook

Your default browser should automatically open to the Jupyter instance.

Docker

This guide assumes Docker is already installed. If you’re unfamiliar with Docker, please check out their guide.

Using Jupyter with Docker is easy because an image is already maintained. Simply run a container with the following command:

docker run -it -v /path/to/jupyter/directory:/work --net=host --rm jupyter/all-spark-notebook

The -v flag is used to share a local directory with Docker. /path/to/jupyter/directory should point to a local directory where you want the notebooks to be saved. When inside the Jupyter instance, be sure to save everything inside the /work directory; anything saved elsewhere will not persist once the container is removed.

Once the Docker container is launched, a unique URL will be printed to the console. Copy and paste that into a web browser and you’re good to go! You may notice the host in the printed URL reads something like (127.0.0.1 or a seemingly random string of characters). If this is the case for you, be sure to substitute 127.0.0.1 for the host. For instance, open a web browser and go directly to the URL http://127.0.0.1:8888/?token=f19d2097c0e3455e3589c985b182b93a6c9f022612d9a2cc. Your token will be different!

Your first Jupyter project

With Jupyter successfully installed, your screen should be similar to the one pictured below.

Once Jupyter Notebook is installed and launched, we can create our first actual notebook. If you installed with Docker, be sure to click on the work directory first. Then click the New dropdown, then select Python 3 (or a different language if you prefer).

All that remains is filling it with content. Each content block is a cell. There are two main types of cells: code and Markdown.

Let’s create a new code cell.

Run the code in a cell by clicking the play button with the cell selected, or by hitting ctrl + enter on the keyboard.

Now, let’s demonstrate one of the main benefits of Jupyter. Say we need to load data, which takes a lot of time. If this was a standard Python script, the data would have to be loaded during each subsequent run. This is not the case with Jupyter. We can simply put the data loading code into its own cell.

Saving & Checkpoints

To save your work, simply click the Save & Checkpoint button.

As the name suggests, manually saving also creates a checkpoint. Checkpoints are a form of basic version control — you can easily roll back to any checkpoint later on. Work in a notebook will also be periodically auto-saved, but checkpoints must be created manually.

To share your work with someone else, simply send them the .ipynb file. They can launch the file using their own Jupyter installation and pick up right where you left off. While you can use Jupyter with a version control system (like Git or Mercurial) by checking in the .ipynb files, it’s not easy to see the individual code changes later. This is my single biggest complaint when using Jupyter. If anyone has found a solution, I would love to hear it!
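One partial workaround, offered as a sketch rather than a full solution: an .ipynb file is JSON, so the code cells can be pulled out as plain text and diffed like any other source. The helper name extract_code_cells is hypothetical, and the snippet assumes the version 4 notebook layout, where a top-level cells list holds entries with cell_type and source fields.

```python
import json


def extract_code_cells(notebook_json):
    """Return the source of every code cell in a notebook, in order."""
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


# A tiny hand-written notebook (illustrative only, not a complete .ipynb file).
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["x = 1\n", "print(x)"], "outputs": []},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})
print(extract_code_cells(example))
```

Writing the extracted cells to a text file next to the notebook gives the version control system something readable to compare, though outputs and Markdown cells are ignored here.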

What is Node.JS, really?

You’ve heard of Node.JS (probably) but what exactly is it? Should you care? There has been a lot of buzz around Node.JS lately — and there’s traction to back it up. Some major companies have adopted the framework including PayPal, LinkedIn, Netflix, Uber, eBay, and many more.

As the name implies, Node.JS is powered by JavaScript. In brief, it’s an event driven framework that competes with the likes of PHP, Django, and other web technologies.

What is it used for?

Really, Node.JS can be used for just about anything, from dev tools to production deployments. Since Node runs server-side JavaScript, it’s just as capable as any other language. Support for reading ports and files, functionality that is usually restricted when running in the browser, is now available at your fingertips.

Realistically speaking, when people refer to Node.JS, they typically mean Node.JS + Express, essentially a web server. The entire stack can be handled this way, removing the need for Apache or NGINX. This framework can be a great choice for real-time applications and building a custom API.

What makes it so popular?

JavaScript is arguably the most popular language on the planet. Many programmers have a basic familiarity with the language; but the benefits don’t end there. Over 73% of websites rely on JavaScript for important functionality. It’s used to create the beautiful, interactive experiences we’ve grown accustomed to. Traditionally, this creates a divide between front end and back end development. While back end developers learn PHP, Java, or, gasp, C, front end developers learn JavaScript, HTML, and CSS. Not with Node.JS. Both front end and back end development can be done entirely with JavaScript. This simplifies the stack and eliminates impedance mismatch.

The need… the need for speed! Node relies on Chromium’s V8 engine. This means the JavaScript doesn’t stay as raw (and potentially slow) JavaScript. Instead, it’s compiled into machine code, much like C would be. This has huge implications for both performance and efficiency of the application. An uncorroborated post claims Walmart’s overall CPU usage never exceeded 1% after switching to Node.JS, even with over 200 million daily users.

A thriving community. Community support is truly top notch. Tutorials, guides, and troubleshooting information are available in abundance. The package manager, NPM, is also top notch. Tracking and installing project dependencies could not be easier. Want bootstrap? Easy, npm install bootstrap. Similar to pip’s requirements.txt, you can create a package.json file outlining all the dependencies. Once complete, a simple npm install will ensure everything is ready to go.

Those of you who prefer NoSQL-like databases can rejoice. MongoDB (and similar) are commonly used within a Node application and support is prolific. Object Role Modeling is quickly becoming the preferred method to develop in Node.JS, but not to worry: those who prefer standard relational databases have plenty of support too.

What’s the catch?

Node.JS is heavily event driven. I consider this both a pro and a con. Event driven programming (more on this in the next section) can be tricky at first and bugs can be hard to track down.

JavaScript doesn’t have a standard library. Sure, there are community packages for just about anything, but there’s rarely just one package for a job; more often there are six or more. Choice is not always good: with six ways to do things, there are often five ways to do it incorrectly. The default packages included with Node.JS can even be replaced if you’re unsatisfied.

Production environments are much more complex than standard Apache/NGINX setups. Error handling is essential, since just one unhandled bug will crash the entire process. To utilize multi-core systems, one server process should be started for every core. This necessitates a local load balancer to share the same port and a method to cluster the separate instances.

Can we address this “event driven” thing?

This is best illustrated by analogy. Dan York has an excellent article explaining the event driven model. In his post he compares the situation to ordering fast food. In a traditional thread based model, one person would get to the front of the line, place an order and stand around waiting until his food was prepped, holding up everyone behind him. In contrast, in an event based model the customer would order the food, then step aside until he’s notified that his order is up. This way, the patron behind him can place an order immediately.

Here’s some pseudo code to illustrate the point. A traditional thread based model might look something like this:

var currentUser = db.getUser(userId);
console.log(currentUser);
doSomethingElse();

In contrast, an event driven model uses callbacks.

db.getUser(userId, function(user, error){
    if (!error) {
        console.log(user);
    } else {
        console.log(error.message);
    }
});
doSomethingElse();

The anonymous callback function is called only after the user information is retrieved from the database. In the event driven model, it’s likely doSomethingElse() will be executed before logging the user information. In the thread based model, of course, this would never happen. We’re stuck “waiting in line” for the database call (the thread blocks) before continuing with the program’s execution.
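The same ordering can be reproduced in a runnable form. The sketch below uses Python’s asyncio instead of Node’s callbacks, so treat it as an analogy rather than Node code. The get_user coroutine is a hypothetical stand-in for a database call, with a short sleep playing the part of network latency.

```python
import asyncio

log = []


async def get_user(user_id):
    """Hypothetical database call; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return {"id": user_id, "name": "example"}


def do_something_else():
    log.append("did something else")


async def main():
    # Start the lookup without waiting for it to finish.
    task = asyncio.create_task(get_user(42))
    # Attach a callback that fires once the result is ready.
    task.add_done_callback(lambda t: log.append(f"got user {t.result()['id']}"))
    # This runs immediately, while the lookup is still pending.
    do_something_else()
    await task
    # Yield once more so any scheduled callback has had a chance to run.
    await asyncio.sleep(0)


asyncio.run(main())
print(log)
```

The log shows the “something else” entry before the user entry, which is the step-aside-and-wait behavior from the fast food analogy.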

Do you plan on using Node.JS for your next project? Wish your company would make the switch? I’d love to hear your thoughts!

Building your first algorithm in QuantConnect (Python)

This post will guide you through developing your very own trading algorithm in QuantConnect. Familiarity with Python and basic finance knowledge is assumed, but I’ll be gentle, promise! Already an expert? Skip to the code.

More comfortable with C#? View the alternate tutorial.

The algorithm we’ll build is based on the principle of a proportionated simple moving average (P-SMA). We will choose a benchmark (SPY in this example) and, based on its simple moving average, decide if the market will go up or down. If we predict the market will go up, we will invest in equities that provide fast growth but increased risk. Otherwise, we invest in safe assets, such as treasury bonds. Proportionated means the decision is not binary. For example, we may calculate 30% of our portfolio should be relatively risk-less and allocate 70% for high growth equities.

Time to code! Any algorithm in QuantConnect starts the same way:

class ProportionalSMAFast(QCAlgorithm):
    def Initialize(self):
        pass

First, we define the class. The name can be anything you like, but it’s important to extend QCAlgorithm. Whenever an algorithm is started, Initialize is called exactly once and allows us to set up the properties of our algorithm. Let’s begin to flesh out Initialize.

def Initialize(self):
    self.SetCash(10000)

    # Leading zeros such as 01 are a syntax error in Python 3
    self.SetStartDate(2016, 1, 1)
    self.SetEndDate(2016, 10, 14)

    # Add all assets you plan on using later
    self.spy = self.AddEquity("SPY", Resolution.Daily).Symbol
    self.qqq = self.AddEquity("QQQ", Resolution.Daily).Symbol
    self.tlt = self.AddEquity("TLT", Resolution.Daily).Symbol
    self.agg = self.AddEquity("AGG", Resolution.Daily).Symbol

    self.benchmark = self.spy

    self.risk_on_symbols = [self.spy, self.qqq]
    self.risk_off_symbols = [self.tlt, self.agg]

Methods such as SetCash, SetStartDate, and SetEndDate are only applicable when running a back-test. They are completely ignored during live trading.

AddEquity is essential to any algorithm you write. By adding the equity in the initialize method, the relevant equity data will be made available throughout your algorithm. Resolution.Daily specifies data will be given with a daily window. Other options are tick, second, minute, and hour.

So what about the .Symbol and the assignment to a member variable such as self.spy? Why assign the variable? This is not strictly necessary. Specifically, the following code is all that is required to make the equity data available.

self.AddEquity("SPY", Resolution.Daily)

By assigning self.spy, we can prevent hard coding the string "SPY" everywhere and use the variable instead. If you’re a little confused about this point, don’t worry. It will become more apparent later on.

And the final two lines? Remember, we want to invest in either high growth or low risk assets depending on the market. risk_on_symbols will be invested when we want to add risk to our portfolio — predicting an upswing. risk_off_symbols are our low risk investments. You should feel free to experiment with different symbols. You can add as many or as few equities as you like to either list.

#Schedule every day SPY is trading
self.Schedule.On(self.DateRules.EveryDay(self.benchmark),
                 self.TimeRules.AfterMarketOpen(self.benchmark, 10),
                 Action(self.EveryDayOnMarketOpen))

The snippet above will complete our Initialize method. This is the main driver of your algorithm. It schedules a method called EveryDayOnMarketOpen to run every day that SPY (our benchmark) is trading, 10 minutes after market open.

Since setup is over with, let’s move on to the heart of the algorithm by defining EveryDayOnMarketOpen.

def EveryDayOnMarketOpen(self):
    #Do nothing if outstanding orders exist
    if self.Transactions.GetOpenOrders():
        return

Nothing groundbreaking here. We just return immediately if there are any open orders. In theory, this should never happen. Our algorithm will submit market orders 10 minutes after market open, and is run once per trading day. If this block does execute, it’s likely an indicator of a more serious, underlying problem. Nevertheless, better safe than sorry.

#Lookup last 84 days
slices = self.History(self.spy, 84)
#Get close of last (yesterday's) slice
spy_close = slices["close"][-1]

#Get mean over last 21 days
spy_prices_short = slices["close"][-21:]
spy_mean_short = spy_prices_short.mean()

#Get mean over last 84 days
spy_prices_long = slices["close"]
spy_mean_long = spy_prices_long.mean()

The self.History method returns a pandas data frame, representing data on the specified equity for the previous 84 days. Our algorithm compares moving averages over two different window sizes, 21 and 84 days. These are arbitrary (but common) intervals. I encourage you to experiment by changing these values. The last two blocks of the snippet slice the last 21 and 84 closing prices from the data frame and calculate the average.
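If you want to play with the window sizes outside of QuantConnect, the averaging step can be sketched with a plain list of closing prices. The function name and the prices are made up; only the 21 and 84 day windows come from the algorithm.

```python
from statistics import mean


def short_and_long_means(closes, short_window=21, long_window=84):
    """Average the most recent closes over two window sizes.

    `closes` is ordered oldest to newest.
    """
    if len(closes) < long_window:
        raise ValueError("not enough history for the long window")
    short_mean = mean(closes[-short_window:])
    long_mean = mean(closes[-long_window:])
    return short_mean, long_mean


# Made-up prices: 63 days at 100 followed by 21 days at 110.
closes = [100.0] * 63 + [110.0] * 21
short_mean, long_mean = short_and_long_means(closes)
print(short_mean, long_mean)
```

Here the short average (110.0) sits above the long average (102.5), the kind of gap the algorithm reads as an upswing.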

risk_on_pct  = (spy_mean_short/spy_close) * \
               ((spy_mean_short *2 / spy_mean_long) *.25) / \
               len(self.risk_on_symbols)
risk_off_pct = (spy_close/spy_mean_short) * \
               ((spy_mean_long *2 / spy_mean_short) *.25) / \
               len(self.risk_off_symbols)

#Submit orders
for sid in self.risk_on_symbols:
    self.SetHoldings(sid, risk_on_pct)
for sid in self.risk_off_symbols:
    self.SetHoldings(sid, risk_off_pct)

Finally, the exciting stuff! The “risk on” and “risk off” percentages are calculated using our history data. self.SetHoldings will allocate a percentage of your portfolio to the specified equity. For instance, self.SetHoldings("SPY", 1) will buy as much SPY as you can afford, 100% of your portfolio. If you have a margin account and want to leverage your position, simply allocate more than 100%. self.SetHoldings("SPY", 2) will buy twice as many SPY shares as you can actually afford.
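To get a feel for the numbers, the percentage arithmetic can be pulled into a standalone function and fed sample values. This is only a sketch: the function name and the prices are made up, while the formula is copied from the snippet above.

```python
def proportioned_weights(close, mean_short, mean_long, n_risk_on, n_risk_off):
    """Per-symbol weights using the same arithmetic as the algorithm."""
    risk_on_pct = ((mean_short / close)
                   * ((mean_short * 2 / mean_long) * 0.25)
                   / n_risk_on)
    risk_off_pct = ((close / mean_short)
                    * ((mean_long * 2 / mean_short) * 0.25)
                    / n_risk_off)
    return risk_on_pct, risk_off_pct


# Flat market: price and both averages agree, so the split is even.
flat_on, flat_off = proportioned_weights(100.0, 100.0, 100.0, 2, 2)
print(flat_on, flat_off)

# Made-up uptrend: the short average sits above the long one.
up_on, up_off = proportioned_weights(110.0, 110.0, 100.0, 2, 2)
print(round(up_on, 4), round(up_off, 4))
```

In the flat case each of the four symbols gets 25%. In the uptrend case the risk on symbols get about 27.5% each and the risk off symbols about 22.7% each. Note that the four weights then add up to slightly more than 100%, so the formula does not guarantee an exact total.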

That’s it! You now have an algorithm that can trade automatically on your behalf. I encourage you to experiment changing/improving the algorithm on your own.

This example is also available on GitHub.

Building your first algorithm in QuantConnect (C#)

This post will guide you through developing your very own trading algorithm in QuantConnect. Familiarity with C# and basic finance knowledge is assumed, but I’ll be gentle, promise! Already an expert? Skip to the code.

More comfortable with Python? View the alternate tutorial.

The algorithm we’ll build is based on the principle of a proportionated simple moving average (P-SMA). We will choose a benchmark (SPY in this example) and, based on its simple moving average, decide if the market will go up or down. If we predict the market will go up, we will invest in equities that provide fast growth but increased risk. Otherwise, we invest in safe assets, such as treasury bonds. Proportionated means the decision is not binary. For example, we may calculate 30% of our portfolio should be relatively risk-less and allocate 70% for high growth equities.

Time to code! Any algorithm in QuantConnect starts the same way:

namespace QuantConnect.Algorithm.CSharp{
    public class ProportionalSimpleMovingAverage : QCAlgorithm{
        public override void Initialize(){
            return;
        }
    }
}

First, we define the class. The name can be anything you like, but it’s important to extend QCAlgorithm. Whenever an algorithm is started, Initialize is called exactly once and allows us to set up the properties of our algorithm. Next, we’ll define some member variables. These should go immediately before Initialize, inside the class.

...
private static Symbol _spy = QuantConnect.Symbol.Create("SPY", SecurityType.Equity, Market.USA);
private static Symbol _qqq = QuantConnect.Symbol.Create("QQQ", SecurityType.Equity, Market.USA);
private static Symbol _tlt = QuantConnect.Symbol.Create("TLT", SecurityType.Equity, Market.USA);
private static Symbol _agg = QuantConnect.Symbol.Create("AGG", SecurityType.Equity, Market.USA);
   
private Symbol _benchmark = _spy;
   
private List<Symbol> _risk_on_symbols = new List<Symbol>{
    _spy,
    _qqq
};  
private List<Symbol> _risk_off_symbols = new List<Symbol>{
    _tlt,
    _agg
};  
   
private RollingWindow<decimal> _close_window;

public override void Initialize(){
    ...

When coding in C#, a common convention is to denote member variables with an underscore. The first four lines create the “symbol”, a reference to specify the desired equity. Next, we define our “benchmark”. This is the equity that will be used as the basis for all future calculations. Moving on to the lists: remember, we want to invest in either high growth or low risk assets depending on the market. _risk_on_symbols will be invested when we want to add risk to our portfolio, predicting an upswing. _risk_off_symbols are our low risk investments. You should feel free to experiment with different symbols. You can add as many or as few equities as you like to either list. The final member variable is _close_window. This is a rolling window, a special list that will only keep a fixed number of the most recent elements. Specifically, we’ll be using a window size of 84 and storing decimal numbers (the daily closing price of the benchmark equity). Because the window is not static, we declare it here but instantiate it later inside the Initialize method.

Let’s begin to flesh out initialize.

public override void Initialize(){
    SetStartDate(2016, 01, 01);
    SetEndDate(2016, 10, 14);
    SetCash(10000);

    AddEquity(_spy, Resolution.Daily);
    AddEquity(_qqq, Resolution.Daily);
    AddEquity(_tlt, Resolution.Daily);
    AddEquity(_agg, Resolution.Daily);
}

Methods such as SetCash, SetStartDate, and SetEndDate are only applicable when running a back-test. They are completely ignored during live trading.

AddEquity is essential to any algorithm. By adding the equity in the initialize method, the relevant information will be made available throughout the algorithm. Resolution.Daily specifies data will be given with a daily window. Other options are tick, second, minute, and hour.

_close_window = new RollingWindow<decimal>(84);
IEnumerable<TradeBar> slices = History(_benchmark, 84);
foreach(TradeBar bar in slices){
    _close_window.Add(bar.Close);
}

The code here is responsible for initializing the rolling window. The first line creates the window and sets the type and size to decimal and 84 respectively. Next the History function is called to get historical information about the benchmark during the last 84 days. Finally, we loop through every day of historic data and add the closing price to the rolling window. This ensures the algorithm will always have 84 days of information to use for calculations.

Schedule.On(DateRules.EveryDay(_benchmark),
        TimeRules.AfterMarketOpen(_benchmark, 10),
        EveryDayOnMarketOpen);

The snippet above will complete our Initialize method. This is the main driver of your algorithm. It schedules a method called EveryDayOnMarketOpen to run every day that SPY (the benchmark) is trading, 10 minutes after market open.

Next, we build a simple function to assist with calculating averages over various window sizes.

private decimal GetRollingAverage(int n, RollingWindow<decimal> window){
    decimal sum = 0;
    for (int i = 0; i < n; i++){
        sum += window[i];
    }  
       
    return sum / n;
}

This function accepts two parameters: an integer n, the number of days to look back, and a RollingWindow over which to average. The for loop sums the first n entries of the window. The sum is then divided by n and the simple average is returned.
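The same calculation as a small Python sketch, for readers who want to check the arithmetic by hand. The function name, the sample prices and the bounds check are additions for illustration; the C# version above does not validate n.

```python
def rolling_average(n, window):
    """Average the first n entries of window (newest-first order assumed)."""
    # Guard added for this sketch only; the C# method has no such check
    if n <= 0 or n > len(window):
        raise ValueError("n must be between 1 and len(window)")
    total = 0.0
    for i in range(n):
        total += window[i]
    return total / n

# Hypothetical closing prices, newest first
closes = [13.0, 12.0, 11.0, 10.0]
short_avg = rolling_average(2, closes)  # average of the 2 newest closes
long_avg = rolling_average(4, closes)   # average of all 4 closes
print(short_avg, long_avg)
```

When prices have been rising, the short average sits above the long one, which is the signal the rest of the algorithm builds on.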

Since setup is over with, let’s move on to the heart of the algorithm by defining EveryDayOnMarketOpen.

public void EveryDayOnMarketOpen(){
    if (Transactions.GetOpenOrders().Count > 0){
        return;
    }  
}

Nothing groundbreaking here. We just return immediately if there are any open orders. In theory, this should never happen. Our algorithm will submit market orders 10 minutes after market open, and is run once per trading day. If this block does execute, it’s likely an indicator of a more serious, underlying problem. Nevertheless, better safe than sorry.

IEnumerable<TradeBar> slices = History(_benchmark, 1);
TradeBar last_bar = slices.Last();
decimal bench_close = last_bar.Close;
       
_close_window.Add(bench_close);
       
decimal bench_mean_short = GetRollingAverage(21, _close_window);
decimal bench_mean_long = GetRollingAverage(84, _close_window);

The History method returns TradeBars, representing data on the specified equity. We use the History function to get yesterday’s closing price and add it to the rolling window. The final two lines calculate the average closing price over the specified period. Our algorithm compares moving averages over two different window sizes, 21 and 84 days. These are arbitrary (but common) intervals. I encourage you to experiment by changing these values. Note that if you want to look back past 84 days the rolling window size will need to be increased in the Initialize function. QuantConnect does offer convenience methods to calculate various indicators to use in your projects, which makes code easier to read.

decimal risk_on_pct = (bench_mean_short / bench_close) *
                        ((bench_mean_short * 2m / bench_mean_long) * .25m) /
                        _risk_on_symbols.Count;
decimal risk_off_pct = (bench_close / bench_mean_short) *
                        ((bench_mean_long * 2m / bench_mean_short) * .25m) /
                        _risk_off_symbols.Count;

foreach (Symbol sid in _risk_on_symbols){
    SetHoldings(sid, risk_on_pct);
}  
foreach (Symbol sid in _risk_off_symbols){
    SetHoldings(sid, risk_off_pct);
}

Finally, the exciting stuff! The “risk on” and “risk off” percentages are calculated using our history data. SetHoldings will allocate a percentage of your portfolio to the specified equity. For instance, SetHoldings(_spy, 1) will buy as much SPY as you can afford, 100% of your portfolio. If you have a margin account and want to leverage your position, simply allocate more than 100%. SetHoldings(_spy, 2) will buy twice as many SPY shares as you can actually afford.
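To get a feel for what the two formulas produce, here is the same arithmetic as a Python sketch with made-up prices. The function name and sample values are hypothetical, and plain floats stand in for the C# decimal type, so tiny rounding differences are possible.

```python
def allocation_pcts(close, mean_short, mean_long, n_on, n_off):
    """Mirror the risk_on_pct / risk_off_pct formulas from the C# snippet."""
    risk_on = (mean_short / close) * ((mean_short * 2 / mean_long) * 0.25) / n_on
    risk_off = (close / mean_short) * ((mean_long * 2 / mean_short) * 0.25) / n_off
    return risk_on, risk_off

# Flat market: close and both averages are equal
flat_on, flat_off = allocation_pcts(100.0, 100.0, 100.0, 2, 2)

# Short-term average above the long-term average
up_on, up_off = allocation_pcts(100.0, 102.0, 98.0, 2, 2)

print(flat_on, flat_off)
print(up_on, up_off)
```

In the flat case each symbol gets 25% of the portfolio, so the four holdings add up to 100%. When the short average rises above the long average, the weights tilt toward the risk-on symbols.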

That’s it! You now have an algorithm that can trade automatically on your behalf. I encourage you to experiment with changing and improving the algorithm on your own.

This example is also available on GitHub.

A Hands-On Introduction to Machine Learning

First, let me begin by setting some expectations. This is not a guide for the hardcore ML researchers out there. This is meant to be a practical introduction to machine learning that any computer scientist can follow, without much prior knowledge of the ML domain. I feel that there are many guides that focus solely on the academics of ML, but neglect to mention how simple it is to apply towards real life applications. Even naive approaches are often surprisingly effective.

To get started, we need to install scikit-learn and other dependencies.

pip install numpy scipy scikit-learn

For simple implementations like we’ll see today, most of the challenge revolves around data prep. Our first step is to download the JSON training data from Kaggle. If you don’t already have an account, you’ll need to create one now. Once the account is created, visit the “What’s Cooking?” competition page and select the data tab to access the downloads.

Next, we need to format the data. A CSV format will be used, with each column representing a different ingredient and each row a single recipe. In ML lingo, the ingredients are features, the data we use as a basis for the predictions. Labels, the answer to each recipe, will be generated similarly. There’s nothing too novel with the code here, just some data wrangling.
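Before looking at the full script, a tiny sketch of the target format may help. The two recipes below are invented, and sorting the columns is a choice made here for readability; the real script uses whatever order the set yields.

```python
# Two made-up recipes in the same shape as the Kaggle JSON
recipes = [
    {"cuisine": "italian", "ingredients": ["tomato", "basil", "garlic"]},
    {"cuisine": "mexican", "ingredients": ["tomato", "lime"]},
]

# One column per distinct ingredient across all recipes
columns = sorted({ing for r in recipes for ing in r["ingredients"]})

# One row per recipe: 1 if the recipe uses the ingredient, else 0
rows = [[1 if ing in r["ingredients"] else 0 for ing in columns]
        for r in recipes]

# Labels are kept in the same order as the rows
labels = [r["cuisine"] for r in recipes]

print(columns)
for row, label in zip(rows, labels):
    print(",".join(str(v) for v in row), label)
```

With thousands of distinct ingredients in the real data, each row is mostly zeros, which is why the feature files end up large.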

Create a file called parser.py with the following code:

#!/usr/bin/env python3

# Enable Python 2 compatibility
from __future__ import print_function

import json

def main():
    # Define input/output file names
    train_file = "train.json"
    test_file = "test.json"
    train_file_out = "train.csv"
    test_file_out = "test.csv"
    train_file_out_labels = "train-labels.txt"
    json_data = None
    with open(train_file, 'r') as f:
        json_data = f.read()
    train_obj = json.loads(json_data)

    # Empty arrays to hold information
    labels_train = []
    labels_test = []
    # ingredients is defined as a set to prevent duplicates
    ingredients = set()

    # Write one label per line and simultaneously build the
    # exhaustive set of all ingredients seen in the training data
    with open(train_file_out_labels, 'w') as f:
        for recipe in train_obj:
            label = recipe["cuisine"]
            print(label, file=f)
            labels_train.append(label)
            for ingredient in recipe["ingredients"]:
                ingredients.add(ingredient)

    with open(test_file, 'r') as f:
        json_data = f.read()
    test_obj = json.loads(json_data)

    # The test file may introduce ingredients not included in training set
    # This ensures they're included
    for recipe in test_obj:
        for ingredient in recipe["ingredients"]:
            ingredients.add(ingredient)

    # Transform set to list to ensure iteration order is constant
    ingredients_list = list(ingredients)

    # Generate the CSV files
    generate_csv_for_each_recipe(ingredients_list, train_obj, train_file_out)
    generate_csv_for_each_recipe(ingredients_list, test_obj, test_file_out)

def generate_csv_for_each_recipe(ingredients_list, json_obj, output_file):
    """
    Creates an output CSV file with each ingredient being a column
    and each recipe a row. A cell is 1 if the recipe includes that
    ingredient, else 0

    ingredients_list -- the full list of ingredients (without duplicates)
    json_obj -- the json object with recipes returned by json.loads
    output_file -- the name of the generated CSV file
    """

    # Loop through each recipe
    with open(output_file, 'w') as f:
        for recipe in json_obj:
            rl = set()
            first = True
            s = ""
            for ingredient in recipe["ingredients"]:
                rl.add(ingredient)
            # This builds the csv row of ingredients for current recipe
            # Add 1 for ingredient if included in recipe; else 0
            for j in ingredients_list:
                # Don't prepend "," for first item
                if not first:
                    s += ","
                else:
                    first = False

                # Add 1 or 0 as explained above
                if j not in rl:
                    s += "0"
                else:
                    s += "1"
            print(s, file=f)

if __name__ == "__main__":
    main()

Create a new python script called train.py. Add some simple imports and variables that will prove useful later.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

filename = "train.csv"
label_file = "train-labels.txt"
test_file = "test.csv"
prediction_output = "predictions.txt"

Load the data files generated previously with the parser script.

# Load the training features into a np array
features = np.loadtxt(filename, delimiter=',', dtype=np.uint8)
# Load the labels
with open(label_file) as f:
    labels = f.readlines()
# Strip any new line characters or extra spaces
labels = [x.strip() for x in labels]
# Convert to np array
labels = np.asarray(labels)

The next step is to split the training data into a training set and a testing set. This will allow us to estimate how well the classifier does on the “real” test data. Remember, the testing data from Kaggle does not include the answers (labels). To have an easy way to see how well we’re doing, it’s necessary to split the data we do have answers for. It is not okay to test with the same features used in training, because the accuracy will be artificially high. Scikit-learn includes a handy function to split the data for us.

# Split data up into training and test data
X_train, X_test, y_train, y_test = train_test_split(features, labels)
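To make the idea concrete, here is a rough standard-library sketch of what a shuffle-and-split does. It is not scikit-learn’s implementation. The function name, the seed and the toy data are assumptions made for this illustration, and the 25% hold-out mirrors what I understand to be train_test_split’s default.

```python
import random

def simple_split(features, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices, then hold out a fraction of rows for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    # Features and labels are selected with the same indices,
    # so every row stays paired with its own label
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: 8 rows, each labelled by the parity of its single feature
features = [[i] for i in range(8)]
labels = ["even" if i % 2 == 0 else "odd" for i in range(8)]

X_train, X_test, y_train, y_test = simple_split(features, labels)
print(len(X_train), "training rows,", len(X_test), "held-out rows")
```

The important property is that no row appears in both sets, so the score on the held-out rows is an honest estimate.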

Instantiate and train the classifier with the split data set created from the last step. Selecting the best classifier is beyond the scope of this article, so we’ll just use the Logistic Regression classifier in this example, which performs pretty well. Scikit has an example testing different classifiers, if you want to explore.

print("Starting training...")
clf = LogisticRegression()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Model has accuracy of " + str(score * 100) + "%")
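The score reported here is plain accuracy: the fraction of held-out recipes whose predicted cuisine matches the true label. A minimal sketch of that calculation, with an invented helper name and invented predictions:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical predictions versus true labels: 3 of 4 match
predicted = ["italian", "thai", "thai", "greek"]
actual = ["italian", "thai", "greek", "greek"]

acc = accuracy(predicted, actual)
print("Model has accuracy of " + str(acc * 100) + "%")
```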

Let’s use the same classifier to make predictions over the Kaggle test set, the one we don’t know the answers to. We’ll format this as simply one prediction per line.

print("Predicting over the Kaggle test set")
test_data = np.loadtxt(test_file, delimiter=',', dtype=np.uint8)
predictions = clf.predict(test_data)

with open(prediction_output, "w") as f:
    for prediction in predictions:
        print(prediction, file=f)

The last script we’ll write takes the predictions created in the last step and formats them in the specific way Kaggle expects. This will let us see how we performed against other solutions to the What’s Cooking challenge.

Create a new script called kaggle.py. As usual, import the required modules and define a few helpful variables.

#!/usr/bin/env python3

from __future__ import print_function
import json

predict_file = "predictions.txt"
test_file = "test.json"
output_file = "kaggle.csv"

Read the prediction file

with open(predict_file) as f:
    labels = f.readlines()
labels = [x.strip() for x in labels]

Open the Kaggle test file and parse as JSON

with open(test_file, 'r') as f:
    json_data = f.read()
obj = json.loads(json_data)

Open the output file for writing and format as the Kaggle spec requires.

with open(output_file, "w") as out:
    # Print CSV headers
    print("id,cuisine", file=out)

    i = 0
    # Iterate through each recipe in the test file
    # Follow the spec in CSV format,
    # the recipe id followed by the cuisine prediction
    for recipe in obj:
        idx = recipe["id"]
        # labels holds predicted cuisines, one per recipe, in file order
        cuisine = labels[i]
        print(str(idx) + "," + cuisine, file=out)
        i += 1
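The loop above relies on predictions.txt having exactly one line per recipe, in the same order as test.json. Below is a sketch of the same formatting step with an explicit length check, using the standard csv module and an in-memory buffer instead of real files. The function name, ids and cuisines are made up for the example.

```python
import csv
import io

def format_submission(recipes, predictions, out):
    """Write 'id,cuisine' rows, refusing mismatched input lengths."""
    if len(recipes) != len(predictions):
        raise ValueError("expected exactly one prediction per recipe")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "cuisine"])
    for recipe, cuisine in zip(recipes, predictions):
        writer.writerow([recipe["id"], cuisine])

# Hypothetical stand-ins for test.json and predictions.txt
recipes = [{"id": 101}, {"id": 102}]
predictions = ["italian", "mexican"]

buffer = io.StringIO()
format_submission(recipes, predictions, buffer)
submission = buffer.getvalue()
print(submission)
```

Failing loudly on a length mismatch is cheaper than discovering a shifted submission after uploading it.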

To see how well you did, submit the generated kaggle.csv file to the Kaggle competition.

The complete code is available on GitHub.

For such a naive solution, we did pretty well here, successfully classifying over 77% of the recipes. There is, of course, room for improvement. It’s unlikely you’ll top the leaderboard with ready-made classifiers, but it’s close enough for many real-life problems and an excellent start to a future in machine learning.

What is QuantConnect? A review.

Anyone with a strong interest in finance will eventually hear of “quants” — the mystical math prodigies behind many of the trades on Wall Street today. While Quantitative Analysts come in a few varieties, I find algorithmic traders the most exciting. They’re responsible for crafting sophisticated computer programs to automatically trade equities, currencies, futures, and really anything and everything else that can turn a profit. Traditionally, having the resources to trade live — with real money — has been restricted to multi-million (or billion) dollar hedge funds. QuantConnect is a platform that allows anyone to create live trading algorithms.

This means you can connect your Interactive Brokers/E-Trade/Gdax/whatever account to QuantConnect and build a simple program to trade on your behalf. Hell, you might create the next cutting-edge innovation, license it to a company for millions, and retire in the Caribbean.

So, without further ado: a hands-on review of QuantConnect.

Interface

Three languages are available: C#, Python, and F#

The online IDE is intuitive. Upon starting an algorithm, three programming options are available: C#, Python, and F#. Initially, I felt the abundance of language options to be frivolous, but now I can’t imagine any other way. Find a great library in C# you’d like to add? Want to code some ML with SKLearn in Python? No problem. What really stands out is the ability to add multiple files. This is an enormous help when developing complex algorithms or importing third party libraries. It replicates a more natural coding environment without the usual limitations of an online interface.

Must-haves such as syntax highlighting and autocomplete work brilliantly. Most of the time the prompt is automatic. Just like other IDEs, it can be manually invoked with Ctrl + Space.

Autocomplete works as expected

Still not a fan of developing in a web browser? Since the framework that runs QC is open source, it’s a breeze to get it running locally so you can use Visual Studio (or Vim). That’s a good thing, because you’ll probably need it. As capable as the system is, it falls short on debugging. There is no interactive debugger, so you’ll be stuck relying on error messages that look like this:

BacktestingRealTimeHandler.Run(): There was an error in a
scheduled event EveryDay: SPY: 10 min after MarketOpen. The error was
Python.Runtime.PythonException: NameError : global name 'foobar' is not
defined

Notice something missing? There’s no line number, leaving you to guess exactly where the flaw lies. I should mention that this is only a problem with runtime errors. Any problems during compilation will be flagged and the line number identified.

Algorithms can be shared with email or account name

Collaborating with a friend? QC makes this a breeze. Simply hit share and you can work in tandem. Want to rename, move, or delete a project? All actions are as expected.

Overall, the interface does its job very well.

Speed

There are two important aspects to “speed” during algorithm development: the time a backtest takes to complete, and the delay before receiving live data.

Backtesting speed is comparable to any other platform. You might expect to see large differences between an algorithm programmed in C# and an algorithm in Python. Empirical evidence suggests this is not the case. Runtimes were nearly identical across languages.

Language | Average Time
C# | 6.183333333
Python | 6.76666666

Occasionally, there has been a wait time before starting a backtest; indicated in the results chart. This can be very annoying when trying to meet a deadline — but in my experience it’s a rare enough occurrence to not be a concern.

Results tab displays a wait time of 12

QuantConnect advertises co-located servers with minimal delay. A post from the founder on Reddit suggests speeds between 20 and 100ms. This may not be directly competitive with Wall Street firms, but it far exceeds any competition available for personal accounts. The delay is enough to be prohibitive for very high frequency trades, but is still impressive. Note that we took QuantConnect at their word here. These times were not tested.

Community

Small but mighty! In my experience, searching alone will seldom solve your problem — but people are happy to help. Posting a specific question on the forum typically yields a quick response. The QuantConnect team is also very active and helpful. They go as far as to offer free code reviews to interested users.

Price

If you’re only looking for a research platform, the price can be free. However, if you have absolutely no interest in live trading, there are plenty of alternatives. To take advantage of where QC really excels, the price starts at $20/month. Depending on the size of your portfolio, that may sound too good to be true, or a little steep. However, I’m confident even rather small portfolios will more than make up for the price with a good algorithm.

The real expense happens if you want to run multiple algorithms simultaneously. Each algorithm runs on its own, isolated VPS. This creates large reliability and performance benefits, but comes at a cost. Every additional algorithm is an added $10.

Bonus features

One unique feature QC has to offer is version control. Every time an algorithm is backtested, a snapshot is created. This allows for development without worry — knowing you can revert to a previous version at any time.

Alternatives

Quantopian, once a fierce competitor, has suspended all live trading. It is still a capable research platform, if you’re willing to forgo trading with real money.

IBridgePy is a self-hosted solution compatible with Interactive Brokers.

Most other alternatives are inaccessible to the average investor and targeted towards hedge funds. AlgoTrader is a prime example.

Bottom Line

QuantConnect is not without its problems; no platform is. However, it far exceeds anything offered by the competition and is improving every day. At a competitive price with a tremendous staff, no one else comes close.

Oh, and if you do end up retired under 30, sipping Piña Coladas by the beach, don’t forget about the blogger that helped get you there. 🙂

Nextcloud vs. ownCloud: History & Feature Comparison

If you’ve done research into self-hosted cloud storage, there are two main contenders that typically pop up: Nextcloud and ownCloud. So which one’s right for you? What’s the real difference between the two? Let’s find out.

For the impatient among you, I’m going to get right to the point. Nextcloud is the superior option in nearly every case. Why? I guess you’ll have to keep reading.

The story began long ago (2010), in a far away land (Germany). Frank Karlitschek, a KDE developer, started working on ownCloud. He envisioned an alternative to Dropbox, one more focused on privacy. In his blog, he argued “Privacy is the foundation of democracy.” In 2011, ownCloud Inc. was born in conjunction with the original software release. The company went on to raise 6.3 million USD in 2014 as it started to target more lucrative enterprise clients. The future looked bright for the young startup, but just two years later, things got messy.

On April 27, 2016 Frank Karlitschek left ownCloud. A few short days later, he founded Nextcloud, a direct rival. Karlitschek stayed pretty tight-lipped about the whole situation, but actions speak very loudly here. Leaving the company you co-founded and days later creating a competitor is a bold move indeed.

Investors agreed. Behind the scenes, ownCloud was on the verge of sealing a major deal. Karlitschek, along with several other core members, abandoning ship to form Nextcloud, was more than enough to send investors reeling. And the final nail in the coffin? Nextcloud decided to offer free support for any users migrating from ownCloud.

Just days later, ownCloud (the US company) shut its doors for the last time. So why the debate? OwnCloud is dead, right? Well… not quite. Like a good zombie, it won’t go down easy. While US based ownCloud Inc. had all its credit revoked, the parent German company, ownCloud GmbH, carries on to this day.

Karlitschek has never discussed the exact reasons leading up to his departure, though in the blog post announcing his resignation, he states, “…the company could have done a better job recognizing the achievements of the community. It sometimes has a tendency to control the work too closely and discuss things internally.” Most speculate ownCloud’s lack of interest in the outside development community ultimately led to Karlitschek’s decision.

Today, ownCloud continues to focus on its enterprise business, having a number of features available only to paying customers. In contrast, Nextcloud focuses heavily on security features: brute force protection, 2FA, video verification, and more. Nextcloud also remains 100% open source. This means all its features are included as standard, and enterprise customers just pay for support.

Enough history, let’s get on with the feature list.

Feature | Nextcloud | ownCloud
Open source | ✓ | Mostly
Unlimited storage | ✓ | ✓
Self-hosted | ✓ | ✓
Mobile app support | ✓ | ✓
Automatic media upload | ✓ | ✓
Integration with Outlook | ✓ | Premium only
Text search | ✓ | ✓
Version control | ✓ | ✓
Calendar, contacts, etc | ✓ | x
Notifications | ✓ | x
Server-side encryption | ✓ | ✓
Client-side encryption | Yes, but buggy | x
Access controls | Great | Good

Normally, I would jump right into the feature comparison, but I felt in this unique case, the history should play a role in the decision. Which software do you use? Let me know in the comments.