Machine learning on a budget, Part 2

Hit the ground running!

In part 1 of this short guide to machine learning and data collection, we talked about a few different methods

How much accuracy do you need?

Accuracy doesn't scale linearly. If 95% can be reached with 100k data points, it might take another 100k to reach 96%!

The relation between the amount of training data and the accuracy of the trained machine is far from linear. If you can achieve 95% accuracy with 100 thousand labelled data points, it might take another 100 thousand to get to 96%. That 1% improvement could double the cost of obtaining the data, the time needed to collect it and the time needed to train the machine. But will it double your profits?
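To get a feel for these diminishing returns, here is a minimal sketch. It assumes, purely for illustration, that the error rate falls as a power law of the training-set size. That is a common rule of thumb, not something your model is guaranteed to follow. The curve is fitted to the two example figures above, and the function names are made up for this example.

```python
import math

def fit_power_law(n1, err1, n2, err2):
    """Fit err(n) = a * n**(-b) through two (dataset size, error rate) points."""
    b = math.log(err1 / err2) / math.log(n2 / n1)
    a = err1 * n1 ** b
    return a, b

def points_needed(a, b, target_error):
    """Invert the assumed curve: dataset size at which the error hits target_error."""
    return (a / target_error) ** (1 / b)

# Example figures from the text: 95% accuracy at 100k points, 96% at 200k.
a, b = fit_power_law(100_000, 0.05, 200_000, 0.04)

for accuracy in (0.95, 0.96, 0.97, 0.98):
    n = points_needed(a, b, 1 - accuracy)
    print(f"{accuracy:.0%} accuracy -> roughly {n:,.0f} labelled points")
```

Under this assumed curve, every extra percentage point costs more than the last one, so it pays to estimate the curve for your own model before ordering the next batch of labels.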

I'd say it strongly depends on what you're selling. If you're offering a pioneering, innovative solution, it might be wise to launch your product with 95% accuracy, make your brand internationally known, close some deals with new investors, and use the profits to cover the costs of the initial development phase. Then you can take your time (and use your new resources) to collect more data, reach 96% accuracy and start selling version 2.0 of your already successful product.

On the other hand, if someone else has beaten you to market, you might need that extra 1% accuracy to make any profit at all. The message here is: hire some serious experts to analyse what impact the next batch of labelled data points would have, and whether it would be easier and cheaper to make improvements through changes in the algorithm. Instead of gaining a 1% improvement by doubling the training dataset, you might be able to gain 3% accuracy without acquiring any new data, simply by tweaking a few parameters in the algorithm.

How much autonomy do you need to offer?

A setup where the machine helps you with your task is way easier to achieve than one where you need to rely on it completely.

As I explained above, most machine learning algorithms calculate, for a given input, a value that lies somewhere in between the various valid outputs, and then choose and return the valid output that is closest to that intermediate value. If you want, you can instead tell the machine to be honest when it's in doubt. That is, the possible outputs could be: "this is definitely a picture of a cat", "this is definitely not a picture of a cat" and "this one is tricky... I'm not sure".
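As a minimal sketch of that third answer, assume the intermediate value is a probability between 0 and 1 that the picture shows a cat. The function name and the 0.9 threshold are hypothetical choices for this example; in practice you would tune the threshold on data the machine hasn't seen.

```python
def classify_with_doubt(p_cat, sure_threshold=0.9):
    """Turn a cat probability into one of three answers instead of two.

    p_cat is the model's intermediate value, assumed to lie between 0 and 1.
    sure_threshold is a hypothetical setting; tune it on held-out data.
    """
    if not 0.0 <= p_cat <= 1.0:
        raise ValueError("p_cat must lie between 0 and 1")
    if p_cat >= sure_threshold:
        return "cat"
    if p_cat <= 1.0 - sure_threshold:
        return "no cat"
    return "not sure"

for p in (0.98, 0.55, 0.03):
    print(p, "->", classify_with_doubt(p))
```

Raising the threshold makes the machine ask for help more often but get fewer of its confident answers wrong; lowering it does the opposite.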

With this approach, the machine is not entirely autonomous, but it requires the human users to work less than without the machine. If a worker sadly has to classify 4000 pictures a day into the categories "cat" and "no cat", I'm sure they'd be very glad to use a program that can classify 3900 of them on its own, asking the worker to classify only the 100 remaining ones.

If the task is to classify pictures into not just two but thousands of different categories, the machine can still narrow down the possibilities for the human user even when it is uncertain. For example, it could say "I'm not sure what is in this picture. It's definitely a mushroom, yet I don't know the exact species", and then the human would know they should consult a mushroom expert.
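One way to get that behaviour, sketched here under the assumption that the model returns one probability per species and that you maintain your own mapping from species to broader groups, is to add up the probabilities within each group and report the group when no single species is convincing. All labels, numbers and names below are made up for illustration.

```python
def coarse_or_fine(species_probs, groups, sure_threshold=0.9):
    """Return the most specific answer the model is confident about.

    species_probs maps each fine-grained label to its probability.
    groups maps each fine-grained label to a broader category.
    """
    best_species = max(species_probs, key=species_probs.get)
    if species_probs[best_species] >= sure_threshold:
        return f"definitely {best_species}"

    # Add up the probability of every species in the same broad group.
    group_probs = {}
    for species, p in species_probs.items():
        group = groups[species]
        group_probs[group] = group_probs.get(group, 0.0) + p

    best_group = max(group_probs, key=group_probs.get)
    if group_probs[best_group] >= sure_threshold:
        return f"definitely a {best_group}, species unknown"
    return "not sure"

# Made-up labels and probabilities, for illustration only.
groups = {"chanterelle": "mushroom", "death cap": "mushroom", "tabby cat": "cat"}
probs = {"chanterelle": 0.50, "death cap": 0.45, "tabby cat": 0.05}
print(coarse_or_fine(probs, groups))
```

Here neither mushroom species reaches the threshold on its own, but together they do, so the human learns which kind of expert to ask.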

Again, depending on the product you're selling, and depending on the expected cost of making a fully autonomous version, this hybrid version could be a profitable idea that requires much less labelled data.

On top of that, the hybrid algorithm can be combined with active learning. Each time the machine is unsure of the output for a given input (which is an unlabelled data point) and it has to ask the human user for help, the decision made by the human can serve as a label. The new (labelled) data point can be used to train the machine further, improving its accuracy and confidence. Moreover, the new data point can be sent to a central server so that all instances of your product can be updated. This way each user would profit from other users being consulted by their machines.
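The bookkeeping for this can be quite small. The sketch below assumes a two-class problem ("cat" or "no cat") and wraps any model that returns a cat probability. The class, its method names and the toy stand-ins are all hypothetical, and the actual retraining and the upload to a central server are left to the caller.

```python
class HybridClassifier:
    """Wraps a model so that uncertain inputs are deferred to a human."""

    def __init__(self, predict_proba, ask_human, sure_threshold=0.9):
        self.predict_proba = predict_proba  # input -> probability of "cat"
        self.ask_human = ask_human          # input -> True / False from the user
        self.sure_threshold = sure_threshold
        self.new_labels = []                # (input, label) pairs collected so far

    def classify(self, x):
        p = self.predict_proba(x)
        if p >= self.sure_threshold:
            return True
        if p <= 1.0 - self.sure_threshold:
            return False
        # The machine is in doubt: defer to the human and keep the answer.
        label = self.ask_human(x)
        self.new_labels.append((x, label))
        return label

    def drain_new_labels(self):
        """Hand over the collected labels, e.g. for retraining or upload."""
        batch, self.new_labels = self.new_labels, []
        return batch

# Toy stand-ins: the "picture" is just a number and the model is a fixed rule.
def toy_model(x):
    return min(max(x, 0.0), 1.0)

def toy_human(x):
    return x >= 0.5

clf = HybridClassifier(toy_model, toy_human)
answers = [clf.classify(x) for x in (0.97, 0.6, 0.02, 0.4)]
batch = clf.drain_new_labels()
print(answers)
print(batch)
```

Only the two inputs the machine was unsure about end up in the batch, which is exactly the data that is most useful for further training.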

Get your clients to help you!

(And don't tell them it was my idea.)

The iterative process of improving your AI's capabilities with the help of others

And don't tell them it was my idea! Indeed, it isn't my idea. In his lectures and talks, Professor Andrew Ng defends the idea of using your clients as part of a positive feedback loop consisting of "data collection" >> "product development" >> "acquisition of users" >> "data collection" and so on.

In short, you start by collecting as much labelled data as you can. It might not suffice to make the product you were dreaming of, but maybe it suffices to make a sellable product. Once you have a sellable product, you can finally get your first few users. As your users apply the product to real-world cases, you obtain more data. This might be unlabelled data if you simply tell the algorithm to send every input to a central server to be stored, or it might be labelled data if you're using a hybrid approach as described in the previous section.

If you're getting new labelled data from your users, you can train the machines further. If you're getting unlabelled data, you can either get it labelled (depending on the costs) and train the machines further, or you can use it unlabelled via semi-supervised learning. In any case, the new data points can help you make a better product, which will get you more users, which will provide more data, and so the positive feedback loop continues.
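Semi-supervised learning covers several techniques. The sketch below shows one of the simplest, self-training (also called pseudo-labelling), on a deliberately tiny model: each picture is reduced to a single made-up number, and each class is represented by the average of its labelled numbers. It needs at least two classes in the labelled data. Everything here, including the function names and the margin setting, is an assumption for illustration and not a recipe for a real product.

```python
def centroids(points):
    """Mean feature value per label, for (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in points:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def self_train(labelled, unlabelled, margin=1.0, max_rounds=10):
    """Self-training: adopt the model's own confident guesses as labels."""
    labelled = list(labelled)
    remaining = list(unlabelled)
    for _ in range(max_rounds):
        model = centroids(labelled)
        still_unsure = []
        for value in remaining:
            # Distances to every class centroid, nearest first.
            ranked = sorted((abs(value - c), label) for label, c in model.items())
            (d_best, guess), (d_second, _) = ranked[0], ranked[1]
            if d_second - d_best >= margin:
                labelled.append((value, guess))  # confident: pseudo-label it
            else:
                still_unsure.append(value)       # ambiguous: leave unlabelled
        if len(still_unsure) == len(remaining):
            break                                # no progress, stop early
        remaining = still_unsure
    return centroids(labelled), remaining

# Made-up data: two labelled points and five unlabelled ones.
labelled = [(1.0, "cat"), (9.0, "no cat")]
unlabelled = [0.5, 2.0, 8.0, 10.0, 5.2]
model, leftover = self_train(labelled, unlabelled)
print(model)
print(leftover)
```

The point near the middle stays unlabelled because the model can't decide between the two classes; that is the kind of input worth paying a human to label.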

When some of Professor Ng's students at Stanford took his advice, they created a company called Blue River Technology and a product that used machine learning in agriculture. By fixing cameras onto tractors, their program could look at the plants and decide which ones were weeds (to be eliminated with pesticide) and which ones were part of the crop (to be nurtured with fertiliser). This reduced the amount of both pesticide and fertiliser used, and reduced the amount of pesticide that came into contact with the crop.

They started the loop by visiting crops personally and taking pictures manually, thus obtaining a relatively small database through a relatively large amount of work. But after a few iterations of the cycle, their startup was bought by John Deere for a whopping 305 million dollars.

Are you out of excuses not to implement your machine-learning-based million-dollar idea?
