Minimalism in Applied ML
Data is Abundant, Your Time is Not
Full Stack ML doesn’t have to take months on end like it used to; it can be done in days. Like any product or service, ML applications have to prove their worth in a market - the faster an idea is synthesized, the faster that idea can be tested.
Common wisdom in ML holds that it takes weeks (if you’re lucky) or months (probably) to put together a dataset and train a model. The truth is, in many cases it can take less than a day to prove out your idea. I call the following approach Minimalism in Applied ML - it’s a mindset, not a rule set.
What follows is a generalized discussion of how I think when it comes to acquiring a dataset for my problem.
The Problem isn’t Unique
Why do people think it takes so long to create custom models? They think their problems are unique. But for most ML problems, there is a dataset out there already that is going to be pretty close. That’s just the state of affairs these days.
My first step is always researching related topics and problems, and then finding datasets. What turns up won’t always be a buttoned-up dataset, perfectly labeled and formatted for your favorite ML platform, but it will usually be straightforward to fix.
When this works, I haven’t spent months creating, labeling, and cleaning a dataset, but I’ve got one anyway. It’s the same concept as pulling down code from GitHub or Stack Overflow and adapting it to your needs. There is no reason to reinvent the wheel.
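As a small illustration of how light that kind of fix can be, here is a minimal sketch of one common reformatting job: converting a bounding box from COCO-style pixel coordinates to the normalized center format that YOLO-style trainers expect. The function name and the sample annotation are hypothetical; only the two coordinate conventions are standard.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO-style box [x_min, y_min, width, height] in pixels
    to a YOLO-style box (x_center, y_center, width, height) scaled to 0..1."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / img_w,  # box center, as a fraction of image width
        (y_min + h / 2) / img_h,  # box center, as a fraction of image height
        w / img_w,
        h / img_h,
    )

# Hypothetical annotation: a 40x40 px box at (300, 220) in a 640x480 image.
box = coco_to_yolo([300, 220, 40, 40], img_w=640, img_h=480)
print(" ".join(f"{v:.4f}" for v in box))  # 0.5000 0.5000 0.0625 0.0833
```

A loop over the annotation file with a function like this is often all the "cleaning" a found dataset needs before a first training run.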
Generate Data if Necessary
Finding an existing dataset isn’t always possible. When a problem is unique enough that there isn’t a dataset out there after a few minutes of searching, it’s not back to square one. Say the data is protected by HIPAA or is export controlled. No matter - it’s usually possible to create some fake data that will be close enough.
There are many ways to build a dataset quickly, and frankly, for a first pass, it’s not necessary to have thousands of examples to see if a model is going to work. Three ideas are at play here:
There are many tools out there that make it easy to create formatted data fast.
Custom models aren’t always necessary - just fine-tune an off-the-shelf model.
Augment or synthesize new fake data - it will get you far enough to prove viability.
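To make the third idea concrete, here is a rough sketch of synthesizing stand-in records when the real ones are off limits. Everything in it is an assumption for illustration: the field names, the value ranges, and the labeling rule are invented and not drawn from any real dataset.

```python
import random

def make_fake_records(n, seed=0):
    """Generate n made-up patient-like rows. All fields and ranges are invented."""
    rng = random.Random(seed)  # seeded so the fake dataset is reproducible
    records = []
    for i in range(n):
        age = rng.randint(18, 90)
        # Invented relationship: blood pressure drifts upward with age, plus noise.
        systolic = round(rng.gauss(120 + 0.3 * (age - 18), 12))
        records.append({
            "patient_id": f"FAKE-{i:04d}",
            "age": age,
            "systolic_bp": systolic,
            "label": int(systolic >= 140),  # hypothetical target to predict
        })
    return records

rows = make_fake_records(200)
print(rows[0])
```

Data like this won’t tell you how well a model works on the real thing, but it is enough to exercise the full pipeline end to end before anyone has to touch protected records.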
My go-to choice for labeling data is Roboflow for CV problems, and Prodigy for NLP problems. Both allow me to label data and put it in a format that I can plug right into a model for training.
I’ll explain with a recent example. The latest pet project idea Dillon and I came up with was golf ball detection with an iPhone. I took 10 minutes and looked for CV datasets for golf balls and didn’t find one in that time. Okay, can I build a quick dataset? I looked for another 5 minutes and found a site with videos of people playing golf that I could download for free. That’s my data. All I had to do was label it for object detection. Tedious? Maybe a little, but with Roboflow, uploading the mp4 and labeling the golf balls in each frame took probably half an hour for about 100 frames. Even better, they’ve got data augmentation and generation tools on there. For the time-price of 100 labeled images, I had 1,500 perfectly formatted images - plenty to fine-tune an existing model.
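For a feel of what augmentation does to a labeled frame, here is a toy sketch in plain Python. It is not how Roboflow implements anything; the function names, the tiny grayscale "frame", and the choice of transformations are all assumptions for illustration. The point it shows is that a geometric change such as a flip has to move the bounding box too, while a brightness change leaves the box alone.

```python
def hflip(image, box):
    """Mirror a row-major image left to right and move its box to match.
    box is (x_min, y_min, width, height) in pixels."""
    img_w = len(image[0])
    x_min, y_min, w, h = box
    flipped = [row[::-1] for row in image]
    return flipped, (img_w - x_min - w, y_min, w, h)

def adjust_brightness(image, factor):
    """Scale 0..255 grayscale pixel values, clamping at 255. Boxes are unaffected."""
    return [[min(255, int(p * factor)) for p in row] for row in image]

def augment(image, box, factors=(0.7, 1.0, 1.3)):
    """Return every flip x brightness combination as (image, box) pairs."""
    out = []
    for img, b in ((image, box), hflip(image, box)):
        for f in factors:
            out.append((adjust_brightness(img, f), b))
    return out

# Toy 4x6 "frame" with a bright 2x2 "ball" whose top-left corner is at x=1, y=1.
frame = [[10] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        frame[y][x] = 200

variants = augment(frame, (1, 1, 2, 2))
print(len(variants))   # 6
print(variants[3][1])  # (3, 1, 2, 2)
```

Two simple transformations already turn one labeled frame into six. The jump from 100 frames to 1,500 images works out to 15 images per labeled frame, which a real tool reaches by combining more transformations than this sketch does.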
It’s not hard to come up with a dataset and it’s a lot easier to be creative than you might think. The mindset is this: Start with a fast spin-up so you can get feedback, and iterate. You can’t iterate on nothing; thus the fastest way to iterate is the same as the fastest way to your first model.