Invaluable lessons from contributing to and leading tasks on an Omdena Machine Learning challenge with 50 collaborators. In this article, we share the 7 most important features of a successful, production-level machine learning project run by a large team.
Code, frameworks, and models are definitely important in Machine Learning projects, but they are means to an end: they are tools. Choosing them wisely is one component of a greater whole, and depends on the goals of the project, the end-use, and the data. All these components need to be put together in a way that makes sense and achieves the desired goal, and the only way to do this is through communication: it is what will differentiate an actual actionable product from a bunch of pieces loosely bundled together.
1. Communication is the single most important thing
On an Omdena challenge, you form a team with people from different countries around the world. What is the first thing you do? You start talking in English because that is probably the only common language. Simply put, you adjust your communication.
Now apply the same principle to the tech side, considering that the team members come from different backgrounds and have varying levels of expertise. If your team can’t communicate clearly, you will waste a lot of time (and effort) on confusion and rework.
Give high priority to your own and others’ communication. Don’t leave a meeting without asking about something you did not understand, and if you sense others are holding back questions, encourage them to ask. This may seem like “too much of an effort”, but it’s always better to do it sooner rather than later. Personal example: I find that images tend to communicate ideas better, so I was always sharing my screen and drawing things to make sure that we were really on the same page.
2. Understand (and read) the problem statement
If you don’t have one, write down a problem statement with client requirements. Then read it out loud. Then re-read it. Then read it once again. Then read it every week, at least.
Before a single line of code is written, make yourself familiar with the domain: learn not only its jargon but also its technical aspects. For example, our project’s main goal was lump (object) detection in ultrasound images. Breast sonograms are always generated from the top (patient skin) to the bottom (within the breast), so flipping vertically will most likely not be a valid augmentation: in practice, certain features will only appear below the lumps. Another example is that many of the features that help to identify a lump are actually outside of it, e.g. echo patterns. Constraints like these are uncommon in most object detection datasets but come up more often in the health domain.
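To make the augmentation point concrete, here is a minimal sketch of a flip that only mirrors images left to right and never reorders rows, so the skin stays at the top. The function name, the box format (x_min, y_min, x_max, y_max in pixels), and the use of NumPy are assumptions for illustration. Whether a horizontal flip is itself valid for your data is something to confirm with domain experts.

```python
import numpy as np


def random_horizontal_flip(image, boxes, rng, p=0.5):
    """Mirror an image left-right with probability p; rows are never reordered.

    image: array of shape (height, width) or (height, width, channels)
    boxes: array-like of shape (n, 4) as (x_min, y_min, x_max, y_max) in pixels
    rng:   a numpy.random.Generator
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    if rng.random() >= p:
        # No flip: return the inputs unchanged (boxes copied for safety).
        return image, boxes.copy()

    width = image.shape[1]
    flipped = image[:, ::-1].copy()  # reverse columns only, keep row order

    # Mirror the x coordinates; y coordinates are untouched.
    new_boxes = boxes.copy()
    new_boxes[:, 0] = width - boxes[:, 2]
    new_boxes[:, 2] = width - boxes[:, 0]
    return flipped, new_boxes
```

A call would look like `random_horizontal_flip(image, boxes, np.random.default_rng(0))`. Keeping the boxes in sync with the image matters as much as the flip itself.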
This is going to be invaluable to project meetings because you’ll be much more prepared to ask relevant questions. Additionally, having the problem requirements imprinted in the back of your head will help you and the team by both keeping a sense of urgency and giving a more accurate picture of what is important (and what is NOT).
3. Agree on expected outcomes and metrics
It’s only by visualizing what you want to achieve that you (and the team) will be able to move. Metrics represent an objective, a direction to move in, as they are the translation of client-understandable outcomes into the language of algorithms.
As Andrew Ng explains in Course 3 of the Deep Learning Specialization (the best course of the specialization, in my opinion): if a Machine Learning project were an archery competition, setting a metric would be like placing the target, and modeling would be like practicing to hit it. You do not want to spend a lot of time practicing to hit a target only to have it moved to a different location.
It’s also very important to agree on these metrics, as they will serve as a standard form of communicating your progress within the team. This is not to say that metrics can’t be changed: if down the line the team figures out that the metric being used is not adequate, it’s absolutely possible to change it.
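One way to make an agreed metric a real standard is to keep it in a single shared function that everybody calls. As an illustration only, here is Intersection over Union (IoU), a common building block for object detection metrics. Choosing IoU is an assumption made for this sketch, not necessarily the metric used in the project, and the function name and box format (x_min, y_min, x_max, y_max) are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) with x_min <= x_max, y_min <= y_max.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Two empty boxes have no defined overlap; report 0.0 by convention here.
    return intersection / union if union > 0 else 0.0
```

If the team later decides the metric is not adequate, changing it in one place keeps everyone's reported numbers comparable.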
4. Discuss Git Workflows
The main outcome of a Machine Learning project comes in the form of code, so tending to and caring for the code base is paramount. That makes it important to discuss Git workflows and best practices. The goal here is not to set practices or workflows in stone, but just to have a few key points discussed, sorted, and agreed upon within the team. It could be as little as agreeing that everybody should have their own branch, or that merges to the main branch should happen via Pull Requests reviewed by one or a few collaborators (perhaps whoever is responsible for each task).
You’ll be joining people coming from various backgrounds and levels of expertise, so this is also a good way to level the knowledge within the team. And it may avoid disastrous commits/merges later on.
5. Split the data VERY early on
The most important consequences of this step are:
i) Avoiding data leaks and the introduction of bias.
ii) Creating a standard communication in the form of a dataset. Having standardized splits will enable model comparison through their metrics.
This setup will also help when new data is acquired: it’ll make it faster to know where to place it.
It’s important to take a quick glimpse at the data as the first coding step, but do not go full Exploratory-Data-Analysis mode just yet: you can introduce your own biases without even knowing. Get an overview of the data, distributions, size of the dataset, data types (in the case of our project, image types). Right after that, discuss with teammates the split of the datasets. Here, having an understanding of the goals and of the domain will help a lot when thinking about what must be considered.
The most important thing for test and development sets is that they should have a distribution close to that of the end-use. Once set up, completely forget about the test set: you’ll only look at it again at the absolute end of the project. Use only the development set for evaluation and hyper-parameter tuning. For the train set, data similar to the end-use should be used as well, but you can add a little more variety: the model will have more examples to learn from, which is especially handy when little data is available, since you can use data that does not exactly match the end-use distribution.
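A standardized split can be as simple as a deterministic function that everybody on the team calls. The sketch below hashes a grouping key to pick the split, so the same key always lands in the same set and newly acquired data is placed without reshuffling anything. The function name, the fractions, and the idea of grouping by a patient identifier are assumptions for illustration. It also ignores the distribution considerations above, which still need a human decision.

```python
import hashlib


def assign_split(group_id, dev_fraction=0.1, test_fraction=0.1):
    """Map a grouping key (e.g. a hypothetical patient ID) to a split name.

    The mapping depends only on the key, so it is stable across machines,
    runs, and dataset growth.
    """
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    # First 8 hex digits -> integer in [0, 16**8) -> float in [0, 1).
    bucket = int(digest[:8], 16) / 16**8

    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + dev_fraction:
        return "dev"
    return "train"
```

Hashing a group key rather than a file name keeps all images from the same group in one split, which is one common way to avoid leaks between train and test.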
6. Make friends with the dataset(s)
This means inspecting the data thoroughly. I may be preaching to the choir here, but there is no way around this: you have to investigate and “talk” to your data.
For instance, inspect whether the data acquired matches the end-use. One of the datasets we acquired in our project contained only small patches of breast lumps. These images were very small, and contained practically only the lump: it did not match the end-use at all! Other examples are checking for class imbalance (which may change the approach to the problem, e.g. anomaly detection or class weights) and checking for duplicates, among many others.
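As a small example of “talking” to the data, the sketch below counts labels and derives class weights with the common “balanced” heuristic, n_samples / (n_classes * class_count), so rarer classes get larger weights. The function name and the labels in the usage note are hypothetical, and class weights are only one of the possible responses to imbalance.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight}, with rarer classes weighted more heavily."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }
```

For example, with one "lump" label and three "no_lump" labels, the weights come out as 2.0 for "lump" and roughly 0.67 for "no_lump". Printing the raw counts first is often enough to start the conversation with the team.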
7. Adapt tools to the work
Reading about and researching both the problem and the domain will also help you adapt your work to this domain. Noticed how I mentioned duplicates in the last topic? Checking for them is more common with tabular data, but it’s just as important when working with images: it’s just done differently! In the ultrasound project, our approach was to run all images through an EfficientNet, get their embeddings (vector representations) from one of the last layers, use a clustering algorithm to visualize them and select a distance threshold, and remove images that were too similar. This made sure duplicated images were not used in training.
Another example: should I use TensorFlow or PyTorch? It should not matter that much unless you specifically need to use one of them due to previous work or client requirements. One solution would be using ONNX, an open format for exchanging models between frameworks.
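The last step of that approach can be sketched roughly as follows. This is a simplified stand-in, not the project's code: it assumes the embeddings were already computed, uses cosine distance, and replaces the clustering step with a greedy pairwise comparison. The function name and the default threshold are hypothetical, and the threshold should be chosen by looking at the data.

```python
import numpy as np


def find_near_duplicates(embeddings, threshold=0.05):
    """Return indices of images whose embedding lies within `threshold`
    cosine distance of an earlier image that was kept."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # avoid division by zero

    kept, dropped = [], []
    for i in range(len(unit)):
        # Cosine similarity against every kept image so far.
        if kept and np.max(unit[kept] @ unit[i]) >= 1.0 - threshold:
            dropped.append(i)
        else:
            kept.append(i)
    return dropped
```

The comparison is quadratic in the number of images, which is fine for small datasets; larger ones would need an approximate nearest-neighbor index or the clustering step described above.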
Another solution (if using more than one framework is feasible) would be applying Object-Oriented Programming principles and making the API or Pipeline unaware of the type of object. For example, you could create a common interface (abstract base class) to wrap all possible prediction models, import the concrete implementations in the API and call a specific method: from the deployment perspective, it only matters that the object has the method.
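A minimal sketch of that idea is shown below. All class and function names are hypothetical, and the two wrappers take a plain callable as a stand-in for a real TensorFlow or PyTorch model so that the example stays self-contained.

```python
from abc import ABC, abstractmethod


class PredictionModel(ABC):
    """Common interface the API depends on (hypothetical name)."""

    @abstractmethod
    def predict(self, image):
        """Return a list of detections for one image."""


class FrameworkAModel(PredictionModel):
    """Stand-in for a wrapper around, e.g., a TensorFlow model."""

    def __init__(self, backend):
        self._backend = backend  # any callable: image -> iterable of detections

    def predict(self, image):
        # Framework-specific pre/post-processing would live here.
        return list(self._backend(image))


class FrameworkBModel(PredictionModel):
    """Stand-in for a wrapper around, e.g., a PyTorch model."""

    def __init__(self, backend):
        self._backend = backend

    def predict(self, image):
        return list(self._backend(image))


def run_inference(model: PredictionModel, image):
    """Deployment-side code: it relies only on `predict` existing."""
    return model.predict(image)
```

With this in place, `run_inference` works the same for either wrapper, and Python refuses to instantiate a subclass that forgets to implement `predict`.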
In a team, it is not just about finishing your work and dumping it somewhere: from the project’s perspective, it is even more important (and way more difficult) to make sure that all the parts make sense and work well when put together.
Ask questions, show interest, talk to and learn from other teammates working in other parts of the project: it’s your job as much as it is theirs to deliver a final product that makes sense, be it a full-fledged Inference API deployed in the Cloud or a PowerPoint presentation. It is not enough to just train “your” model and leave the deployment to the “deployment people”: the result is achieved by the whole team, or it isn’t. This part can be easily overlooked, so remember to check for that!
You and each one of your teammates are pieces of a whole. Every topic in this article involves or relates to communication in some form, and this is not a coincidence. Collaborative projects are about merging the work of different people to create a result much bigger than the sum of the individual contributions: in collaborative projects, 1 + 1 should equal 3, not just 2!
I firmly believe that the only way to achieve this is good and effective communication. It doesn’t mean that everybody should know how to do everything, but it does mean that everyone in the project should at least understand what is being done across the entire project: it is a fine mix of specific work, holistic view, and good communication that will make the project and its members successful.