Building machine learning systems is about building software and more
Let me start this article with a story. A friend who used to run a successful machine learning company, Marcos, was complaining: “the amount of effort spent on R&D on Machine Learning is usually a small fraction of the total [development] effort, or it’s not even there because we plan it for a future phase after building the application first”. After listening to this, Peter Norvig, his interlocutor, quoted a friend of his:
“Machine Learning development is like the raisins in a raisin bread: 1. You need the bread first 2. It’s just a few tiny raisins but without it, you would just have plain bread.”
I don’t like raisin bread, but I like the image Peter brought: the bread is the software (development) and the machine learning (models, data, experiments) is the raisins. When you compare both, the bread makes up most of the total, but the raisins are the special stuff, the magic ingredient.
Do we agree? Let’s go further with the image. Suppose you have invited some friends (or clients 😉) home and want to offer them a refreshment: what is the best scenario?
1. Offer them a few tiny raisins
2. Offer them a delicious bread
3. Offer them a delicious raisin bread
For sure, the best scenario is 3, but reaching it is not easy: it takes resources and time. Moreover, a company rarely operates in the best scenario. So, if we focus on the other two options, I think there is a clear winner: the delicious bread 😃
This article focuses on how to reach the “delicious bread”. In other words, how to build good software or, more specifically, how to write good code.
… Really? Really.
At first sight, you might think that this is not really needed, that data and models are the important stuff. However, our experience says that working in machine learning on poorly developed software systems, a consequence of poor engineering skills, makes everything much more difficult. Moreover, the lack of engineering skills is a known problem in the Data Science and Machine Learning community. In the report “State of Data Science 2022” by Anaconda, 38% of respondents said that expertise in engineering skills is missing (this was the most voted answer).
There is a second reason for learning to write good code: writing good code is key to becoming an end-to-end data scientist. If you are not familiar with the concept, roughly, an end-to-end data scientist is … . This kind of professional improves the machine learning development in the following ways: …
Principles to write good code
Every discipline has some basic rules to follow. In this section, we present some principles for writing good code. These principles are simple, but that does not mean they are simple to apply on a daily basis. Applying them is even harder when what we want to code is not clear, or when we are in a rush because something has to be solved soon (both common scenarios in a company).
A final comment before starting with the principles: if you pay attention, you will notice that most of them are related to ensuring code readability.
1. Use descriptive names for variables and functions. You have surely heard this advice before, and you have just as surely skipped it in some code you wrote recently.
2. Functions have to do only one thing. In a DS/ML context, it is common to find functions that do more than one thing, for example, “my function does data retrieval and data processing” or “my function does data processing and data presentation”. This is not recommended at all. Functions that do one thing are easier to understand, to re-use, and to test. Note that some functions exist only to call other functions in sequence, for instance, “my function etl calls the functions data_retrieving, data_processing, and data_store”. This is not problematic.
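As a minimal sketch of principles 1 and 2, the block below reuses the function names mentioned above (etl, data_retrieving, data_processing, data_store). Everything else is an assumption made for illustration: the CSV input, the "item" and "price" columns, and the in-memory list used as storage.

```python
import csv
import io


def data_retrieving(raw_csv: str) -> list[dict]:
    """Only reads: parse CSV text into a list of rows."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def data_processing(rows: list[dict]) -> list[dict]:
    """Only transforms: keep rows that have a price and cast it to float."""
    return [
        {"item": row["item"], "price": float(row["price"])}
        for row in rows
        if row["price"]
    ]


def data_store(rows: list[dict], storage: list) -> None:
    """Only writes: append the processed rows to the given storage."""
    storage.extend(rows)


def etl(raw_csv: str, storage: list) -> None:
    """Orchestrator: calls the three single-purpose steps in sequence."""
    rows = data_retrieving(raw_csv)
    clean_rows = data_processing(rows)
    data_store(clean_rows, storage)
```

Each step can be tested and re-used on its own, and etl reads almost like the sentence that describes it.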
3. Functions should be short: no more than 20 lines of code. Do you think this is not enough? In fact, some people say that functions should be even shorter. If your function does one thing, it would be odd for it to need more than 20 lines. If your function implements a complex process, the process can surely be split into smaller pieces.
4. Limit the levels of indentation. Needing more than 3 levels of indentation indicates that your code is not well organized and that you are missing one or two functions that would make it more readable. Please, don’t write deeply nested code like the example from https://codepict.com/indentation-in-python-with-examples/.
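To make the indentation point concrete, here is a hypothetical sketch (not taken from the linked page) of a deeply nested function, followed by one possible way to flatten it by extracting a small helper. The record format and the function names are assumptions for illustration only.

```python
def count_valid_scores_nested(records):
    """Hard to read: every check adds one more level of indentation."""
    count = 0
    for record in records:
        if record is not None:
            if "score" in record:
                if record["score"] is not None:
                    if 0 <= record["score"] <= 1:
                        count += 1
    return count


def has_valid_score(record):
    """All validity rules live in one small function with an early return."""
    if record is None or record.get("score") is None:
        return False
    return 0 <= record["score"] <= 1


def count_valid_scores(records):
    """Same result as the nested version, with a single level of indentation."""
    return sum(1 for record in records if has_valid_score(record))
```

Both versions return the same count; the second one also gives the validity rule a name that can be tested separately.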
5. Your modules/files should not have more than 500 lines. Just like a function has to do one concrete thing, a module has to offer a concrete service.
6. Test your code. We will skip the obvious comments here and give others that we think are relevant and not usually mentioned:
- Testing is not just about checking outputs; it is also about documenting, with examples, how your code behaves. When you have complex processes, an up-to-date test set will be much appreciated by the people joining your team.
- Testing also gives you the opportunity to work with small pieces of your real data. This allows you to iterate faster in the development process. It is not really smart to run your code with 10GB of data only to discover after 5 hours that you have missed one comma. If you can run the same routines in a test with a small piece of data, you will discover the same issue in less than one second.
- You don’t like testing? We know it is difficult at the beginning, but you should start with it. You will appreciate it when you discover 6 errors in your 5-line function. (That has happened to me a lot... Oh, and believe it or not, testing is a source of instant reward 🤩)
- Testing is easier when you have short functions that do one thing.
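The points above about testing can be sketched as follows. This is a minimal example under stated assumptions: parse_amounts is a hypothetical function, and the three sample lines stand in for a small slice of real data. The test functions use plain assert statements, so they can be called directly or collected by a test runner such as pytest.

```python
def parse_amounts(lines):
    """Turn 'name,amount' lines into (name, float amount) pairs."""
    parsed = []
    for line in lines:
        name, amount = line.strip().split(",")
        parsed.append((name, float(amount)))
    return parsed


def test_parse_amounts_on_small_sample():
    # A few hypothetical lines stand in for a small slice of the real data.
    sample = ["bread,2.50\n", "raisins,0.75\n", "flour,1.20\n"]
    assert parse_amounts(sample) == [
        ("bread", 2.5),
        ("raisins", 0.75),
        ("flour", 1.2),
    ]


def test_parse_amounts_rejects_line_without_comma():
    # Documents the behavior: a line with a missing comma fails immediately.
    try:
        parse_amounts(["bread 2.50\n"])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a line without a comma")
```

The second test doubles as documentation: a newcomer can see what happens with a malformed line without running the full pipeline.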
7. Use debugging tools. Everybody loves debugging with “prints”… until they learn how to use debugging tools 😎. Like testing, it may seem difficult at the beginning, but we can assure you it is not. You only need to learn how to set breakpoints (no harder than writing a print) and a few basic commands (“move to the next line”, “go to the next breakpoint”, “show current code”, etc.), and you will be able to inspect your code “in situ”. This way of debugging is much more powerful than static prints.
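As one possible illustration in Python, the built-in breakpoint() function (available since Python 3.7) opens the standard pdb debugger at the line where it is called. The normalize function and its debug flag are hypothetical; the flag is there only so that the example runs without stopping unless you ask for it.

```python
def normalize(values, debug=False):
    """Scale values so that they sum to 1."""
    total = sum(values)
    if debug:
        # Execution pauses here and opens the pdb prompt.
        # Basic commands: n (next line), c (continue to the next breakpoint),
        # l (show the current code), p total (print a variable).
        breakpoint()
    return [value / total for value in values]
```

Calling normalize([1, 1, 2], debug=True) from a terminal drops you into the debugger, where you can inspect total before the division happens. Setting the environment variable PYTHONBREAKPOINT=0 disables all breakpoint() calls.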
If you can combine testing and debugging, you will reach another level as a data scientist/software developer.
If you agree with the principles we just presented, you can now check how they are applied in your own projects. Here is a poll to do the check and share the results.
One important thing: we have to check the whole project, and not only the code we write, because writing good code in a project that does not have the same quality has no impact at all.
Your code is not good?
- Refactor the code that you use daily. It does not make sense to try to refactor the full system. Nice code should allow you to make quick improvements, so it does not make sense to tidy up code that you don’t interact with constantly and that already satisfies its functional requirements.
- Create the test first
- First tests have to be simple; the idea is to become familiar with the test tools
- Tests have to be simple and explicit
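A minimal sketch of "create the test first" before refactoring: the test below pins down what the existing code does today, so that any later clean-up can be checked against it. The function legacy_price_with_tax, the regions and the rates are all hypothetical.

```python
def legacy_price_with_tax(price, region):
    # Hypothetical existing code that we use daily and want to refactor.
    if region == "A":
        return round(price * 1.10, 2)
    else:
        if region == "B":
            return round(price * 1.20, 2)
        else:
            return price


def test_price_with_tax_current_behavior():
    # Simple and explicit: one input and one expected output per line.
    assert legacy_price_with_tax(100, "A") == 110.0
    assert legacy_price_with_tax(100, "B") == 120.0
    assert legacy_price_with_tax(100, "C") == 100
```

Once this test passes on the current code, the function can be reorganized step by step, re-running the test after each change.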