Your go-to MLOps Guide
The need for MLOps
The process of developing a software application doesn’t simply end with writing code and testing it. There are a whole bunch of steps that follow for an application to be fully ready for production.
Once a software application is thoroughly tested, it is first packaged or containerized. The application is then load-tested to understand how much throughput it can handle. It is then deployed in a cluster with autoscaling configured to scale the application whenever the load increases. It is monitored to check that it is healthy, that response times stay under the desired upper limits, and that the application gives the desired output. The application logs are saved and rotated. When a new version of the application is released, it is rolled out to users in a way that ensures there is no downtime.
If CI/CD is followed properly then as soon as a change is made to the codebase the application is deployed in a staging environment where it is tested immediately. If there are any issues then they are flagged immediately and fixed. Fixes are deployed immediately and the improved version of the application is tested again.
What about ML models?
Just as we follow a cadence to build, maintain, and scale software applications, we should follow a cadence for ML models. That is exactly what MLOps is all about.
MLOps at NeuralSpace
At NeuralSpace we follow the same principles as software applications for ML models. We have also designed our platform so that you can train, deploy, test, and iterate NLP models faster and go live with the best model possible without any downtime. You can also handle any throughput you want with just a few clicks.
Let’s break down the whole process for you:
Let's say you have a social media platform and you use an ML model to moderate text content by classifying it into two categories (abusive or not abusive). You train the model and test it on a test set that represents your user-generated content. Once your model gives good results, you deploy it as a microservice. You then realize that at peak hours your service is overloaded, so you scale your model microservice and deploy multiple replicas to handle more load.
Now you want to monitor everything that has passed through your model, e.g., all model predictions. So you set up some logging mechanisms and build a UI to view and fix model predictions. You fix the mistakes that the model made and add more data to your training and testing sets, so that you can retrain your model later and test whether it has improved.
You retrain the model and see that it has improved over the previous model, so it is time to update the model in production. But since you cannot have any downtime, you need to package/containerize your model again, version it, and roll out the new version using standard DevOps techniques to maintain 100% uptime.
Wait, you are not done yet. You have to repeat all this from time to time to make sure your model is up to date and adapts to changes in data (user behavior).
Now some questions to think about:
How can you make this process more structured by following best practices?
Can you use software development and CI/CD principles to train and deploy models in an “Agile” way?
Of course, the answer is yes!
How to build a solid MLOps process
The following are some steps you can follow to build a solid process for MLOps.
Step 1: Automate your model training pipeline
Step 2: Manage your Data
Manage your data in cloud storage if storage is not already provided by your AutoML provider. NeuralSpace has a Data Studio to create, update, and manage your data for various NLP tasks.
Version your datasets. There are many ways to do it yourself if it is not already done by your AutoML provider. You can either hash the entire dataset to generate a unique version automatically, or use a DB to keep metadata about each dataset version while the actual files stay in cloud storage.
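As a rough illustration of the hashing approach, here is a minimal sketch that derives a version string from the contents of a dataset directory. The function name `dataset_version` and the 12-character length are assumptions made for this example, not part of any NeuralSpace API.

```python
import hashlib
from pathlib import Path


def dataset_version(root: str, length: int = 12) -> str:
    """Return a short content hash covering every file under `root`.

    Hypothetical helper: the name and the truncation length are
    illustrative choices, not an established convention.
    """
    digest = hashlib.sha256()
    base = Path(root)
    # Sort paths so the hash does not depend on filesystem listing order.
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        # Include the relative path so that renaming a file changes the version.
        digest.update(path.relative_to(base).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files are never fully in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()[:length]
```

The same data always produces the same version, and any change to a file's content or path produces a different one, so the string can be stored next to model metrics to record which data a model was trained on.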
Step 3: Continuous Integration
Use CI pipelines to trigger training and benchmarking with specific dataset versions. If budget is a concern, go for Azure ACI, OVH Cloud, or other cloud providers that offer on-demand options for spot GPU instances. You can also run a hyperparameter search every time, but that depends on how much you are willing to spend. If you have already found the best model architecture, you can just reuse the same configuration.
Step 4: Monitor Model Performance
Monitor model performance continuously. At NeuralSpace you get model metrics every time you run AutoNLP. Otherwise, you can use platforms like Weights and Biases to log model metrics and meta information and keep track of the training process.
Step 5: Continuous Deployment
Deploy new models automatically so that you can compare them with the model in production. Use CD pipelines to automate the deployment process.
Use a fire-and-forget mechanism to evaluate your new model on real-world data: send it a copy of live requests without waiting for, or serving, its response. This gives you a real picture of how the new model performs in a real-world scenario. You might have to add your own custom logic to do this, so make sure to account for it in the design.
A/B test your new model by gradually increasing traffic to the new model, e.g., 10%, then 20%, then 30%, and eventually 100%.
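One way to picture the gradual traffic increase is a deterministic router that sends a fixed percentage of requests to the new model. This is a minimal sketch under assumptions: the `route` function, the `"new"`/`"production"` labels, and the use of a request or user ID as the hashing key are all hypothetical choices for this example.

```python
import hashlib


def route(request_id: str, new_model_percent: int) -> str:
    """Pick "new" or "production" for a request, deterministically.

    Hypothetical helper: hashing the ID keeps a given user on the same
    model for a given percentage, and raising the percentage only ever
    moves users from "production" to "new".
    """
    if not 0 <= new_model_percent <= 100:
        raise ValueError("new_model_percent must be between 0 and 100")
    # Map the ID to a stable bucket in the range [0, 100).
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return "new" if bucket < new_model_percent else "production"


# Rollout schedule taken from the text: 10%, 20%, 30%, and eventually 100%.
ROLLOUT_STEPS = [10, 20, 30, 100]

if __name__ == "__main__":
    ids = [f"user-{i}" for i in range(10_000)]
    for percent in ROLLOUT_STEPS:
        share = sum(route(i, percent) == "new" for i in ids) / len(ids)
        print(f"target {percent}% -> observed {share:.1%}")
```

Hashing instead of picking at random means the split is reproducible, which makes it easier to compare the two models on the same group of users between rollout steps.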
Step 6: Cadence is Key
Create a model evaluation and improvement checklist with some or all of the above-mentioned steps.
Make sure you keep your model up to date.
Whenever a new state-of-the-art model comes out follow your checklist and benchmark it against the existing model. You will be surprised how efficient simple yet elegant models/solutions can be. Don’t pick a new model just because it is cool. Use it because it works best with your data and use case.
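A checklist item such as "benchmark the new model against the existing one" can be turned into a small automated gate. The sketch below is only an illustration: `accuracy`, `should_promote`, and the `min_gain` threshold are hypothetical names and values, and accuracy stands in for whatever metric fits your use case.

```python
from typing import Callable, Sequence

# A model is represented here as any function from text to a label.
Model = Callable[[str], str]


def accuracy(model: Model, texts: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of test examples the model labels correctly."""
    if not texts or len(texts) != len(labels):
        raise ValueError("texts and labels must be non-empty and the same length")
    correct = sum(model(text) == label for text, label in zip(texts, labels))
    return correct / len(texts)


def should_promote(
    candidate: Model,
    production: Model,
    texts: Sequence[str],
    labels: Sequence[str],
    min_gain: float = 0.01,
) -> bool:
    """Promote only if the candidate beats production by at least `min_gain`.

    Both models are scored on the same test set, so the comparison reflects
    your data rather than a published benchmark.
    """
    gain = accuracy(candidate, texts, labels) - accuracy(production, texts, labels)
    return gain >= min_gain
```

For example, a CI job could call `should_promote` on the current test set version and fail the pipeline when it returns `False`, so a newer architecture only replaces the existing model when it actually performs better on your data.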
AutoMLOps at NeuralSpace
Now the question is: how can we deploy the model so that it can serve thousands of API requests every minute?
AutoMLOps is the answer.
Once the model is deployed, it is encapsulated into a production-ready infrastructure by AutoMLOps that allows your model to process a large number of requests per second with a linear increase in your infrastructure.
In short, AutoMLOps provides scalability and availability to your models and it seamlessly manages all operations from compute resources allocation to scaling the model.
There are several things you need in order to deploy a model in production. For any software application, it is common to host multiple instances (or replicas) of the same application so that it can handle high traffic loads and reduce failure rates. NeuralSpace’s AutoMLOps ensures that your model is available 24/7/365 and can process a large number of API requests per second. When deploying, developers have the option to select multiple replicas, which together keep latency low whenever your model is requested by the end user. Also, in the rare case that one replica fails, there are others to fall back on.
This is achieved by serving your model replicas in an autoscaling environment. Each model replica also uses our core pre-trained models, so if the requests on your replica increase considerably, our backend services scale automatically. Each replica is deployed as an independent model with an independent service manager. Thus, while parsing, the replicas are called directly by our middleware so that latency is kept to a minimum at all times. If a replica fails or crashes, which is very rare, a replacement replica is spun up automatically and is live in less than ten seconds.
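To make the idea of replicas with fallback and automatic replacement more concrete, here is a toy, in-process sketch. It is not how AutoMLOps is implemented: `ReplicaPool`, the `factory` argument, and the round-robin policy are assumptions for this example, and a real deployment would rely on an orchestrator rather than a Python class.

```python
from typing import Callable, List, Optional

# A replica is modeled as any function from input text to a prediction.
Replica = Callable[[str], str]


class ReplicaPool:
    """Round-robin over replicas, replacing any replica that raises."""

    def __init__(self, factory: Callable[[], Replica], size: int) -> None:
        if size < 1:
            raise ValueError("size must be at least 1")
        self._factory = factory
        self._replicas: List[Replica] = [factory() for _ in range(size)]
        self._next = 0

    def predict(self, text: str) -> str:
        last_error: Optional[Exception] = None
        # Try each replica at most once per request.
        for _ in range(len(self._replicas)):
            index = self._next
            self._next = (self._next + 1) % len(self._replicas)
            try:
                return self._replicas[index](text)
            except Exception as error:
                last_error = error
                # Spin up a replacement, then fall back to the next replica.
                self._replicas[index] = self._factory()
        raise RuntimeError("all replicas failed") from last_error
```

A request only fails if every replica fails in the same pass, and each failed replica is swapped for a fresh one so the pool returns to full size without manual intervention.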
Check out this quick explainer video to learn more about AutoMLOps.
Join the NeuralSpace Slack Community to connect with us. Also, receive updates and discuss topics in NLP for low-resource languages with fellow developers and researchers.
Check out our Documentation to read more about the NeuralSpace Platform and its different Apps.
Sign-up on the NeuralSpace Platform.