Machine Learning Operations: Taking AI From Proof-Of-Concept To Scale
Date: 2023-04-21
AI may be relatively new to IT systems, but requires the same, if not more, rigour to implement correctly.
Picture this: your e-commerce operations are hitting their stride, and you’re processing hundreds of thousands of transactions, poised to reach the millions during the festive season. Hand curation has long since become unwieldy and it’s clear that employing artificial intelligence and machine learning is the way forward to ensure that more buyers see more things they want and will buy. Plug, play, profit.
If only it were this easy.
More so than with any other tech process, it’s garbage-in, garbage-out. And with ML and AI, the results can be far more insidious, because an undesirable outcome is not always immediately apparent.
AI Is Not Just IT
The benefits of an AI solution go far beyond making a process more efficient and compounding the gains from that efficiency. With AI, one expects the outsized impact of business transformation: prediction, insight and analytics that allow businesses to make decisions beyond what would be possible from business intelligence dashboards alone.
Let us consider the case of banking software processing a transaction. You pass a debit entry, pass a credit entry, and the software takes care of it. It’s deterministic. The meaning of a Rs 100 debit or credit remains the same no matter when or where it is considered. This may not be the case for an AI model that, say, uses age as a variable. A person’s age today is not the same value weeks or years from now, and an AI model will treat it differently depending on when and in which scenario it is evaluated.
This makes AI models integrated into production systems very difficult to debug. In our banking example, errors could be introduced in calculations, but they would be relatively easy to debug. In the case of an AI model, there is no clear outcome that is “wrong”; there is an undesirable or underperforming outcome, and one must figure out where and why it has happened.
Engineering ++
MLOps is a combination of machine learning and software engineering. For a traditional software system, things can break down due to a bug. The bug can be fixed, and the system can resume operating, delivering the intended result. This fix-and-release cycle is what continuous integration and continuous delivery (CICD) supports.
An AI system can break down for a variety of reasons and deliver unintended outcomes, very often because of the data pipeline itself. So AI models also need continuous training and continuous monitoring: CICD + CTCM.
Data And Process Pipelines: Where Things Break
AI models need to be trained on data, that is, on what they’re expected to look at to recognize patterns and produce helpful output. The more data an AI model is trained on, the more likely it is to produce output that is aligned with the goals of the system. Naturally, the data being fed to the model needs to be vetted: it needs to be clean, accurate, privacy compliant, representative and unbiased, and it must meet any other requirements that the organization may have. This is the training data pipeline, and it is one of the factors in the lifecycle of the model. It is here that we have the beginnings of a trained model.
The next step, once the model is put into production, is feeding the AI model from a real-life data pipeline that comprises data in a production environment. This is the data that the model will use to produce the intended output or insight. However, we do not live in an ideal world, and even with all due diligence, input data in the real world may not be exactly like the training data. Therefore, the live data pipeline and the performance of the AI model need to be continuously monitored to keep the outcomes useful. If the model starts predicting outcomes based on slightly altered or wrong data, the problem will be very hard to identify, since anomalies in the output may take some time to be detected.
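One way to make this continuous monitoring concrete is to compare how a feature is distributed in live traffic against how it was distributed in the training sample. The sketch below uses the Population Stability Index (PSI), a commonly used drift measure. It is an illustration, not a method this article prescribes: the function name, the ten equal-width buckets, the sample age data and the 0.2 alert threshold are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training sample (expected) and a live sample (actual)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the training range
            counts[idx] += 1
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: ages seen in training, and the same customers 15 years on.
training_ages = list(range(20, 60))
live_same = list(training_ages)
live_shifted = [age + 15 for age in training_ages]

ALERT_THRESHOLD = 0.2  # assumed rule of thumb; tune per feature and business risk

psi_same = population_stability_index(training_ages, live_same)
psi_shifted = population_stability_index(training_ages, live_shifted)

print("no drift:", psi_same, "alert:", psi_same > ALERT_THRESHOLD)
print("shifted :", psi_shifted, "alert:", psi_shifted > ALERT_THRESHOLD)
```

A check like this runs on the inputs, so it can raise an alert before degraded predictions show up in business metrics.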
Unlike software systems that have a relatively simple versioning system, an AI/ML system will have versioning across the overall application, the AI/ML model as well as the data pipeline. The data itself, when massaged/modified for whatever reason, will need to be versioned and a “data lineage” developed to debug AI systems. This is an evolving space on its own. Clearly, there is a significant governance aspect to MLOps that is essential to keep things manageable and functioning on a continuous basis.
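A minimal sketch of what versioning data and keeping a “data lineage” can look like is shown below. It assumes a dataset small enough to hash in memory; the names `fingerprint`, `LineageEntry` and `LineageLog` are hypothetical, and real platforms offer far richer tooling. Each dataset version is identified by a hash of its content, and each transformation records which version it was derived from.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def fingerprint(rows: List[dict]) -> str:
    """Content hash of a dataset: the same rows in the same order give the same id."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


@dataclass
class LineageEntry:
    version: str           # content hash of the data after this step
    parent: Optional[str]  # version the data was derived from (None for raw data)
    step: str              # human-readable description of the transformation


@dataclass
class LineageLog:
    entries: List[LineageEntry] = field(default_factory=list)

    def record(self, rows: List[dict], step: str, parent: Optional[str] = None) -> str:
        version = fingerprint(rows)
        if version == parent:
            return version  # the step changed nothing, so no new version is created
        self.entries.append(LineageEntry(version, parent, step))
        return version

    def trace(self, version: str) -> List[str]:
        """Walk back from a version to the raw data; returns steps oldest first."""
        by_version: Dict[str, LineageEntry] = {e.version: e for e in self.entries}
        steps: List[str] = []
        current: Optional[str] = version
        while current is not None:
            entry = by_version[current]
            steps.append(entry.step)
            current = entry.parent
        return list(reversed(steps))


# Hypothetical example: a raw export, then a cleaning step.
raw = [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": None}]
cleaned = [row for row in raw if row["age"] is not None]

log = LineageLog()
v_raw = log.record(raw, "ingest raw export")
v_clean = log.record(cleaned, "drop rows with missing age", parent=v_raw)

print(v_clean, "<-", log.trace(v_clean))
```

When a model misbehaves, the version id stored alongside it can be traced back through every transformation to the raw data it started from.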
Governance Is Everything
Now, when our proof-of-concept must move into production, we must deal with actual, live data to feed the AI model. The distribution and schema of live data often do not exactly match the distribution and schema the AI model saw during training. This is where versioning, or “data lineage”, comes into play, and it must be rigorous if one is to maintain the performance of an AI/ML system. Considering the requirements of CICD, CTCM and versioning, it is almost imperative to use some sort of governance model to build a PoC and carry it into production. If not done correctly, the administration of an AI model can end up outweighing its benefits.
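A schema mismatch between training and live data is the easier half of this problem to catch mechanically. The sketch below checks each incoming record against the schema used in training before it reaches the model. The field names, the `TRAINING_SCHEMA` mapping and the `schema_violations` helper are illustrative assumptions, not part of any particular platform.

```python
from typing import Any, Dict, List

# Assumed schema captured at training time: field name -> expected Python type.
TRAINING_SCHEMA: Dict[str, type] = {
    "age": int,
    "monthly_spend": float,
    "city": str,
}


def schema_violations(record: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems: List[str] = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            got = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {got}")
    for name in record:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical live records: one conforming, one where age arrives as text
# and the city field has been dropped upstream.
good_record = {"age": 34, "monthly_spend": 120.5, "city": "Mumbai"}
bad_record = {"age": "34", "monthly_spend": 120.5}

print(schema_violations(good_record, TRAINING_SCHEMA))
print(schema_violations(bad_record, TRAINING_SCHEMA))
```

Rejecting or quarantining records that fail such a check keeps silent upstream changes from reaching the model unnoticed.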
As it stands, while AI and ML are accessible, they require a certain degree of scale to deliver results. Complex ecosystems, complex data and Fortune 500 scale with multiple models are ideal for this kind of application. They also represent the scale where MLOps and strict governance are essential. When dealing with multiple AI models in a large organization, one must ensure that the correct model, or version of model, is in play. One must also ensure that the model is working as expected, continuously, that the data is privacy protected and that the ethics of AI are adhered to.
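Ensuring that the correct version of a model is in play is usually the job of a model registry. The sketch below is a deliberately small, in-memory illustration of the idea. The `ModelRegistry` class, its stage names and the example model name are hypothetical, and production registries add storage, approvals and audit trails. It guarantees that at most one version of a model is marked as production, and it ties each version to the data version it was trained on.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    data_version: str       # id of the training data this version was built from
    stage: str = "staging"  # assumed stages: staging, production, archived


class ModelRegistry:
    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, data_version: str) -> ModelVersion:
        """Add a new version of a model; version numbers start at 1 and increase."""
        history = self._versions.setdefault(name, [])
        model = ModelVersion(name, len(history) + 1, data_version)
        history.append(model)
        return model

    def promote(self, name: str, version: int) -> None:
        """Mark one version as production and archive the previous production version."""
        history = self._versions[name]
        if not any(m.version == version for m in history):
            raise KeyError(f"{name} has no version {version}")
        self._versions[name] = [
            replace(m, stage="production") if m.version == version
            else replace(m, stage="archived") if m.stage == "production"
            else m
            for m in history
        ]

    def production(self, name: str) -> Optional[ModelVersion]:
        """Return the version currently serving, or None if nothing is promoted."""
        return next(
            (m for m in self._versions.get(name, []) if m.stage == "production"), None
        )

    def history(self, name: str) -> List[ModelVersion]:
        return list(self._versions.get(name, []))


# Hypothetical usage: two versions of a recommender, trained on different data.
registry = ModelRegistry()
registry.register("recommender", data_version="a1b2c3")
registry.register("recommender", data_version="d4e5f6")
registry.promote("recommender", 1)
registry.promote("recommender", 2)

live = registry.production("recommender")
print(live)
```

Serving code that always asks the registry for the production version cannot silently pick up a stale or untested model.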
Building and maintaining all of this in any reasonable fashion requires a governance framework. Platforms such as H2O and DataIQ (and others that include cloud-based and open source offerings) recommend building PoCs with frameworks that include workflows and versioning from step one.
Ultimately, the entire system must be treated as governed: the data sets, the AI models, and the engineering aspects in terms of latency, volume, compliance and ethics.
New Solutions, Same Rigour
AI models have the potential to exponentially improve outcomes for businesses if they are aligned with business goals. However, they require process rigour to implement for best results, much like IT systems in general have developed processes and frameworks for well-governed, reproducible results.
The nature of the outcomes of AI models makes debugging very difficult. For AI, debugging is not just a development task; it also involves the complexities of data pipelines, integrations with various systems, monitoring of end-to-end data-to-design pipelines and obtaining efficiencies across the system as it scales and changes.
AI may be relatively new to IT systems, but requires the same, if not more, rigour to implement correctly.
Ajoy Singh is COO at Fractal and Suraj Amonkar is a Client Partner- AI@Scale, Machine Vision and Conv.AI at Fractal.
The views expressed here are those of the author, and do not necessarily represent the views of BQ Prime or its editorial team.