Date: 2023-04-21

AI models need to be trained on data: the examples they are expected to learn patterns from in order to produce helpful output. In general, the more relevant, high-quality data a model is trained on, the more likely it is to produce output aligned with the goals of the system. Naturally, the data fed to the model needs to be vetted: it must be clean, accurate, privacy compliant, representative, and unbiased, and it must meet any other requirements the organization may have. This is the training data pipeline, and it is one of the stages in the lifecycle of the model. It is here that a trained model begins to take shape.
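To make the vetting step concrete, here is a minimal sketch of a filter that a training data pipeline might run before records reach the model. The field names (`age`, `income`, `label`), the age range, and the function name `vet_records` are all hypothetical choices for illustration; a real pipeline would add checks for privacy, bias, and representativeness that a few lines of code cannot capture.

```python
# Hypothetical schema: every training record must carry these fields.
REQUIRED_FIELDS = ("age", "income", "label")


def vet_records(records):
    """Split records into (clean, rejected) using a few simple checks.

    Each rejected entry is a (record, reason) pair so that the rejection
    can be reviewed later instead of being silently dropped.
    """
    clean, rejected = [], []
    seen = set()
    for rec in records:
        key = tuple(rec.get(field) for field in REQUIRED_FIELDS)
        if any(value is None for value in key):
            rejected.append((rec, "missing field"))
        elif key in seen:
            rejected.append((rec, "duplicate"))
        elif not 0 <= rec["age"] <= 120:  # illustrative plausibility range
            rejected.append((rec, "age out of range"))
        else:
            seen.add(key)
            clean.append(rec)
    return clean, rejected
```

Keeping the rejection reasons alongside the rejected records is a deliberate choice here: it lets the organization audit what the pipeline filtered out and why.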

The next step, once the model is put into production, is feeding it from a live data pipeline that carries data from the production environment. This is the data the model will use to produce the intended output or insight. However, we do not live in an ideal world, and even with all due diligence, input data in the real world may not look exactly like the training data. Therefore, both the live data pipeline and the performance of the AI model need to be continuously monitored so that outcomes remain useful. If the model starts making predictions from slightly altered or wrong data, the problem can be very hard to spot, because anomalies in the output may take some time to detect.
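One simple way to watch the live data pipeline is to compare a feature's live values against the values seen in training. The sketch below measures how far the live mean has moved, in units of the training standard deviation. The function names and the threshold of 1.0 are assumptions made for illustration; production systems often use richer statistical tests and monitor many features at once.

```python
import statistics


def drift_score(training_values, live_values):
    """Return how many training standard deviations the live mean has moved.

    Needs at least two training values (for the standard deviation)
    and at least one live value.
    """
    base_mean = statistics.mean(training_values)
    base_std = statistics.stdev(training_values)
    live_mean = statistics.mean(live_values)
    if base_std == 0:
        # A constant training feature: any change at all counts as drift.
        return 0.0 if live_mean == base_mean else float("inf")
    return abs(live_mean - base_mean) / base_std


def has_drifted(training_values, live_values, threshold=1.0):
    """Flag the feature when its drift score exceeds the (hypothetical) threshold."""
    return drift_score(training_values, live_values) > threshold
```

Checking the inputs this way can raise an alert before the effect shows up in the model's output, which is where the delay described above tends to hurt most.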

Unlike conventional software systems, which have a relatively simple versioning scheme, an AI/ML system needs versioning across the overall application, the AI/ML model, and the data pipeline. The data itself, whenever it is massaged or modified for whatever reason, needs to be versioned, and a “data lineage” maintained so that AI systems can be debugged. This is an evolving space in its own right. Clearly, there is a significant governance aspect to MLOps that is essential to keep things manageable and functioning on a continuous basis.
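As a rough illustration of data versioning and lineage, the sketch below derives a version identifier from a dataset's content and records each transformation with a pointer to the version it came from. The helper names `dataset_version` and `record_step` and the shape of the lineage entries are hypothetical; dedicated data versioning tools handle large datasets, storage, and access control far more thoroughly.

```python
import hashlib
import json


def dataset_version(rows):
    """Derive a short, deterministic version id from the dataset's content."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_step(lineage, rows, description):
    """Return a new lineage list with one more entry for this dataset state.

    Each entry points at its parent version, so a problem found in the
    model can be traced back through every modification of the data.
    """
    parent = lineage[-1]["version"] if lineage else None
    entry = {
        "version": dataset_version(rows),
        "parent": parent,
        "step": description,
    }
    return lineage + [entry]


# Usage: track a raw dataset and a cleaned copy derived from it.
raw = [{"age": 30, "income": 50000}, {"age": None, "income": 42000}]
cleaned = [row for row in raw if row["age"] is not None]

lineage = record_step([], raw, "raw export")
lineage = record_step(lineage, cleaned, "drop rows with missing age")
```

A full governance record would pair each data version with the model version and application version that used it, which is the three-way versioning described above.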