Wednesday, September 01, 2021

Ruminating on Data Drift and Concept Drift

Quite often, the performance (prediction accuracy) of an AI model degrades over time. One of the reasons this happens is a phenomenon called "Data Drift".

So what exactly is Data Drift? 

Data Drift can be defined as any change to the structure, semantics, or statistical properties of data - i.e. the model's input data.

  • Changes to the structure of data: new fields being added or old fields being deleted. This could happen because of an upgrade to an upstream system, a new sensor, etc. (see the schema-check sketch after this list).
  • Changes to the semantics of data: e.g. a new upstream system starts sending temperature in Fahrenheit instead of Celsius.
  • Changes to the statistical properties of data: e.g. a shift in atmospheric pressure levels due to environmental changes. There could also be data quality issues, such as a bug in an upstream system that delivers junk data.
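
As an illustration of detecting the first (structural) category, here is a minimal sketch in Python that compares an incoming batch against a reference dataset. It assumes the data arrives as pandas DataFrames; the function name check_schema_drift and the sample fields are hypothetical names chosen for illustration.

    import pandas as pd

    def check_schema_drift(reference_df: pd.DataFrame, incoming_df: pd.DataFrame) -> dict:
        """Compare column names and dtypes of an incoming batch against a reference dataset."""
        ref_cols, new_cols = set(reference_df.columns), set(incoming_df.columns)
        report = {
            "added_fields": sorted(new_cols - ref_cols),    # structural drift: new fields
            "deleted_fields": sorted(ref_cols - new_cols),  # structural drift: deleted fields
            "dtype_changes": {},                            # e.g. numbers now arriving as strings
        }
        for col in ref_cols & new_cols:
            if reference_df[col].dtype != incoming_df[col].dtype:
                report["dtype_changes"][col] = (str(reference_df[col].dtype),
                                                str(incoming_df[col].dtype))
        return report

    # Hypothetical example: an upstream system renamed a field and changed a dtype.
    reference = pd.DataFrame({"temp_c": [21.5, 22.0], "pressure": [1013, 1009]})
    incoming = pd.DataFrame({"temp_f": [70.7, 71.6], "pressure": ["1013", "1009"]})
    print(check_schema_drift(reference, incoming))
    # {'added_fields': ['temp_f'], 'deleted_fields': ['temp_c'],
    #  'dtype_changes': {'pressure': ('int64', 'object')}}

Note that a schema check alone would miss the Fahrenheit-vs-Celsius case if the field name and type stayed the same; that is where statistical checks come in.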

To maintain the accuracy of our AI models, it is imperative that we measure and monitor Data Drift. Our machine learning infrastructure needs tools that automatically detect data drift and pinpoint the features causing it.
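
One common way to do this per-feature pinpointing for numerical features is a two-sample statistical test between the training (reference) data and recent production data. Here is a minimal sketch using the Kolmogorov-Smirnov test from SciPy; the feature names, sample data, and the significance threshold alpha are illustrative assumptions, not prescriptions.

    from typing import Dict
    import numpy as np
    from scipy import stats

    def drifting_features(reference: Dict[str, np.ndarray],
                          current: Dict[str, np.ndarray],
                          alpha: float = 0.01) -> Dict[str, float]:
        """Run a two-sample Kolmogorov-Smirnov test per feature and return
        the features whose distribution appears to have shifted (p < alpha)."""
        flagged = {}
        for name, ref_values in reference.items():
            _, p_value = stats.ks_2samp(ref_values, current[name])
            if p_value < alpha:
                flagged[name] = p_value
        return flagged

    # Illustrative data: "pressure" has shifted upward, "humidity" has not.
    rng = np.random.default_rng(seed=7)
    reference = {"pressure": rng.normal(1013, 5, 1000), "humidity": rng.uniform(30, 60, 1000)}
    current = {"pressure": rng.normal(1020, 5, 1000), "humidity": rng.uniform(30, 60, 1000)}
    print(drifting_features(reference, current))  # only "pressure" should be flagged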

Closely related is "Concept Drift": a change in the underlying relationship between a model's inputs and the target it predicts, rather than in the input data itself. A classic example is the current pandemic. The "behaviour" or "concept" being modelled has changed - models that predict commute times are no longer valid, and models that forecasted the number of cosmetic surgeries in 2021 no longer hold.
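
Concept drift is hard to catch by looking at input data alone, since the inputs may look unchanged while the input-to-target relationship shifts. One pragmatic signal is the model's own accuracy on recently labelled data. Below is a minimal sketch of a sliding-window accuracy monitor; the baseline, window size, and tolerance are assumed values for illustration, and dedicated detectors such as ADWIN or DDM are more principled alternatives.

    from collections import deque

    class AccuracyDriftMonitor:
        """Flag possible concept drift when rolling accuracy on recent,
        labelled predictions falls well below the accuracy seen at training time."""

        def __init__(self, baseline_accuracy: float, window: int = 500,
                     tolerance: float = 0.10):
            self.baseline = baseline_accuracy
            self.tolerance = tolerance
            self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

        def update(self, prediction, actual) -> bool:
            """Record one labelled prediction; return True if drift is suspected."""
            self.outcomes.append(1 if prediction == actual else 0)
            if len(self.outcomes) < self.outcomes.maxlen:
                return False  # not enough labelled feedback yet
            rolling_accuracy = sum(self.outcomes) / len(self.outcomes)
            return rolling_accuracy < self.baseline - self.tolerance

    # Hypothetical usage: the model was 92% accurate at training time.
    monitor = AccuracyDriftMonitor(baseline_accuracy=0.92)
    # for prediction, actual in labelled_production_stream:  # hypothetical stream
    #     if monitor.update(prediction, actual):
    #         trigger_retraining()  # hypothetical hook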

Most of the hyperscalers provide services that enable us to monitor data drift and take proactive action. The links below give some examples of such cloud services:

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-datasets?tabs=python

https://aws.amazon.com/sagemaker/model-monitor/

https://cloud.google.com/blog/topics/developers-practitioners/event-triggered-detection-data-drift-ml-workflows
