Google
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
Monitor 1: Dependency changes result in notification
How? Make sure that your team is subscribed to and reads announcement lists for all dependencies, and make sure that the team that owns each dependency knows your team is using the data.
Monitor 2: Data invariants hold in training and serving inputs
How? Using the schema constructed in test Data 1, measure whether incoming data matches the schema and alert when it diverges significantly. In practice, careful tuning of alerting thresholds is needed to balance false-positive and false-negative rates so that these alerts remain useful and actionable.
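As a rough illustration, here is a minimal Python sketch of such a check, assuming the schema is a simple per-feature dict of range and vocabulary constraints (the feature names and the 5% threshold are hypothetical and would be tuned in practice):

```python
import logging

# Hypothetical schema, e.g. derived from the schema built for test Data 1.
SCHEMA = {
    "age":     {"min": 0.0, "max": 120.0},
    "country": {"vocab": {"KR", "US", "JP"}},
}

# Threshold tuned to balance false positives against false negatives.
ALERT_THRESHOLD = 0.05  # illustrative value

def violation_rates(batch):
    """Fraction of examples in `batch` (a list of dicts) violating each feature's constraints."""
    rates = {}
    for name, spec in SCHEMA.items():
        violations = 0
        for example in batch:
            value = example.get(name)
            if value is None:
                violations += 1
            elif "min" in spec and not (spec["min"] <= value <= spec["max"]):
                violations += 1
            elif "vocab" in spec and value not in spec["vocab"]:
                violations += 1
        rates[name] = violations / max(len(batch), 1)
    return rates

def alert_on_divergence(batch):
    for name, rate in violation_rates(batch).items():
        if rate > ALERT_THRESHOLD:
            logging.warning("Schema divergence on '%s': %.1f%% of examples", name, rate * 100)
```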
Monitor 3: Training and serving features compute the same values
How? To measure this, it is crucial to log a sample of actual serving traffic. For systems that use serving input as future training data, adding identifiers to each example at serving time will allow direct comparison; the feature values should be perfectly identical at training and serving time for the same example. Important metrics to monitor here are the number of features that exhibit skew, and the number of examples exhibiting skew for each skewed feature.
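A minimal sketch of that comparison, assuming serving traffic is logged as a mapping from example id to feature dict and the same ids can be joined back to the training data (all names here are hypothetical):

```python
from collections import defaultdict

def measure_skew(serving_log, training_examples):
    """Compare feature values logged at serving time against the values used in
    training for the same example id. Returns the number of skewed features and,
    for each skewed feature, how many examples exhibit the skew."""
    skewed_examples = defaultdict(int)
    for example_id, serving_features in serving_log.items():
        training_features = training_examples.get(example_id)
        if training_features is None:
            continue  # example not present in the training data yet
        for name, serving_value in serving_features.items():
            # For the same example, values should be perfectly identical.
            if training_features.get(name) != serving_value:
                skewed_examples[name] += 1
    return len(skewed_examples), dict(skewed_examples)
```

The two return values map directly onto the metrics named above: the number of features that exhibit skew, and the number of skewed examples per skewed feature.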
Monitor 4: Models are not too stale
How? For models that re-train regularly (e.g. weekly or more often), the most obvious metric is the age of the model in production. It is also important to measure the age of the model at each stage of the training pipeline, to quickly determine where a stall has occurred and react appropriately.
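For illustration, a small sketch that checks per-stage model age against a maximum allowed age, assuming each pipeline stage records a completion timestamp (stage names, timestamps, and the weekly cadence are made up):

```python
import time

# Hypothetical unix timestamps written by each stage of the training pipeline.
STAGE_COMPLETED_AT = {
    "data_export": 1_717_200_000,
    "training": 1_717_260_000,
    "validation": 1_717_270_000,
    "pushed_to_serving": 1_717_280_000,
}

MAX_AGE_SECONDS = 7 * 24 * 3600  # e.g. a weekly retraining cadence

def check_staleness(now=None):
    now = now or time.time()
    for stage, completed_at in STAGE_COMPLETED_AT.items():
        age_hours = (now - completed_at) / 3600
        print(f"{stage}: {age_hours:.1f}h old")
        if (now - completed_at) > MAX_AGE_SECONDS:
            print(f"  ALERT: pipeline appears to be stalled at or before '{stage}'")
```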
Monitor 5: The model is numerically stable
How? Explicitly monitor the initial occurrence of any NaNs or infinities. Set plausible bounds for weights and the fraction of ReLU units in a layer returning zero values, and trigger alerts during training if these exceed appropriate thresholds.
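A possible PyTorch sketch of these checks; `WEIGHT_BOUND` and `DEAD_RELU_FRACTION` are placeholder thresholds to be tuned per model:

```python
import torch
import torch.nn as nn

WEIGHT_BOUND = 1e3        # plausible magnitude bound; tune per model
DEAD_RELU_FRACTION = 0.5  # alert if more than half of a layer's ReLU outputs are zero

def check_weights(model: nn.Module) -> None:
    """Alert on the first occurrence of NaN/inf or out-of-bounds weights."""
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            print(f"ALERT: NaN or inf in parameter '{name}'")
        elif param.abs().max() > WEIGHT_BOUND:
            print(f"ALERT: weight magnitude out of bounds in '{name}'")

def watch_dead_relus(model: nn.Module) -> None:
    """Report the fraction of zero activations for every ReLU layer on each forward pass."""
    def hook(module, inputs, output):
        zero_fraction = (output == 0).float().mean().item()
        if zero_fraction > DEAD_RELU_FRACTION:
            print(f"ALERT: {zero_fraction:.0%} of ReLU outputs are zero")
    for module in model.modules():
        if isinstance(module, nn.ReLU):
            module.register_forward_hook(hook)
```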
Monitor 6: The model has not experienced dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage
How? While measuring computational performance is a standard part of any monitoring setup, it is useful to slice performance metrics not just by the versions and components of code, but also by data and model versions. Degradations in computational performance may occur as dramatic changes (for which comparison to the performance of prior versions or time slices can be helpful for detection) or as slow leaks (for which a pre-set alerting threshold can be helpful for detection).
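One way this could look in practice, sketched below with hypothetical latency samples sliced by (code, model, data) version and two illustrative detection rules, one for slow leaks and one for dramatic changes:

```python
import statistics

# Hypothetical serving-latency samples (ms), sliced by (code, model, data) version.
LATENCY_MS = {
    ("v1.4", "model-0501", "data-0430"): [41, 43, 40, 44],
    ("v1.4", "model-0508", "data-0507"): [55, 57, 61, 58],
}

SLOW_LEAK_THRESHOLD_MS = 50   # pre-set absolute threshold for slow leaks
DRAMATIC_CHANGE_RATIO = 1.25  # alert when a slice is 25% slower than the previous one

def detect_regressions():
    previous = None
    for slice_key, samples in LATENCY_MS.items():
        median = statistics.median(samples)
        if median > SLOW_LEAK_THRESHOLD_MS:
            print(f"ALERT (slow leak): {slice_key} median {median}ms over threshold")
        if previous is not None and median > previous * DRAMATIC_CHANGE_RATIO:
            print(f"ALERT (dramatic change): {slice_key} median {median}ms vs previous {previous}ms")
        previous = median
```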
Monitor 7: The model has not experienced a regression in prediction quality on served data
How? There are several ways to verify that served prediction quality has not degraded due to changes in data, differing codepaths, etc.
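As one illustrative option (not the paper's prescribed method), the sketch below tracks the mean served prediction against a baseline computed offline; the baseline value and drift threshold are hypothetical:

```python
import statistics

# Baseline established offline, e.g. the mean prediction on a validation slice.
BASELINE_MEAN_PREDICTION = 0.12  # hypothetical value
MAX_RELATIVE_DRIFT = 0.2         # alert on more than 20% drift from the baseline

def check_served_predictions(served_predictions):
    """Compare the mean served prediction against the offline baseline; a large
    drift often points to changed data or a diverging serving codepath."""
    mean_prediction = statistics.fmean(served_predictions)
    drift = abs(mean_prediction - BASELINE_MEAN_PREDICTION) / BASELINE_MEAN_PREDICTION
    if drift > MAX_RELATIVE_DRIFT:
        print(f"ALERT: mean served prediction {mean_prediction:.3f} drifted "
              f"{drift:.0%} from baseline {BASELINE_MEAN_PREDICTION}")
```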
| ML-related | Ops-related |
|---|---|
| Input Data Distribution | Request Latency |
| Feature Distribution | Request Error Rate |
| Output Data Distribution | CPU, Memory Utilization |
| Performance (Evaluation) | Disk I/O |
| Model Stability | Network Traffic |
| ... | ... |
+) Google also proposes four golden signals for traditional software monitoring (see the sketch after the list):
- Latency - the time it takes for a user request to receive a response
- Traffic - the total amount of traffic the system has to handle
- Errors - the fraction of user requests that fail
- Saturation - how "full" (saturated) the system is
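A toy sketch of computing the four signals over one monitoring window from hypothetical request records, using CPU utilization as a stand-in for saturation:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    failed: bool

def golden_signals(requests, window_seconds, cpu_utilization):
    """Compute the four golden signals over one monitoring window.
    `cpu_utilization` stands in for saturation; real systems would also
    look at memory, disk I/O, queue depth, etc."""
    if not requests:
        return {"latency_p99_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}
    latencies = sorted(r.latency_ms for r in requests)
    p99 = latencies[max(int(len(latencies) * 0.99) - 1, 0)]
    return {
        "latency_p99_ms": p99,                                          # Latency
        "traffic_rps": len(requests) / window_seconds,                  # Traffic
        "error_rate": sum(r.failed for r in requests) / len(requests),  # Errors
        "saturation": cpu_utilization,                                  # Saturation
    }
```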