Fisher Score-Based Concept Drift Monitoring and Diagnosis with Applications to Microstructure Nonstationarity Analysis and Deep Learning Segmentation for Multiphase Materials

Zhang, Kungang

doi:https://doi.org/10.21985/n2-1gpw-0z61

Supervised learning is one of the most fundamental paradigms in machine learning. A supervised model can deliver powerful predictions by learning complex patterns hidden in many, sometimes thousands of, predictors. It can also serve as a building block for other machine learning tasks, such as unsupervised learning and reinforcement learning. This power originates from the labels that serve as ground truth in the training data set and from the many regularization techniques that prevent overfitting. In many applications, the amount of data has grown dramatically thanks to recent advances in technology and data collection techniques, and such ``big data'', combined with enormous computational resources, enables complex and flexible models to be trained with high predictive power on testing data. All of this desirable performance rests on one condition: the predictive relationship must be stable enough to be learned effectively by supervised learning models. In real data sets, however, nonstationarity of the predictive relationship occurs frequently. The studies in this thesis all connect to the theme of quantitatively analyzing (i.e., monitoring, diagnosing, and interpreting) nonstationarity in the predictive relationship in data sets. In Chapter 2, the effect of nonstationarity of the predictive relationship over time (also known as \textit{concept drift}) on the performance of trained models is investigated. Approaches for monitoring and diagnosing such nonstationarity with Fisher score vectors for parametric models are proposed and supported by theory and empirical performance.
More specifically, under fairly general conditions, Fisher score vectors have the following property: when the predictive relationship in the testing data is stationary with respect to the training data set, the score vectors of the testing data have zero mean, whereas when the testing data become nonstationary, the score vectors have nonzero mean. The popular error-based method does not enjoy this desirable property. A simple logistic regression example shows that the error rate can stay the same even though concept drift has occurred, while the score-based method will always signal. Large-scale Monte Carlo simulations and two real data sets also show that the score-based method is more sensitive to concept drift across many different models, linear or nonlinear. For example, in monitoring concept drift in credit card default data around the subprime crisis of $2007$--$2008$, the score-based method signals around $9$ months earlier than the error-based method, which signals only when the stock market started to crash at the end of $2007$. Besides higher monitoring sensitivity, the score-based method also offers interpretability in diagnosing the origin of concept drift. Two ways to decouple the concept drift from the score vectors are derived, and their usage is demonstrated on simulated and real data sets. In summary, Fisher score vectors have many advantages over the popular error-based method in monitoring and diagnosing nonstationarity of the predictive relationship, and their superior performance is consistent over many simulated and real data sets and across many different models. In Chapter 3, instead of analyzing nonstationarity of the predictive relationship over time, the study focuses on nonstationarity monitoring and diagnostics in stochastic microstructures of materials, that is, in the spatial domain.
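The zero-mean property of the score vectors described for Chapter 2 can be illustrated with a small simulation. For a correctly specified model with parameter $\theta$, the expected score $E[\nabla_\theta \log p(y \mid x; \theta)]$ is zero under the usual regularity conditions, so a nonzero average score on new data suggests that the predictive relationship has changed. The sketch below is a minimal illustration under stated assumptions, not the monitoring procedure of the thesis: the logistic model, the coefficient values, the sample sizes, and the Hotelling-type statistic in `t2_statistic` are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression fitted by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                         # gradient of the log-likelihood
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def score_vectors(X, y, beta):
    """Per-observation gradient of the log-likelihood with respect to beta."""
    return (y - sigmoid(X @ beta))[:, None] * X

def simulate(rng, n, beta):
    """Hypothetical data: intercept plus two standard normal predictors."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = (rng.random(n) < sigmoid(X @ beta)).astype(float)
    return X, y

def t2_statistic(scores, cov):
    """n * mean' cov^{-1} mean: small when the mean score is near zero."""
    m = scores.mean(axis=0)
    return scores.shape[0] * float(m @ np.linalg.solve(cov, m))

rng = np.random.default_rng(0)
beta_train = np.array([0.5, 1.0, -1.0])   # assumed training relationship
beta_drift = np.array([0.5, -1.0, 1.0])   # assumed relationship after drift

# Fit on training data and estimate the score covariance there.
X_tr, y_tr = simulate(rng, 20000, beta_train)
beta_hat = fit_logistic(X_tr, y_tr)
cov = np.cov(score_vectors(X_tr, y_tr, beta_hat), rowvar=False)

# Score new data from the same relationship and from a drifted one.
X_st, y_st = simulate(rng, 5000, beta_train)
X_dr, y_dr = simulate(rng, 5000, beta_drift)
t2_stationary = t2_statistic(score_vectors(X_st, y_st, beta_hat), cov)
t2_drifted = t2_statistic(score_vectors(X_dr, y_dr, beta_hat), cov)

print("T2 on stationary test data:", t2_stationary)
print("T2 on drifted test data:   ", t2_drifted)
```

In this setup the statistic on the stationary test set should be of the order of the parameter dimension, while the statistic on the drifted set should be far larger. A sequential monitoring scheme would track the scores over time rather than comparing two fixed batches.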
A popular and effective way of modeling stochastic microstructures of materials is to treat each image sample (called a micrograph) as a realization of some underlying random process. In this study, parametric supervised learning models are trained to model the distribution of the random process. Following the intuition gained in Chapter 2, the Fisher score vector calculated from the trained model for each pixel implicitly carries information about the microstructure in the vicinity of that pixel and can be used to quantify potential nonstationarity in multiphase materials. Based on this insight, two approaches, nonstationarity monitoring and nonstationarity diagnostics, are developed for quality control and materials characterization applications and tested on a variety of simulated and real stochastic microstructures of materials. The results show that the score-based methods are powerful in 1) distinguishing different stochastic microstructures that are hard to characterize and separate with traditional descriptor-based methods; and 2) diagnosing nonstationarity of multiphase materials without prior knowledge of the stationary phase. In Chapter 4, the power of the score-based methods shown in Chapter 3 is further exploited and enhanced through the synergy between score-based nonstationarity diagnostics and state-of-the-art deep learning models.
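The per-pixel use of score vectors described for Chapter 3 can be sketched on synthetic binary images. The example below is a toy illustration under strong assumptions, not the thesis's model: the microstructures are hypothetical thresholded noise fields generated by `texture`, the supervised model is a plain logistic regression that predicts each pixel from four previously scanned neighbours, and the summary statistic is the same Hotelling-type quantity used for batches of scores. All function names and parameter values are invented for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smooth(field, k):
    """Average over a (2k+1) x (2k+1) window with periodic wrap-around."""
    out = np.zeros_like(field)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out += np.roll(field, (dy, dx), axis=(0, 1))
    return out / (2 * k + 1) ** 2

def texture(rng, shape, k):
    """Hypothetical two-level microstructure: thresholded smoothed noise."""
    return (smooth(rng.normal(size=shape), k) > 0).astype(float)

def causal_design(img):
    """Features for each interior pixel: intercept, left, up, up-left, up-right."""
    y = img[1:, 1:-1]
    neighbours = [img[1:, :-2], img[:-1, 1:-1], img[:-1, :-2], img[:-1, 2:]]
    X = np.stack([np.ones_like(y)] + neighbours, axis=-1)
    return X, y  # shapes (H-1, W-2, 5) and (H-1, W-2)

def fit_logistic(X, y, n_iter=30):
    """Maximum-likelihood logistic regression fitted by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def pixel_scores(img, beta):
    """Score vector (log-likelihood gradient) at every interior pixel."""
    X, y = causal_design(img)
    return (y - sigmoid(X @ beta))[..., None] * X

def t2(scores, cov):
    """Hotelling-type statistic of the mean score over a region."""
    s = scores.reshape(-1, scores.shape[-1])
    m = s.mean(axis=0)
    return s.shape[0] * float(m @ np.linalg.solve(cov, m))

rng = np.random.default_rng(1)

# Train on a reference micrograph of the coarse texture (k=2).
reference = texture(rng, (128, 128), k=2)
X_ref, y_ref = causal_design(reference)
beta_hat = fit_logistic(X_ref.reshape(-1, 5), y_ref.ravel())
cov = np.cov(pixel_scores(reference, beta_hat).reshape(-1, 5), rowvar=False)

# Test image: left half is the same texture, right half is a finer one (k=0).
same = texture(rng, (128, 128), k=2)
other = texture(rng, (128, 128), k=0)
test_img = np.hstack([same[:, :64], other[:, :64]])

scores = pixel_scores(test_img, beta_hat)
t2_left = t2(scores[:, :63], cov)    # pixels in the reference-like half
t2_right = t2(scores[:, 63:], cov)   # pixels in the different half

print("T2, left half (same texture):     ", t2_left)
print("T2, right half (different texture):", t2_right)
```

The region whose texture differs from the reference should produce a much larger statistic than the region that matches it. Because neighbouring pixel scores are spatially dependent, the statistic is only a relative indicator here and should not be compared against a chi-square threshold.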
More specifically, this study proposes a three-step iterative framework. In Step $1$, the score-based nonstationarity diagnostics method is applied to extract homogeneous/stationary phases from materials micrographs (potentially containing multiple phases) with reasonable accuracy. In Step $2$, a classification convolutional neural network is trained to archive the extracted materials phases, providing utility tools for experts to retrieve and rank those phases. In Step $3$, using the microstructures and labels learned in the previous two steps, a powerful segmentation network is trained to segment the different phases of multiphase materials with higher accuracy and less computation than Step $1$. Finally, the three steps are iterated with more incoming micrographs to expand knowledge by building up the data repository of materials microstructures and improving the performance of the models trained in this framework. The framework is flexible in the sense that it can be used as an integrated system, or each step can be applied independently provided that the necessary inputs are available. The advantage of using the entire framework is that the score-based nonstationarity diagnostics method provides an easy and low-cost way of labeling pixels of different phases/microstructures of materials, and the subsequent infrastructure (i.e., the convolutional neural networks for classification and segmentation) in combination provides a powerful toolchain to manage and analyze those multiphase materials. Various techniques for accelerating training and improving the generalization performance of deep learning models are developed and demonstrated. The proposed framework achieves very high segmentation accuracy on a diverse set of artificial and real materials data sets, supporting its potential for solving characterization problems in materials science. Finally, in Chapter 5, several future directions extending the studies in this thesis are discussed.

Creator