Short description: Data mining and machine learning techniques, including Bayesian and neural networks, for diagnosis/prognosis applications in meteorology and climate.
Data mining is the process of extracting nontrivial and potentially useful information, or knowledge, from the enormous data sets available in the experimental sciences (historical records, reanalysis, GCM simulations, etc.). It provides explicit information in a readable form that can be used to solve diagnosis, classification or forecasting problems. Traditionally, these problems were solved by direct hands-on data analysis using standard statistical methods, but the increasing volume of data has motivated the study of automatic data analysis with more sophisticated tools that operate directly on the data. In this sense, data mining identifies trends in the data that simple analysis would miss. Modern data mining techniques (association rules, decision trees, Gaussian mixture models, regression algorithms, neural networks, support vector machines, Bayesian networks, etc.) are used in many domains to solve association, classification, segmentation, diagnosis and prediction problems.
Among the different data mining algorithms, probabilistic graphical models (in particular Bayesian networks) are a sound and powerful methodology grounded in probability and statistics. They allow building tractable joint probabilistic models that represent the relevant dependencies among a set of variables (hundreds of variables in real-life applications), and the resulting models allow for efficient probabilistic inference. For example, a Bayesian network could represent the probabilistic relationships between large-scale synoptic fields and local observation records, providing a new methodology for probabilistic downscaling; that is, it allows computing P(observation | large-scale prediction). For instance, the red dots in the figure below correspond to the grid nodes of a GCM, whereas the blue dots correspond to a network of stations with historical records (the links show the relevant dependencies, automatically discovered from data).
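As a minimal sketch of the downscaling idea described above, the following Python example builds a toy discrete Bayesian network by hand and computes P(observation | large-scale prediction) by brute-force enumeration. Everything in it is an assumption made for illustration: the three binary variables (two hypothetical grid nodes G1, G2 and one station S), the graph structure, and all probability values are invented, not learned from data. Enumeration is used only because the network is tiny; real applications with hundreds of variables rely on more efficient inference algorithms.

```python
from itertools import product

# Hypothetical toy network (illustrative only, not fitted to any data):
#   G1 -> G2, G1 -> S, G2 -> S
# G1, G2: large-scale state at two GCM grid nodes; S: local station record.
# All variables are binary: 0 = dry, 1 = wet.
NAMES = ("G1", "G2", "S")

P_G1 = {0: 0.7, 1: 0.3}                                      # P(G1)
P_G2_GIVEN_G1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # P(G2 | G1), indexed [g1][g2]
P_S_GIVEN_G = {                                              # P(S | G1, G2), indexed [(g1, g2)][s]
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.6, 1: 0.4},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.2, 1: 0.8},
}


def joint(g1, g2, s):
    """Joint probability as the product of each node's conditional table."""
    return P_G1[g1] * P_G2_GIVEN_G1[g1][g2] * P_S_GIVEN_G[(g1, g2)][s]


def conditional(query, evidence):
    """P(query | evidence) by summing the joint over all consistent states.

    Both arguments are dicts mapping variable names to values.
    """
    numerator = 0.0
    denominator = 0.0
    for values in product((0, 1), repeat=len(NAMES)):
        state = dict(zip(NAMES, values))
        if any(state[name] != value for name, value in evidence.items()):
            continue  # state contradicts the evidence
        p = joint(*values)
        denominator += p
        if all(state[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator


# Downscaling (prognosis): probability of a wet station given a wet grid node G1.
p_wet = conditional({"S": 1}, {"G1": 1})
print(round(p_wet, 2))  # 0.68

# Diagnosis: probability that G1 was wet given a wet observation at the station.
p_g1 = conditional({"G1": 1}, {"S": 1})
```

The same `conditional` function answers both forecasting queries (evidence on the large-scale nodes) and diagnosis queries (evidence on the station), which is the practical appeal of working with a joint probabilistic model rather than a one-directional regression.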
Formally, Bayesian networks are directed acyclic graphs whose nodes represent variables and whose arcs encode direct dependencies between them (equivalently, the missing arcs encode conditional independencies). The graph provides an intuitive description of the dependency model and defines a simple factorization of the joint probability distribution, P(x1, ..., xn) = P(x1 | π1) ··· P(xn | πn), where πi denotes the parents of xi in the graph. This leads to a tractable model that is compatible with the encoded dependencies. Efficient algorithms exist to learn both the graphical and the probabilistic models from data, thus allowing for the automatic application of this methodology to complex problems. Bayesian networks that model sequences of variables (for example, time series of historical records) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.
Neural networks, on the other hand, are nonlinear models inspired by the functioning of the brain, with different architectures designed for different problems. For example, multi-layer perceptrons are regression-like algorithms that build a deterministic model y=f(x) relating a set of predictors, x, to a set of predictands, y (figure below, left). Self-Organizing Maps (SOM) are competitive networks designed for clustering and visualization purposes (figure below, right).
Key Reading (methods):
Key Reading (applications in Meteorology):
Activities of the Santander Meteorology Group:
People: Herrera, S., Gutiérrez, J.M., Cofiño, A.S., Sordo, C.M., San-Martín, D., Bedia, J.