Usage of symbolic data analysis in modelling with SHARE data: Some examples
Symbolic data analysis (SDA) is a family of statistical approaches that aggregate data into new, multidimensional types of variables, described either by their bounds (for example, intervals) or by statistical moments and distributions (for example, histograms). In the lecture we will first look at the history of these methods and their basic characteristics. Then, using data from the Survey of Health, Ageing and Retirement in Europe (SHARE), we will discuss two new statistical contributions based on SDA that we have presented over the past year.
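As a rough illustration of the aggregation step, the sketch below turns groups of individual observations into an interval-valued and a histogram-valued description. The group names, the ages and the bin edges are hypothetical values chosen for illustration and are not taken from SHARE; the helper functions `to_interval` and `to_histogram` are likewise assumptions of this sketch, not part of any SDA library.

```python
from bisect import bisect_right


def to_interval(values):
    """Describe a group of observations by its bounds (min, max)."""
    return (min(values), max(values))


def to_histogram(values, edges):
    """Describe a non-empty group by relative frequencies over shared bins.

    Bins are [e0, e1), [e1, e2), ...; the last bin is closed on the right.
    Observations outside [edges[0], edges[-1]] are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if not edges[0] <= v <= edges[-1]:
            continue
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical micro-data: ages of respondents in two illustrative groups.
ages = {
    "group_A": [52, 58, 61, 67, 73, 80],
    "group_B": [50, 55, 56, 64, 70, 71, 88],
}
edges = [50, 60, 70, 80, 90]  # assumed common bin edges

# One symbolic observation per group instead of one row per respondent.
symbolic = {
    name: {"interval": to_interval(v), "histogram": to_histogram(v, edges)}
    for name, v in ages.items()
}
```

Each group is now a single observation whose value is an interval or a histogram rather than a number, which is the kind of unit that SDA methods operate on.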
The first contribution concerns regression models for distributional symbolic data. Such models already exist and are the work of three groups of authors. The founders of SDA, Lynne Billard and Edwin Diday, presented a basic regression model that follows the principles of conventional linear regression (Billard and Diday, 2006). Sonia Dias and Paula Brito extended and corrected the model to handle possible negative coefficients in quantile functions (Dias and Brito, 2011). Finally, Antonio Irpino and Rosanna Verde adapted the model to the decomposition of the Wasserstein distance, dividing it into a part that estimates the effects of the mean values of the independent variables and a part that estimates the effects of their variance (Irpino and Verde, 2012). However, such regression models have never before been used in causal inference. We have extended the models to account for endogeneity in the variables and developed new, specific estimators for all three models, based on the two-stage least squares (2SLS) method for quantile functions. In the presentation we will describe the statistical properties of the new estimators and their use in assessing the impact of retirement decisions on various health indicators.
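The decomposition of the squared L2 Wasserstein distance between two distributions into a part driven by their means and parts driven by their variability can be checked numerically. The sketch below is only an illustration under stated assumptions: quantile functions are approximated on an evenly spaced grid of probability levels, the two samples are hypothetical, and the function names are invented here. It shows the distance decomposition itself, not the regression estimators discussed in the lecture.

```python
from statistics import fmean, pstdev


def quantile_grid(sample, m=200):
    """Empirical quantile function evaluated at t_k = (k + 0.5) / m."""
    xs = sorted(sample)
    n = len(xs)
    return [xs[min(int((k + 0.5) / m * n), n - 1)] for k in range(m)]


def wasserstein_sq(q1, q2):
    """Squared L2 Wasserstein distance: mean squared gap between quantiles."""
    return fmean([(a - b) ** 2 for a, b in zip(q1, q2)])


def wasserstein_decomposition(q1, q2):
    """Split the squared distance into location, size and shape parts."""
    m1, m2 = fmean(q1), fmean(q2)
    s1, s2 = pstdev(q1), pstdev(q2)
    cov = fmean([(a - m1) * (b - m2) for a, b in zip(q1, q2)])
    rho = cov / (s1 * s2)  # correlation between the two quantile functions
    location = (m1 - m2) ** 2        # difference in means
    size = (s1 - s2) ** 2            # difference in standard deviations
    shape = 2 * s1 * s2 * (1 - rho)  # remaining difference in shape
    return location, size, shape


# Hypothetical samples standing in for two distribution-valued observations.
q_a = quantile_grid([1, 2, 3, 4, 5, 6, 7, 8])
q_b = quantile_grid([2, 4, 4, 5, 9, 10, 12, 15])

total = wasserstein_sq(q_a, q_b)
location, size, shape = wasserstein_decomposition(q_a, q_b)
```

On the grid, `location + size + shape` reproduces `total` up to floating-point error, which is why effects on the mean and effects on the variability can be treated as separate components.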
In the last part of the presentation we will briefly present another paper of ours, which uses a new type of symbolic variable that has barely been explored in the literature: polygonal variables (to our knowledge, the only article so far to address this type of variable is Silva et al., 2019). A polygonal variable is constructed as a polygon with any number of vertices, where the first moments of the distribution of the underlying variable, its mean and variance, form the basis for the conversion. We will briefly outline the first cluster analysis procedure for this type of variable, applied to the classification of country regimes according to the characteristics of health care use by the elderly, again using the SHARE database.
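A minimal sketch of how a polygonal variable might be built from the mean and variance of an underlying variable is given below. Several choices are assumptions made here for illustration and may differ from the convention in Silva et al. (2019) or in our paper: the polygon is regular, its centre is placed at (mean, mean), and its radius is `scale` times the standard deviation with `scale = 2`. The function name `to_polygon` and the data are hypothetical.

```python
import math
from statistics import fmean, pstdev


def to_polygon(values, n_vertices=5, scale=2.0):
    """Return the vertices of a regular polygon summarising `values`.

    Assumptions of this sketch: centre at (mean, mean) and
    radius = scale * standard deviation.
    """
    centre = fmean(values)
    radius = scale * pstdev(values)
    return [
        (
            centre + radius * math.cos(2 * math.pi * k / n_vertices),
            centre + radius * math.sin(2 * math.pi * k / n_vertices),
        )
        for k in range(n_vertices)
    ]


# Hypothetical observations for one unit (for example, one country).
observations = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
polygon = to_polygon(observations, n_vertices=5)
```

Under these assumptions the location of the polygon carries the mean and its size carries the variability, while the number of vertices is a free choice of the analyst.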
The presentation will be based on the following two contributions:
Srakar, Andrej, Prevolnik Rupel, Valentina, Bartolj, Tjaša. Program evaluation and causal inference for distributional and functional data: estimation of the effects of retirement on health outcomes. In: Mineo, Angelo M. (ed.), Augugliaro, Luigi (ed.). EMS 2019: program and book of abstracts. [S. l.]: Bernoulli Society for Mathematical Statistics and Probability. 2019, p. 227.
Srakar, Andrej, Vecco, Marilena, Kejžar, Nataša. Entrepreneurial regimes classification: a symbolic polygonal clustering approach. 16th Conference of the International Federation of Classification Societies (IFCS), Thessaloniki, Greece, 26-29 August 2019.