A Comparative Study on the Spatial Statistical Models for the Estimation of Population Distribution
Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography. 2015. Jun, 33(3): 145-153
Copyright © 2015, Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : April 04, 2015
  • Accepted : June 06, 2015
  • Published : June 30, 2015
About the Authors
Doo-Ri Oh
Member, Dept. of Geography, Kyung Hee University (E-mail:droh@khu.ac.kr)
Chul Sue Hwang
Corresponding Author, Member, Dept. of Geography, Kyung Hee University (E-mail:hcs@khu.ac.kr)
Abstract
This study aims to estimate population distribution at a finer spatial resolution than administrative units using an RK (Regression-Kriging) model. The RK model is an areal interpolation technique that combines linear regression with the Kriging model. To estimate the population distribution in sample regions, four models were applied and compared: a regression model, the RK model, an OK (Ordinary Kriging) model, and a CK (Co-Kriging) model. Accuracy and validity were evaluated using the RMSE (Root Mean Square Error), MAE (Mean Absolute Error), G statistic, and correlation coefficient (ρ). In the sample regions, the RK model showed better results than the other models on every statistic. The results of this comparative study will be useful for estimating the population distribution of metropolitan areas with high population density.
1. Introduction
Many research problems require spatial data at a finer resolution than enumeration units such as administrative boundaries. Aggregated statistical data such as census data provide a summarized description, but they can also invite misinterpretation. Openshaw (1984) argues that statistical results can be biased because they are summary values over arbitrary areal units, which is known as the Modifiable Areal Unit Problem (MAUP). Another problem with census data is the assumption that population is homogeneous within each area (Monmonier and Schnell, 1984; Langford and Unwin, 1994; Eicher and Brewer, 2001). Dasymetric mapping is one way to interpolate census data because it allows a realistic representation of population (Eicher and Brewer, 2001; Mennis, 2003; Mennis and Hultgren, 2006).
Dasymetric mapping, the theoretical foundation for this study, has been studied extensively. Researchers generally consider it one of the areal interpolation methods (Lee and Kim, 2007). From a mathematical perspective, it has mainly been treated as a weighting method for interpolating population onto the target area. From a data perspective, the interest lies in the use of higher resolution ancillary data. Several kinds of ancillary data have been used for dasymetric interpolation mapping, such as residential areas extracted from high resolution satellite images (Ku, 2008; Wu and Murray, 2005), cadastral map data (Maantay et al., 2007; Tapp, 2010), street and road data (Reibel and Bufalino, 2005), downtown building data (Lwin and Murayama, 2009), and point data of individual residences (Zandbergen, 2011). However, such high resolution data are available only for certain areas, and a study cannot be conducted where no data exist for the study area. Weighting methods used in dasymetric mapping, such as areal weighting (Mennis and Hultgren, 2006), population proportion, and regression analysis (Flowerdew and Green, 1992; Reibel and Agrawal, 2007), also have limitations: they may not reflect the characteristics of spatial data, or they may produce larger estimation errors. To reduce the estimation error, this study used different statistical models to calculate the weights for estimating the population distribution. The originality of this study is that both the RK model and Kriging models are applied and compared for population estimation in the study area. Because previous work (Wu and Murray, 2005) used the CK model for population estimation in urban areas, we used the same model, and we added the OK and regression models to compare the differences among the models.
This study aims to estimate the population distribution at a finer spatial resolution using geo-statistical methods. To do this, we applied four geo-statistical models and compared the results with each other. The four models are (1) a regression model, (2) a Regression-Kriging (RK) model, (3) an Ordinary Kriging (OK) model, and (4) a Co-Kriging (CK) model. We investigate the applicability of the RK model to estimating the population of metropolitan areas with high population density, because the RK model is suitable for predicting subtle changes. In this respect, we considered the properties of, and the relationship between, the size of the residential area and the population size in the study area. We selected study areas suited to population estimation with the RK model by testing samples several times. Both study areas, Dongdaemun-gu and Jungnang-gu (Case 1) and Mapo-gu and Seodaemun-gu (Case 2), have relatively high population density within Seoul, Korea.
We used land use data of the 2000s from the NGII (National Geographic Information Institute) as the ancillary data, or target data, in the dasymetric mapping. Two types of residential area from the land use data were used as independent variables in the regression analysis. Using the categories in the land use data, we assumed that population is distributed only in residential areas. Residential zones are divided into two classes: high population density areas (buildings above the 5th floor and subsidiary facilities) and low population density areas (buildings below the 5th floor and subsidiary facilities). The smallest administrative areal unit (dong) from the 2000 census data was used as the source data. Statistical analyses were performed in R 2.15.2, and ArcGIS 10 was used for the spatial analysis and mapping.
2. Approach
- 2.1 Dasymetric interpolation mapping with RK model
The RK model combines a linear regression model, which analyzes the overall drift/trend between the variables, with a Kriging model, which interpolates the regression residuals (Hengl et al., 2004; Hengl et al., 2007). The RK model can be written as Eq. (1).
$$\hat{z}(s_0) = \hat{m}(s_0) + \hat{e}(s_0) \tag{1}$$
where $\hat{z}(s_0)$ is the predicted population density at the location $s_0$, $\hat{m}(s_0)$ is the predicted value from the regression model, and $\hat{e}(s_0)$ is the predicted value using the Kriging model. This equation can also be expressed as Eq. (2).
$$\hat{z}(s_0) = \sum_{k=0}^{p} \hat{\beta}_k \, q_k(s_0) + \sum_{i=1}^{n} \omega_i \, e(s_i) \tag{2}$$
where $\hat{\beta}_k$ is the estimated regression coefficient, $q_k$ is the independent variable (with $q_0(s_0) = 1$ for the intercept), $\omega_i$ is the Kriging weight and $e(s_i)$ is the residual by the regression model at the location $s_i$.
The RK model proceeds through the following steps:
Step 1. Correlation coefficients between combinations of land use classes and population density are computed, and the combination with the highest correlation coefficient is selected.
Step 2. Ordinary least squares (OLS) multiple regression is used to generate the population density surface. If the regression residuals show spatial autocorrelation, a generalized least squares (GLS) regression analysis can supplement the OLS multiple regression.
Step 3. The values from the regression analysis and the kriged residuals are added; the sum is the estimated population distribution.
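Steps 2 and 3 can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' R implementation: the function names, the exponential variogram in its practical-range form, and the variogram parameters (`nugget`, `sill`, `vrange`) are all hypothetical, the trend is fitted by OLS rather than GLS, and the variogram is assumed rather than fitted to the residuals. The toy data are synthetic.

```python
import numpy as np

def exponential_variogram(h, nugget, sill, vrange):
    """Exponential semivariogram with gamma(0) = 0 (practical-range form, an assumption)."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / vrange))
    return np.where(h == 0.0, 0.0, gamma)

def ordinary_kriging_weights(coords, s0, nugget, sill, vrange):
    """Solve the ordinary kriging system; the returned weights sum to 1."""
    n = len(coords)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exponential_variogram(dists, nugget, sill, vrange)
    A[n, n] = 0.0  # corner entry for the Lagrange multiplier
    b = np.ones(n + 1)
    b[:n] = exponential_variogram(np.linalg.norm(coords - s0, axis=1),
                                  nugget, sill, vrange)
    return np.linalg.solve(A, b)[:n]

def regression_kriging(coords, Q, y, s0, q0, nugget, sill, vrange):
    """Return (prediction, trend, weights) at location s0 with covariates q0."""
    X = np.column_stack([np.ones(len(y)), Q])      # q_0 = 1 gives the intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # Step 2: OLS trend
    residuals = y - X @ beta                       # e(s_i) in Eq. (2)
    trend = float(np.concatenate(([1.0], q0)) @ beta)
    w = ordinary_kriging_weights(coords, s0, nugget, sill, vrange)
    return trend + float(w @ residuals), trend, w  # Step 3: trend + kriged residual

# Synthetic toy data (not from the study): 12 zones, two residential-area covariates.
gen = np.random.default_rng(0)
coords = gen.uniform(0.0, 10.0, size=(12, 2))
Q = gen.uniform(0.0, 1.0, size=(12, 2))
y = 5.0 + 3.0 * Q[:, 0] + 1.0 * Q[:, 1] + gen.normal(0.0, 0.3, size=12)
pred, trend, w = regression_kriging(coords, Q, y, np.array([5.0, 5.0]),
                                    np.array([0.5, 0.5]),
                                    nugget=0.0, sill=0.1, vrange=6.0)
print(pred, trend, w.sum())
```

Because ordinary kriging is an exact interpolator, predicting at an observed location with that location's covariates returns the observed value, which is a convenient sanity check for a sketch like this.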
- 2.2 Dasymetric interpolation mapping with Kriging model
To perform the OK and CK models, the covariance matrix and the variogram are calculated from a primary variable (census data) for the OK model, and from a primary variable and a secondary variable (residential area information) for the CK model. The primary and secondary variables used in the CK model are spatially interrelated. Among the theoretical semi-variograms, the spherical model showed the best fit to the semi-variogram parameters in every case in this study. The population weights used in the OK and CK models are derived from the semi-variogram model. Refer to Choe (2007) and Knotters et al. (1995) for further explanation of the models. In this study, we performed Kriging on population distribution as a continuous variable and reaggregated the population values within each residential zone. In other words, the prediction for each grid cell is the arithmetic mean of the estimated population density at all points in that cell.
3. Case Studies
- 3.1 A case study with the RK model
We used the census data and the binary residential areas of the land use data to carry out the dasymetric mapping with the RK model. After a correlation analysis between the census data and the binary residential areas to determine the regression parameters, regression analysis was performed to generate the trend surface of the population density. Based on the Durbin-Watson test, GLS regression was used instead of OLS regression. Minimizing the AIC (Akaike Information Criterion) can serve as a criterion for model selection, and we applied the AIC to the spherical, exponential, and Gaussian models. Refer to Akaike (1977) and Eldeiry and Garcia (2010) for further explanation of the AIC. Fig. 1 and Fig. 2 show the theoretical semi-variograms of the residuals, which were fitted with an exponential model in Case 1 and a spherical model in Case 2. Table 1 provides the estimated coefficients by type of regression model and the AIC values for model selection in Case 1. The GLS residuals were interpolated with the Kriging model, which gives locally unbiased predictions with minimum variance. The estimated population density is the sum of the results of the regression model and the Kriging model.
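The two diagnostics mentioned above can be illustrated with a short Python sketch. It is only an approximation of the workflow described here: the function names are hypothetical, the Durbin-Watson statistic depends on the order in which residuals are listed (the ordering used for spatial units in the study is not stated), and the AIC is written in its common least-squares form, which equals the likelihood-based AIC only up to an additive constant under Gaussian errors.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 means no first-order autocorrelation,
    values toward 0 suggest positive and toward 4 negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def aic_least_squares(rss, n, k):
    """Least-squares AIC, n * ln(RSS / n) + 2k, for n observations and k parameters."""
    return float(n * np.log(rss / n) + 2 * k)

# Illustrative residuals only (not from the study).
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips: negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]        # identical values: positive autocorrelation
print(durbin_watson(alternating), durbin_watson(constant))
```

When candidate models are fitted to the same data, the one with the smallest AIC is preferred; only differences in AIC between models are meaningful, not the absolute values.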
Fig. 1. The theoretical semi-variogram, (a) exponential model, (b) spherical model, and (c) Gaussian model of the regression residuals of the Case 1
Fig. 2. The theoretical semi-variogram, (a) exponential model, (b) spherical model, and (c) Gaussian model of the regression residuals of the Case 2
Table 1. Regression coefficients and AIC of the Case 1
Significance level: ***: 0, **: 0.05, *: 0.1
- 3.2 A case study with the Kriging model
We applied the OK model and the CK model to estimate the population distribution in the study area using Kriging. The OK model needs population data as the primary variable, whereas the CK model additionally requires a secondary variable that is spatially correlated with the primary variable. A grid was created from the midpoints of the smallest administrative areal units to calculate the Kriging weights.
To model the population density using Kriging, a covariance model has to be specified for the primary variable, the secondary variable, and the cross variable. Using the covariance model, we ran the semi-variogram experiments several times and obtained the parameters of the semi-variogram. The semi-variogram provides the nugget, sill, and range needed to perform the Kriging. Using these parameters, we selected the best theoretical semi-variogram model. Fig. 3, Fig. 4, and Fig. 5 show the theoretical semi-variograms of the census data, of the residential area data, and the cross-variogram between census and residential area, respectively, for Case 1. We followed the same procedure for Case 2. The population weights used in the OK and CK models are derived from the semi-variogram.
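As a rough illustration of how the nugget, sill, and range enter the three candidate models, the following Python sketch computes an empirical semi-variogram and evaluates spherical, exponential, and Gaussian models against it. The function names are hypothetical, the exponential and Gaussian models use the practical-range convention (factor 3), which may differ from the parameterization used in the study, `sill` denotes the total sill, and the data are a toy example rather than census values.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Average 0.5 * (z_i - z_j)^2 over point pairs grouped by separation distance."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)       # every unordered pair once
    h = np.linalg.norm(coords[i] - coords[j], axis=1)
    half_sq = 0.5 * (values[i] - values[j]) ** 2
    lags, gammas = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (h >= lo) & (h < hi)
        if mask.any():                             # skip empty distance bins
            lags.append(h[mask].mean())
            gammas.append(half_sq[mask].mean())
    return np.array(lags), np.array(gammas)

def spherical(h, nugget, sill, vrange):
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.5 * h / vrange - 0.5 * (h / vrange) ** 3)
    return np.where(h >= vrange, sill, g)          # flat at the sill beyond the range

def exponential(h, nugget, sill, vrange):
    h = np.asarray(h, dtype=float)
    return nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / vrange))

def gaussian(h, nugget, sill, vrange):
    h = np.asarray(h, dtype=float)
    return nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * (h / vrange) ** 2))

def sum_squared_error(model, lags, gammas, nugget, sill, vrange):
    """Misfit between a theoretical model and the empirical semi-variogram."""
    return float(np.sum((model(lags, nugget, sill, vrange) - gammas) ** 2))

# Toy example: three collinear points.
lags, gammas = empirical_semivariogram([(0, 0), (1, 0), (2, 0)], [0.0, 1.0, 3.0],
                                       bin_edges=[0.5, 1.5, 2.5])
for model in (spherical, exponential, gaussian):
    print(model.__name__, sum_squared_error(model, lags, gammas, 0.0, 5.0, 3.0))
```

In practice the three parameters would be fitted to the empirical values, for example by least squares, and the fitted models compared with a criterion such as the AIC.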
Fig. 3. The theoretical semi-variogram, (a) exponential model, (b) spherical model, and (c) Gaussian model of the census data as primary variable of the Case 1
Fig. 4. The theoretical semi-variogram, (a) exponential model, (b) spherical model, and (c) Gaussian model of the residential area as secondary variable of the Case 1
Fig. 5. The theoretical semi-variogram of cross-variogram between the primary and the secondary variable of the Case 1
4. Results
- 4.1 Results of the dasymetric interpolation mapping
The population distributions estimated by the different models were aggregated by census unit so that they could be compared with each other. All models yield an average estimated population similar to that of the original data. However, the maximum population decreased and the minimum population increased in all models. The degree of change is larger in the Kriging models than in the RK model because the Kriging models use distance functions when estimating new values (Fig. 7). Fig. 6 shows the predicted population distribution in the study area. The ranges of the y-values differ because the two figures are calculated differently: Fig. 6 shows population density per residential unit, whereas Fig. 7 shows population for comparison with the census data.
Fig. 6. Predicted population density for the Case 1
Fig. 7. Box-Whisker plots of the predicted population density
- 4.2 Model evaluation and validation
Accuracy and validity were evaluated using the root mean square error (RMSE), mean absolute error (MAE), goodness-of-prediction statistic (G statistic) (Kravchenko and Bullock, 1999; Guisan and Zimmermann, 2000; Eldeiry and Garcia, 2010; Kim et al., 2010), and correlation coefficient (ρ). The RMSE is defined in Eq. (3) and the MAE in Eq. (4).
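$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{p}_i - p_i \right)^2} \tag{3}$$
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{p}_i - p_i \right| \tag{4}$$
where $\hat{p}_i$ is the predicted value of population at location $i$, $p_i$ is the original value of the population at location $i$, and $i = 1, 2, 3, \ldots, N$.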
The effectiveness of the models was measured by the G statistic that can be written as Eq. (5).
$$G = 1 - \frac{\sum_{i=1}^{N} \left( p_i - \hat{p}_i \right)^2}{\sum_{i=1}^{N} \left( p_i - \bar{p} \right)^2} \tag{5}$$
where $p_i$ is the original value of the population at location $i$, $\hat{p}_i$ is the predicted value of population at location $i$, and $\bar{p}$ is the mean of the population in the sample area. The model is more efficient when the G statistic has a positive value close to 1. The model is not very efficient when the G statistic has a negative value.
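The evaluation statistics can be computed directly from paired original and predicted values. The Python sketch below follows the definitions of the RMSE, MAE, and G statistic given in this section, with ρ taken as the Pearson correlation coefficient (an assumption, since the type of correlation is not stated); the function names and the sample values are illustrative and are not results from the study.

```python
import numpy as np

def rmse(p, p_hat):
    """Root mean square error between original p and predicted p_hat."""
    p, p_hat = np.asarray(p, dtype=float), np.asarray(p_hat, dtype=float)
    return float(np.sqrt(np.mean((p_hat - p) ** 2)))

def mae(p, p_hat):
    """Mean absolute error."""
    p, p_hat = np.asarray(p, dtype=float), np.asarray(p_hat, dtype=float)
    return float(np.mean(np.abs(p_hat - p)))

def g_statistic(p, p_hat):
    """Goodness of prediction: 1 is perfect, 0 matches predicting the mean,
    negative values are worse than predicting the mean."""
    p, p_hat = np.asarray(p, dtype=float), np.asarray(p_hat, dtype=float)
    return float(1.0 - np.sum((p - p_hat) ** 2) / np.sum((p - p.mean()) ** 2))

def correlation(p, p_hat):
    """Pearson correlation between original and predicted values."""
    return float(np.corrcoef(p, p_hat)[0, 1])

# Illustrative values only: every prediction is too high by exactly 1.
original = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(original, predicted), mae(original, predicted),
      g_statistic(original, predicted), correlation(original, predicted))
```

A constant bias like the one in this example leaves the correlation at 1 while lowering G, which is one reason to report the statistics together rather than relying on ρ alone.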
The correlation coefficient (ρ) measures how closely the pattern of the predicted values follows that of the original values. On every statistic, the RK model shows the best results, followed by the Kriging models and then the regression model. The OK model and the CK model show similar results in this study. Tables 2 and 3 show the evaluation and validation values for the models.
Table 2. Evaluation and validation values using the regression, RK, OK, and CK models (Case 1)
Table 3. Evaluation and validation values using the regression, RK, OK, and CK models (Case 2)
Fig. 8. NRMSE values for the Case 1
- 4.3 Zonal errors in population estimations
The normalized root mean square error (NRMSE) is used to investigate the prediction error of a specific unit area. The RK model's estimation error is large in low population density zones and small in high population density zones. The highest estimation errors occur in areas along the borders of the sample areas, even among the low population density areas.
Estimating the population distribution with the OK and CK models produces high errors at the boundaries of the sample region and in urban areas with small residential districts. In other words, a residential area that differs sharply in size from the surrounding areas produces large errors.
5. Discussion and Conclusions
In Dongdaemun-gu and Jungnang-gu (Case 1), the RK model, the Kriging models, and the regression model showed better results in that order on every statistic. The OK model and the CK model always showed similar results. In Mapo-gu and Seodaemun-gu (Case 2), the RK model, the CK model, the OK model, and the regression model showed better results in the order named. However, the difference in statistical results between the Kriging models and the RK model was not statistically significant.
For areas with highly complex land use patterns, it is hard to estimate the population distribution using previous weighting methods for dasymetric interpolation mapping, such as the areal weighting method (Goodchild et al., 1993) or the population proportion method (Eicher and Brewer, 2001). In the study area, the RK model is more accurate than the regression, OK, and CK models because it combines the advantages of the regression model and the Kriging model. In addition, the population estimated by the RK method has descriptive statistics similar to those of the original data. However, combining different spatial statistical models, as the RK model does, involves complicated calculations.
This study shows that the RK model can be an alternative method for estimating population distribution, although the Kriging model is frequently used for interpolation. The RK model is suitable for areas with a high population density and a strong positive correlation between target data and source data. Therefore, the RK model will be useful for metropolitan areas with a high population density.
References
Akaike H. , Krishnaiah P. R. 1977 On entropy maximization principle Proceedings of the Symposium on Applications of Statistics North-Holland, Amsterdam 27 - 41
Choe J. 2007 Geostatistics Sigma-press Seoul (in Korean)
Eicher C.L. , Brewer C.A. 2001 Dasymetric mapping and areal interpolation implementation and evaluation Cartography and Geographic Information Science 28 (2) 125 - 138
Eldeiry A.A. , Garcia L.A. 2010 Comparison of ordinary kriging, regression kriging, and cokriging techniques to estimate soil salinity using Landsat images Journal of Irrigation and Drainage Engineering 136 (6) 355 - 364
Flowerdew R. , Green M. 1992 Developments in areal interpolation methods and GIS The Annals of Regional Science 26 (1) 67 - 78
Goodchild M.F. , Anselin L. , Deichmann U. 1993 A framework for the areal interpolation of socioeconomic data Environment and Planning A 25 (3) 383 - 397
Guisan A. , Zimmermann N.E. 2000 Predictive habitat distribution models in ecology Ecological Modelling 135 (2) 147 - 186
Hengl T. , Heuvelink G.B.M. , Stein A. 2004 A generic framework for spatial prediction of soil variables based on regression-kriging Geoderma 120 (1) 75 - 93
Hengl T. , Heuvelink G.B.M. , Rossiter D.G. 2007 About regression-kriging from equations to case studies Computers & Geosciences 33 (10) 1301 - 1315
Kim B. , Ku C. , Choi J. 2010 Population distribution estimation using regression-kriging model Journal of the Korean Geographical Society (in Korean with English abstract) 45 (6) 806 - 819
Knotters M. , Brus D.J. , Oude Voshaar J.H. 1995 A comparison of kriging, co-kriging and kriging combined with regression for spatial interpolation of horizon depth with censored observations Geoderma 67 (3) 227 - 246
Kravchenko A. , Bullock D.G. 1999 A comparative study of interpolation methods for mapping soil properties Agronomy Journal 91 (3) 393 - 400
Ku C. 2008 A study on estimating the population in urban area with high resolution satellite image The Geographic Journal of Korea (in Korean with English abstract) 42 (1) 137 - 148
Langford M. , Unwin D.J. 1994 Generating and mapping population density surfaces within a geographical information system The Cartographic Journal 31 (1) 21 - 26
Lee S. , Kim K. 2007 Representing the population density distribution of Seoul using dasymetric mapping techniques in a GIS environment Journal of the Korean Cartographic Association (in Korean with English abstract) 7 (2) 53 - 67
Lwin K. , Murayama Y. 2009 A GIS approach to estimation of building population for micro-spatial analysis Transactions in GIS 13 (4) 401 - 414
Maantay J.A. , Maroko A.R. , Herrmann C. 2007 Mapping population distribution in the urban environment the cadastral-based expert dasymetric system (CEDS) Cartography and Geographic Information Science 34 (2) 77 - 102
Mennis J. 2003 Generating surface models of population using dasymetric mapping The Professional Geographer 55 (1) 31 - 42
Mennis J. , Hultgren T. 2006 Intelligent dasymetric mapping and its application to areal interpolation Cartography and Geographic Information Science 33 (3) 179 - 194
Monmonier M. , Schnell G. 1984 Land use and land cover data and the mapping of population density International Yearbook of Cartography 24 115 - 121
Oh D. 2013 A Comparative Study on the Spatial Statistical Models for the Estimation of Population Distribution, Master’s thesis Kyung Hee University Seoul, Korea (in Korean with English abstract) 74 -
Openshaw S. 1984 The Modifiable Areal Unit Problem Geo Books, Norwich, Norfolk
Reibel M. , Agrawal A. 2007 Areal interpolation of population counts using pre-classified land cover data Population Research and Policy Review 26 (5-6) 619 - 633
Reibel M. , Bufalino M.E. 2005 Street-weighted interpolation techniques for demographic count estimation in incompatible zone systems Environment and Planning A 37 (1) 127 - 139
Tapp A.F. 2010 Areal interpolation and dasymetric mapping methods using local ancillary data sources Cartography and Geographic Information Science 37 (3) 215 - 228
Wu C. , Murray A.T. 2005 A cokriging method for estimating population density in urban areas Computers, Environment and Urban Systems 29 (5) 558 - 579
Zandbergen P.A. 2011 Dasymetric mapping using high resolution address point datasets Transactions in GIS 15 (s1) 5 - 27