A Comparative Study on the Spatial Statistical Models for the Estimation of Population Distribution

Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography. 2015 Jun; 33(3): 145-153

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

- Received : April 04, 2015
- Accepted : June 06, 2015
- Published : June 30, 2015

This study aims to estimate population distribution at a finer spatial resolution than administrative units using a Regression-Kriging (RK) model. The RK model is an areal interpolation technique that combines linear regression with Kriging. To estimate population distribution in the sample regions, four models were applied and compared: a regression model, the RK model, an Ordinary Kriging (OK) model, and a Co-Kriging (CK) model. Accuracy and validity were evaluated using the root mean square error (RMSE), mean absolute error (MAE), G statistic, and correlation coefficient (ρ). In the sample regions, the RK model outperformed the other models on every statistic. The results of this comparative study will be useful for estimating the population distribution of metropolitan areas with high population density.
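Suppose that $\hat{z}(s_0)$ is the predicted population density at the location $s_0$, $\hat{m}(s_0)$ is the predicted value from the regression model, and $\hat{e}(s_0)$ is the predicted value using the Kriging model:
$$\hat{z}(s_0) = \hat{m}(s_0) + \hat{e}(s_0) \tag{1}$$
This equation can also be expressed as Eq. (2):
$$\hat{z}(s_0) = \sum_{k=0}^{p} \hat{\beta}_k \, q_k(s_0) + \sum_{i=1}^{n} \omega_i(s_0) \, e(s_i) \tag{2}$$
where $\hat{\beta}_k$ is the estimated regression coefficient, $q_k$ is the independent variable, $\omega_i$ is the Kriging weight, and $e(s_i)$ is the residual of the regression model at the location $s_i$.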
The RK model proceeds as follows:
Step 1. Correlation coefficients between combinations of land use classes and population density are computed, and the combination with the highest correlation coefficient is selected.
Step 2. Ordinary least squares (OLS) multiple regression is used to generate the population density surface. If the regression residuals show spatial autocorrelation, generalized least squares (GLS) regression can supplement the OLS regression.
Step 3. The regression predictions and the kriged residuals are added to obtain the estimated population distribution.
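The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation (the article states that the analysis was done in R and ArcGIS). It assumes plain OLS for the trend rather than GLS, a spherical semi-variogram whose nugget, sill, and range are supplied by the user rather than fitted, and ordinary Kriging of the residuals. The function names `spherical`, `ok_weights`, and `regression_kriging` and the sample numbers are hypothetical.

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    """Spherical semi-variogram with gamma(0) = 0 and gamma(h) = sill for h >= rng."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    g = np.where(h >= rng, sill, g)
    return np.where(h == 0.0, 0.0, g)

def ok_weights(coords, target, variogram):
    """Solve the ordinary Kriging system for the weights at one target location."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    a = np.ones((n + 1, n + 1))          # last row/column enforce sum(weights) = 1
    a[:n, :n] = variogram(d)
    a[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(coords - target, axis=1))
    return np.linalg.solve(a, b)[:n]     # drop the Lagrange multiplier

def regression_kriging(coords, q, z, target, q_target, variogram):
    """Trend from OLS plus ordinary Kriging of the OLS residuals."""
    x = np.column_stack([np.ones(len(z)), q])        # intercept + predictors
    beta, *_ = np.linalg.lstsq(x, z, rcond=None)     # OLS coefficients
    resid = z - x @ beta                             # regression residuals e(s_i)
    trend = np.concatenate([[1.0], np.atleast_1d(q_target)]) @ beta
    w = ok_weights(coords, target, variogram)
    return trend + w @ resid

# Made-up illustration data: unit midpoints, two residential-area shares, density.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.3]])
q = np.array([[0.2, 0.1], [0.5, 0.3], [0.1, 0.6], [0.7, 0.2], [0.4, 0.4]])
z = np.array([120.0, 310.0, 205.0, 400.0, 260.0])
vg = lambda h: spherical(h, nugget=0.0, sill=1.0, rng=2.0)
print(regression_kriging(coords, q, z, np.array([0.6, 0.6]), [0.3, 0.3], vg))
```

Because the Kriging weights sum to one and the semi-variogram is zero at zero distance, the sketch reproduces the observed value when the target coincides with an observed location, which is a convenient sanity check.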
Fig. 1. Theoretical semi-variograms of the regression residuals for Case 1: (a) exponential model, (b) spherical model, and (c) Gaussian model
Fig. 2. Theoretical semi-variograms of the regression residuals for Case 2: (a) exponential model, (b) spherical model, and (c) Gaussian model
Significance level: ***: 0, **: 0.05, *: 0.1
Fig. 3. Theoretical semi-variograms of the census data (primary variable) for Case 1: (a) exponential model, (b) spherical model, and (c) Gaussian model
Fig. 4. Theoretical semi-variograms of the residential area (secondary variable) for Case 1: (a) exponential model, (b) spherical model, and (c) Gaussian model
Fig. 5. Theoretical cross-variogram between the primary and secondary variables for Case 1
Fig. 6. Predicted population density for Case 1
Fig. 7. Box-whisker plots of the predicted population density
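$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{p}_i - p_i\right)^2} \tag{3}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{p}_i - p_i\right| \tag{4}$$
where $\hat{p}_i$ is the predicted value of the population at location $i$, $p_i$ is the original value of the population at location $i$, and $i = 1, 2, 3, \ldots, N$.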
The effectiveness of the models was measured by the G statistic, which can be written as Eq. (5).
$$G = 1 - \frac{\sum_{i=1}^{N}\left(p_i - \hat{p}_i\right)^2}{\sum_{i=1}^{N}\left(p_i - \bar{p}\right)^2} \tag{5}$$
where $p_i$ is the original value of the population at location $i$, $\hat{p}_i$ is the predicted value of the population at location $i$, and $\bar{p}$ is the mean of the population in the sample area. The closer the G statistic is to 1, the more efficient the model. A negative G statistic indicates that the model is less efficient than predicting with the sample mean.
The correlation coefficient (ρ) measures the agreement in pattern between the original values and the predicted values. On every statistic, the RK model performed best, followed by the Kriging models and then the regression model. The OK and CK models showed similar results in this study.
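The four evaluation statistics can be computed together as in the following sketch. It uses the usual definitions of RMSE and MAE, the G statistic as one minus the ratio of squared prediction error to squared deviation from the observed mean, and it assumes that ρ is the Pearson correlation coefficient, which the article does not state explicitly. The function name `evaluate` is hypothetical.

```python
import numpy as np

def evaluate(observed, predicted):
    """Return RMSE, MAE, G statistic, and Pearson correlation as a dict."""
    p = np.asarray(observed, dtype=float)
    p_hat = np.asarray(predicted, dtype=float)
    err = p_hat - p
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # G = 1 means perfect prediction; G <= 0 means no better than the mean.
    g = 1.0 - np.sum(err ** 2) / np.sum((p - p.mean()) ** 2)
    rho = np.corrcoef(p, p_hat)[0, 1]   # assumption: Pearson correlation
    return {"rmse": rmse, "mae": mae, "g": g, "rho": rho}

# Made-up values for illustration only.
print(evaluate([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```

Returning all four values from one function keeps the comparison across models consistent, since each model is scored against the same observed vector.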
Tables 2 and 3 show the evaluation and validation values of the models.
Table 2. Evaluation and validation values using the regression, RK, OK, and CK models (Case 1)
Table 3. Evaluation and validation values using the regression, RK, OK, and CK models (Case 2)
NRMSE values for Case 1

1. Introduction

Many complex studies require spatial data at a finer resolution than enumeration units such as administrative boundaries. Aggregated statistical data such as census data provide a summarized description, but they can also lead to misinterpretation. Openshaw (1984) claims that statistical sources can be biased because their results are summary values over arbitrary areal units, which is known as the Modifiable Areal Unit Problem (MAUP). Another problem of census data is that they assume a homogeneous population within each area (Monmonier and Schnell, 1984; Langford and Unwin, 1994; Eicher and Brewer, 2001). The dasymetric mapping method is one solution for interpolating census data because it allows a more realistic representation of population (Eicher and Brewer, 2001; Mennis, 2003; Mennis and Hultgren, 2006).
The dasymetric mapping method, the theoretical foundation of this study, has been studied extensively. Many researchers consider it one of the areal interpolation methods (Lee and Kim, 2007). From a mathematical perspective, it has mainly been treated as a weighting method for interpolating population to the target area. From a data perspective, interest has centered on the use of higher-resolution ancillary data. Ancillary data used for dasymetric mapping include residential areas extracted from high-resolution satellite images (Ku, 2008; Wu and Murray, 2005), cadastral map data (Maantay et al., 2007; Tapp, 2010), street and road data (Reibel and Bufalino, 2005), downtown building data (Lwin and Murayama, 2009), and point data of individual residences (Zandbergen, 2011). However, such high-resolution data are available only for certain areas, and a study cannot be conducted where no such data exist. Weighting methods used in dasymetric mapping, such as areal weighting (Mennis and Hultgren, 2006), population proportion, and regression analysis (Flowerdew and Green, 1992; Reibel and Agrawal, 2007), also have limitations because they may not reflect the characteristics of spatial data or may produce larger estimation errors. To reduce the estimation error, this study used different statistical models to calculate the weights for estimating population distribution. The originality of this study lies in using and comparing both the RK model and Kriging models for population estimation in the study area. Because previous work (Wu and Murray, 2005) used the CK model for population estimation in urban areas, we applied the same model, and we also applied the OK and regression models for comparison.
This study aims to estimate population distribution at a finer spatial resolution using geostatistical methods. To do so, we applied four models and compared their results: (1) a regression model, (2) a Regression-Kriging (RK) model, (3) an Ordinary Kriging (OK) model, and (4) a Co-Kriging (CK) model. We investigated the applicability of the RK model to estimating the population of metropolitan areas with high population density, because the RK model is suited to predicting subtle changes. In this respect, we considered the relationship between the size of the residential area and the population size in the study area. We selected study areas suitable for the RK model by testing samples several times. Both study areas, Dongdaemun-gu and Jungnang-gu (Case 1) and Mapo-gu and Seodaemun-gu (Case 2), have relatively high population density within Seoul, Korea.
We used land use data of the 2000s from the NGII (National Geographic Information Institute) as the ancillary data, that is, the target data in the dasymetric mapping. Two types of residential area from the land use data were used as independent variables in the regression analysis. Using the land use categories, we assumed that population is distributed only in residential areas. Residential zones were divided into two classes: high population density areas (buildings above the 5th floor and subsidiary facilities) and low population density areas (buildings below the 5th floor and subsidiary facilities). The smallest administrative areal unit (dong) from the 2000 census data was used as the source data. Statistical analyses were performed in R 2.15.2, and ArcGIS 10 was used for spatial analysis and mapping.
2. Approach

- 2.1 Dasymetric interpolation mapping with RK model

The RK model combines a linear regression model, which analyzes the overall drift/trend between the variables, with a Kriging model, which interpolates the regression residuals (Hengl et al., 2004; Hengl et al., 2007). The RK model can be written as Eq. (1).

- 2.2 Dasymetric interpolation mapping with Kriging model

To perform the OK and CK models, the covariance matrix and the variogram are calculated from a primary variable (census data) for the OK model, and from a primary variable and a secondary variable (residential area information) for the CK model. The primary and secondary variables used in the CK model are spatially interrelated. Among the theoretical semi-variograms, the spherical model showed the best fit to the semi-variogram parameters in every case in this study. The population weights used in the OK and CK models can be derived from the semi-variogram model. Refer to Choe (2007) and Knotters et al. (1995) for further explanation of the models. In this study, we performed Kriging of population distribution as a continuous variable and re-aggregated the population values in each residential zone. In other words, the arithmetic mean of all estimated population density points within a grid cell was calculated for the predictions.
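The re-aggregation step, taking the arithmetic mean of the kriged points that fall in each cell or zone, might look like the sketch below. It assumes each predicted point already carries an integer zone index (how points were assigned to zones in ArcGIS is not described in the article), and the function name `zonal_mean` is hypothetical.

```python
import numpy as np

def zonal_mean(zone_ids, values, n_zones):
    """Arithmetic mean of point predictions per zone; NaN for zones with no points."""
    zone_ids = np.asarray(zone_ids)
    values = np.asarray(values, dtype=float)
    sums = np.bincount(zone_ids, weights=values, minlength=n_zones)
    counts = np.bincount(zone_ids, minlength=n_zones)
    # Suppress the 0/0 warning for empty zones; those entries become NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

# Made-up example: six kriged points assigned to zones 0, 1, and 2; zone 3 is empty.
print(zonal_mean([0, 0, 1, 2, 2, 2], [10.0, 20.0, 5.0, 1.0, 2.0, 3.0], n_zones=4))
```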
3. Case Studies

- 3.1 A case study with the RK model

We used the census data and the binary residential areas of the land use data to carry out dasymetric mapping with the RK model. After a correlation analysis between the census data and the binary residential areas to determine the regression variables, regression analysis was performed to generate the trend surface of population density. Based on the Durbin-Watson test, GLS regression was used instead of OLS regression. The minimum AIC (Akaike Information Criterion) can be used as a criterion for model selection, and we applied the AIC to the spherical, exponential, and Gaussian models. Refer to Akaike (1977) and Eldeiry and Garcia (2010) for further explanation of the AIC. Fig. 1 and Fig. 2 show the theoretical semi-variograms of the residuals, which were fitted with an exponential model in Case 1 and a spherical model in Case 2. Table 1 provides the estimated coefficients by type of regression model and the AIC values for model selection in Case 1. The GLS residuals were interpolated using the Kriging model, which gives locally unbiased, minimum-variance estimates. The estimated population density is the sum of the results of the regression model and the Kriging model.
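The two diagnostics mentioned above can be sketched as follows. The Durbin-Watson statistic is computed from residuals in a given order (how the article ordered the spatial units is not stated, so the ordering here is an assumption), and the AIC is written in its least-squares form up to an additive constant, which is enough for ranking candidate models fitted to the same data. The function names are hypothetical, and the values are made up.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def aic_least_squares(resid, n_params):
    """AIC (up to an additive constant) for a Gaussian least-squares fit."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    rss = np.sum(resid ** 2)
    return n * np.log(rss / n) + 2 * n_params

# Made-up residuals from two candidate fits with the same number of parameters.
fit_a = np.array([0.5, -0.4, 0.3, -0.6, 0.2])
fit_b = np.array([1.0, -0.9, 0.8, -1.1, 0.7])
print(durbin_watson(fit_a))
print(aic_least_squares(fit_a, 3), aic_least_squares(fit_b, 3))
```

The candidate with the smaller AIC would be preferred, which mirrors the minimum-AIC criterion described in the text.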

Table 1. Regression coefficients and AIC for Case 1

- 3.2 A case study with the Kriging model

We applied the OK and CK models to estimate population distribution in the study area. The OK model requires population data as the primary variable, whereas the CK model additionally requires a secondary variable that is spatially correlated with the primary variable. A grid was created from the midpoints of the smallest administrative areal units to calculate the Kriging weights.
To model population density using Kriging, a covariance model has to be specified for the primary variable, the secondary variable, and the cross variable. Using the covariance model, we ran the semi-variogram experiments several times and obtained the semi-variogram parameters. The semi-variogram provides the nugget, sill, and range needed to perform Kriging. Using these parameters, we selected the best theoretical semi-variogram model.
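To make the roles of the nugget, sill, and range concrete, the sketch below computes an empirical semi-variogram by distance bin and defines the three theoretical models compared in the article. It is an illustration under stated assumptions: `sill` is the total sill (nugget plus partial sill), the exponential and Gaussian models use the "practical range" convention with a factor of 3 (other software uses different scalings), and the formulas apply for h > 0. All function names and data are hypothetical.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Mean of 0.5 * (z_i - z_j)^2 over point pairs falling in each distance bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)          # every pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    half_sq = 0.5 * (values[i] - values[j]) ** 2
    which = np.digitize(dist, bin_edges) - 1          # bin index per pair
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)                   # NaN where a bin is empty
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            gamma[b] = half_sq[mask].mean()
    return gamma

def spherical_model(h, nugget, sill, rng):
    h = np.asarray(h, dtype=float)
    inside = nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    return np.where(h < rng, inside, sill)            # flat at the sill beyond the range

def exponential_model(h, nugget, sill, rng):
    h = np.asarray(h, dtype=float)
    return nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / rng))

def gaussian_model(h, nugget, sill, rng):
    h = np.asarray(h, dtype=float)
    return nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * (h / rng) ** 2))

# Made-up points on a line, for illustration only.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
vals = np.array([0.0, 1.0, 3.0])
print(empirical_semivariogram(pts, vals, bin_edges=[0.5, 1.5, 2.5]))
```

A candidate model can then be compared with the binned values, for example by the sum of squared differences at the bin centers, before choosing among the spherical, exponential, and Gaussian forms.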
Fig. 3, Fig. 4, and Fig. 5 show the theoretical semi-variograms of the census data, the residential area data, and the cross-variogram between the census and residential area data, respectively, for Case 1. The same procedure was followed for Case 2. The population weights used in the OK and CK models can be derived from the semi-variogram.
4. Results

- 4.1 Results of the dasymetric interpolation mapping

The population estimates from the different models were aggregated by census unit so that they could be compared with each other. All models yielded an average estimated population similar to that of the original data. However, in all models the maximum population decreased while the minimum increased. The degree of change was larger in the Kriging models than in the RK model because the Kriging models use distance functions when estimating new values (Fig. 7). Fig. 6 shows the predicted population distribution in the study area. The ranges of the y-values differ because the two figures are calculated differently: Fig. 6 shows population density per residential unit, whereas Fig. 7 shows population for comparison with the census data.

- 4.2 Model evaluation and validation

Accuracy and validity were evaluated on the basis of the root mean square error (RMSE), mean absolute error (MAE), goodness-of-prediction statistic (G statistic) (Kravchenko and Bullock, 1999; Guisan and Zimmermann, 2000; Eldeiry and Garcia, 2010; Kim et al., 2010), and correlation coefficient (ρ). The RMSE is defined as Eq. (3) and the MAE as Eq. (4).

- 4.3 Zonal errors in population estimations

The normalized root mean square error (NRMSE) was used to investigate the prediction error of each unit area. The RK model's estimation error was large in low population density zones and small in high population density zones. Among these, the highest estimation errors were found in zones along the borders of the sample areas, even among low population density areas.
The OK and CK models produced high errors along boundaries and in urban areas with small residential districts within the sample region. Consequently, zones whose area differed markedly from that of the surrounding zones produced large errors.
5. Discussion and Conclusions

For Dongdaemun-gu and Jungnang-gu (Case 1), the RK model, the Kriging models, and the regression model performed best in that order on every statistic. The OK and CK models always showed similar results. For Mapo-gu and Seodaemun-gu (Case 2), the RK model, CK model, OK model, and regression model performed best in that order. However, the difference between the Kriging models and the RK model was not statistically significant.
It is difficult to estimate population distribution in areas with highly complex land use patterns using earlier weighting methods for dasymetric interpolation mapping, such as the areal weighting method (Goodchild et al., 1993) or the population proportion method (Eicher and Brewer, 2001). In the study areas, the RK model was more accurate than the regression, OK, and CK models because it combines the advantages of the regression model and the Kriging model. In addition, the population estimated by the RK method has descriptive statistics similar to those of the original data. However, combining different spatial statistical models, as the RK model does, involves complicated calculations.
This study shows that the RK model can be an alternative method for estimating population distribution, although the Kriging model is more frequently used for interpolation. The RK model is suitable for areas with high population density and a strong positive correlation between the target data and the source data. Therefore, the RK model will be useful for metropolitan areas with high population density.
Akaike, H. (1977), On entropy maximization principle, In: Krishnaiah, P. R. (ed.), Proceedings of the Symposium on Applications of Statistics, North-Holland, Amsterdam, pp. 27-41.

Choe, J. (2007), Geostatistics, Sigma-press, Seoul. (in Korean)

Kim, B., Ku, C., and Choi, J. (2010), Population distribution estimation using regression-kriging model, Journal of the Korean Geographical Society, Vol. 45, No. 6, pp. 806-819. (in Korean with English abstract)

Ku, C. (2008), A study on estimating the population in urban area with high resolution satellite image, The Geographic Journal of Korea, Vol. 42, No. 1, pp. 137-148. (in Korean with English abstract)

Lee, S. and Kim, K. (2007), Representing the population density distribution of Seoul using dasymetric mapping techniques in a GIS environment, Journal of the Korean Cartographic Association, Vol. 7, No. 2, pp. 53-67. (in Korean with English abstract)

Mennis, J. (2003), Generating surface models of population using dasymetric mapping, The Professional Geographer, Vol. 55, No. 1, pp. 31-42.

Monmonier, M. and Schnell, G. (1984), Land use and land cover data and the mapping of population density, International Yearbook of Cartography, Vol. 24, pp. 115-121.

Oh, D. (2013), A Comparative Study on the Spatial Statistical Models for the Estimation of Population Distribution, Master's thesis, Kyung Hee University, Seoul, Korea, 74 - (in Korean with English abstract)

Openshaw, S. (1984), The Modifiable Areal Unit Problem, Geo Books, Norwich, Norfolk.

Citing 'A Comparative Study on the Spatial Statistical Models for the Estimation of Population Distribution'

@article{GCRHBD_2015_v33n3_145,
  title={A Comparative Study on the Spatial Statistical Models for the Estimation of Population Distribution},
  volume={33},
  number={3},
  url={http://dx.doi.org/10.7848/ksgpc.2015.33.3.145},
  DOI={10.7848/ksgpc.2015.33.3.145},
  journal={Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography},
  publisher={Korean Society of Surveying, Geodesy, Photogrammetry and Cartography},
  author={Oh, Doo-Ri and Hwang, Chul Sue},
  year={2015},
  month={Jun}
}