Wednesday, October 19, 2022 From rOpenSci (https://ropensci.org/blog/2022/10/19/censo2017-one-year-after/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
This post is about the surprising uses I’ve noticed and the questions about the censo2017 R package, a tool for accessing the Chilean census 2017 data, I’ve gotten since it was peer-reviewed through rOpenSci one year ago. The original post about the package one year ago didn’t cover the different examples I present here, including a Python port of the R package.
Three years ago, I had to complete an assignment that required me to extract data from Windows-only software in DVD format, which got very complicated.
I needed to access REDATAM files and obtain a few population summaries with specific software for that format. To my surprise, the task got incredibly challenging, and I exported the data to SQL for more accessible data extraction.
What I initially organized in PostgreSQL, ended up being organized in a local (embedded) database, which is more convenient for end-users and can be installed with two lines of code.
Installation in R:
remotes::install_github("ropensci/censo2017")
censo2017::censo_descargar()
Installation in Python:
pip install git+https://github.com/pachadotdev/censo2017python.git#egg=censo2017
import censo2017
censo2017.descargar()
We can use the census data to determine where substandard housing is common, leading to valuable insights for public policy. That is something I never contemplated when creating the software.
What kind of housing information can we look for in the census? To keep
it simple, let’s explore the p01
variable, which we can look at in the
variables_codificacion
, a table that I created to organize my work.
library(censo2017)
library(dplyr)
con <- censo_conectar()
tbl(con, "variables") %>%
filter(variable == "p01") %>%
collect() %>%
kable()
tabla | variable | descripcion | tipo | rango |
---|---|---|---|---|
viviendas | p01 | Tipo de Vivienda | integer | 1 - 10 |
tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(valor, descripcion) %>%
collect() %>%
kable()
valor | descripcion |
---|---|
1 | Casa |
2 | Departamento en Edificio |
3 | Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras) |
4 | Pieza en Casa Antigua o en Conventillo |
5 | Mediagua, Mejora, Rancho o Choza |
6 | Móvil (Carpa, Casa Rodante o Similar) |
7 | Otro Tipo de Vivienda Particular |
8 | Vivienda Colectiva |
9 | Operativo Personas en Tránsito (No Es Vivienda) |
10 | Operativo Calle (No Es Vivienda) |
11 | Valor Perdido |
0 | No Aplica |
The variable refers to “Tipo de Vivienda” (Housing Type) and contains ten numeric values that mean:
Value | Description |
---|---|
1 | House |
2 | Apartment Building |
3 | Traditional Indigenous Dwelling (Ruka1 or Others) |
4 | Bedroom in an Old House or in a Conventillo2 |
5 | Half Hut, Improvement, Ranch or Shack |
6 | Mobile (Tent, Mobile Home or Similar) |
7 | Other Type of Private Housing |
8 | Collective Housing |
9 | Persons in Transit Operative (Not a Dwelling) |
10 | Street Operative (Not Housing) |
11 | Missing Value |
0 | Not applicable |
p01
and count the number of units.For example, if we want to look at the 14th region specifically, one way to do this could be:
tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == 14) %>%
inner_join(tbl(con, "viviendas")) %>%
group_by(p01) %>%
count() %>%
collect() %>%
kable()
p01 | n |
---|---|
1 | 144081 |
2 | 4793 |
3 | 44 |
4 | 494 |
5 | 2187 |
6 | 35 |
7 | 1875 |
8 | 467 |
9 | 6 |
10 | 8 |
We can improve this a little bit with an optional join and a step to extract the percentages:
tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == "14") %>%
inner_join(tbl(con, "viviendas")) %>%
group_by(p01) %>%
count() %>%
left_join(
tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(p01 = valor, p01_desc = descripcion)
) %>%
select(p01, p01_desc, n) %>%
arrange(-n) %>%
ungroup() %>%
mutate(p = round(100 * n / sum(n), 2)) %>%
collect() %>%
kable()
p01 | p01_desc | n | p |
---|---|---|---|
1 | Casa | 144081 | 93.57 |
2 | Departamento en Edificio | 4793 | 3.11 |
5 | Mediagua, Mejora, Rancho o Choza | 2187 | 1.42 |
7 | Otro Tipo de Vivienda Particular | 1875 | 1.22 |
4 | Pieza en Casa Antigua o en Conventillo | 494 | 0.32 |
8 | Vivienda Colectiva | 467 | 0.30 |
3 | Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras) | 44 | 0.03 |
6 | Móvil (Carpa, Casa Rodante o Similar) | 35 | 0.02 |
10 | Operativo Calle (No Es Vivienda) | 8 | 0.01 |
9 | Operativo Personas en Tránsito (No Es Vivienda) | 6 | 0.00 |
For the 14th region, housing quality seems not to be a problem. What if we wonder for the number of people living in each of these unit types? This would need additional steps to match people (“personas”) to homes (“hogares”), and homes to households.
tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == "14") %>%
inner_join(tbl(con, "viviendas")) %>%
inner_join(tbl(con, "hogares")) %>%
inner_join(tbl(con, "personas")) %>%
group_by(p01) %>%
count() %>%
left_join(
tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(p01 = valor, p01_desc = descripcion)
) %>%
select(p01, p01_desc, n) %>%
arrange(-n) %>%
ungroup() %>%
mutate(p = round(100 * n / sum(n), 2)) %>%
collect() %>%
kable()
p01 | p01_desc | n | p |
---|---|---|---|
1 | Casa | 359343 | 93.38 |
2 | Departamento en Edificio | 9139 | 2.37 |
8 | Vivienda Colectiva | 7905 | 2.05 |
5 | Mediagua, Mejora, Rancho o Choza | 3980 | 1.03 |
7 | Otro Tipo de Vivienda Particular | 3008 | 0.78 |
4 | Pieza en Casa Antigua o en Conventillo | 727 | 0.19 |
9 | Operativo Personas en Tránsito (No Es Vivienda) | 494 | 0.13 |
3 | Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras) | 96 | 0.02 |
10 | Operativo Calle (No Es Vivienda) | 84 | 0.02 |
6 | Móvil (Carpa, Casa Rodante o Similar) | 61 | 0.02 |
How does the previous result compare with the capitol (13th region)?
tbl(con, "zonas") %>%
mutate(region = str_sub(geocodigo, 1, 2)) %>%
filter(region == "13") %>%
inner_join(tbl(con, "viviendas")) %>%
inner_join(tbl(con, "hogares")) %>%
inner_join(tbl(con, "personas")) %>%
group_by(p01) %>%
count() %>%
left_join(
tbl(con, "variables_codificacion") %>%
filter(variable == "p01") %>%
select(p01 = valor, p01_desc = descripcion)
) %>%
select(p01, p01_desc, n) %>%
arrange(-n) %>%
ungroup() %>%
mutate(p = round(100 * n / sum(n), 2)) %>%
collect() %>%
kable()
p01 | p01_desc | n | p |
---|---|---|---|
1 | Casa | 5265416 | 74.03 |
2 | Departamento en Edificio | 1613157 | 22.68 |
8 | Vivienda Colectiva | 78763 | 1.11 |
4 | Pieza en Casa Antigua o en Conventillo | 72142 | 1.01 |
5 | Mediagua, Mejora, Rancho o Choza | 52710 | 0.74 |
7 | Otro Tipo de Vivienda Particular | 17335 | 0.24 |
9 | Operativo Personas en Tránsito (No Es Vivienda) | 9113 | 0.13 |
10 | Operativo Calle (No Es Vivienda) | 2740 | 0.04 |
3 | Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras) | 865 | 0.01 |
6 | Móvil (Carpa, Casa Rodante o Similar) | 567 | 0.01 |
From the previous summarised data we can see that in the capitol, 52,710 individuals are living in substandard housing (Mediagua, Mejora, Rancho o Choza).
We shouldn’t forget to close the database connection.
censo_desconectar()
The motivation for this example is to show that the goal of organizing the census data was not to promote R, but to show some of the benefits of having the data organized in an open format, regardless of the programming language used for the analysis. For me it’s easier to complete this example in R, but I wanted to include those using this census data and Python.
The variable for the water source is p05
. The analysis is very similar
to the previous one, and we start by looking at its description. This particular
example is relevant, as it allows to quantify the number of people that might
be exposed to water contamination in a certain area.
import censo2017
import duckdb
con = duckdb.connect(database = censo2017.archivo_sql())
con.execute("SELECT * FROM variables_codificacion WHERE variable = 'p05'").df()
## tabla variable valor descripcion
## 0 viviendas p05 1 Red Pública
## 1 viviendas p05 2 Pozo o Noria
## 2 viviendas p05 3 Camión Aljibe
## 3 viviendas p05 4 Río, Vertiente, Estero, Canal, Lago, Etc.
## 4 viviendas p05 99 Valor Perdido
## 5 viviendas p05 98 No Aplica
Value | Description |
---|---|
1 | Public Network |
2 | Well or Noria |
3 | Cistern Truck |
4 | River, Stream, Estuary, Canal, Lake, Etc. |
99 | Missing Value |
98 | Not Applicable |
By using this variable, we can know the number of people for each water source in the same way as the previous case.
con.execute("""
SELECT "p05", "p05_desc", "n",
ROUND((100.0 * "n") / SUM("n") OVER (), CAST(ROUND(2.0, 0) AS INTEGER)) AS "p"
FROM (
SELECT "LHS"."p05" AS "p05", "n", "p05_desc"
FROM (
SELECT "p05", COUNT(*) AS "n"
FROM (
SELECT *, "LHS"."hogar_ref_id" AS "hogar_ref_id"
FROM (
SELECT *, "LHS"."vivienda_ref_id" AS "vivienda_ref_id"
FROM (
SELECT *, "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id"
FROM (
SELECT *
FROM (
SELECT *, SUBSTR("geocodigo", 1, 2) AS "region"
FROM "zonas"
) "q01"
WHERE ("region" = '14')
) "LHS"
INNER JOIN "viviendas" AS "RHS"
ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
) "LHS"
INNER JOIN "hogares" AS "RHS"
ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
) "LHS"
INNER JOIN "personas" AS "RHS"
ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
) "q02"
GROUP BY "p05"
) "LHS"
LEFT JOIN (
SELECT "valor" AS "p05", "descripcion" AS "p05_desc"
FROM "variables_codificacion"
WHERE ("variable" = 'p05')
) "RHS"
ON ("LHS"."p05" = "RHS"."p05")
) "q03"
ORDER BY -"n"
""").df()
## p05 p05_desc n p
## 0 1 Red Pública 299464 77.82
## 1 4 Río, Vertiente, Estero, Canal, Lago, Etc. 38796 10.08
## 2 2 Pozo o Noria 32150 8.35
## 3 98 No Aplica 8483 2.20
## 4 3 Camión Aljibe 3558 0.92
## 5 99 Valor Perdido 2386 0.62
Where we see that for the 14th region, 22.18% of the population doesn’t
have access to the public tap water network. The advantage of this
approach is that here we used pure SQL for querying, which would work in
any language where we can pass SQL queries. The disadvantage is not
having a tool such as dplyr
and the need to learn the SQL syntax.
We shouldn’t forget to close the database connection.
con.close()
The examples shown here were simplified. Most of the inquiries I get are about data summaries at the sub-communal level, for example how to count the occurrences of a categorical variable in a neighborhood, as we did in our example of water access (i.e., well or public network). By using censo2017 we can extract useful information for evidence-based public policy when it comes to, for example, prioritizing budget decisions.