rOpenSci | Interesting Uses of censo2017 a Year After Publishing

Interesting Uses of censo2017 a Year After Publishing

🔗 Summary

This post is about the surprising uses I’ve noticed and the questions about the censo2017 R package, a tool for accessing the Chilean census 2017 data, I’ve gotten since it was peer-reviewed through rOpenSci one year ago. The original post about the package one year ago didn’t cover the different examples I present here, including a Python port of the R package.

🔗 Organizing the census data

Three years ago, I had to complete an assignment that required me to extract data from Windows-only software in DVD format, which got very complicated.

I needed to access REDATAM files and obtain a few population summaries with specific software for that format. To my surprise, the task got incredibly challenging, and I exported the data to SQL for more accessible data extraction.

What I initially organized in PostgreSQL, ended up being organized in a local (embedded) database, which is more convenient for end-users and can be installed with two lines of code.

Installation in R:

remotes::install_github("ropensci/censo2017")
censo2017::censo_descargar()

Installation in Python:

pip install git+https://github.com/pachadotdev/censo2017python.git#egg=censo2017

import censo2017
censo2017.descargar()

🔗 Unexpected use I: Substandard housing (in R)

We can use the census data to determine where substandard housing is common, leading to valuable insights for public policy. That is something I never contemplated when creating the software.

What kind of housing information can we look for in the census? To keep it simple, let’s explore the p01 variable, which we can look at in the variables_codificacion, a table that I created to organize my work.

library(censo2017)
library(dplyr)

con <- censo_conectar()

tbl(con, "variables") %>% 
  filter(variable == "p01") %>% 
  collect() %>% 
  kable()
tablavariabledescripciontiporango
viviendasp01Tipo de Viviendainteger1 - 10
tbl(con, "variables_codificacion") %>% 
  filter(variable == "p01") %>% 
  select(valor, descripcion) %>% 
  collect() %>% 
  kable()
valordescripcion
1Casa
2Departamento en Edificio
3Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)
4Pieza en Casa Antigua o en Conventillo
5Mediagua, Mejora, Rancho o Choza
6Móvil (Carpa, Casa Rodante o Similar)
7Otro Tipo de Vivienda Particular
8Vivienda Colectiva
9Operativo Personas en Tránsito (No Es Vivienda)
10Operativo Calle (No Es Vivienda)
11Valor Perdido
0No Aplica

The variable refers to “Tipo de Vivienda” (Housing Type) and contains ten numeric values that mean:

ValueDescription
1House
2Apartment Building
3Traditional Indigenous Dwelling (Ruka1 or Others)
4Bedroom in an Old House or in a Conventillo2
5Half Hut, Improvement, Ranch or Shack
6Mobile (Tent, Mobile Home or Similar)
7Other Type of Private Housing
8Collective Housing
9Persons in Transit Operative (Not a Dwelling)
10Street Operative (Not Housing)
11Missing Value
0Not applicable

  1. traditional Mapuche house type↩︎

  2. A building designed to house as many people as possible in a small space at a low cost.↩︎

In this example, I consider dwellings coded as 5 (Half Hut, Improvement, Ranch or Shack / Mediagua, Mejora, Rancho o Choza) as substandard. Because of the census data organization, to get the share of these types of housing per region, we need to:
  1. Create the code for each political division (i.e., region) from the geographical code in the zones table.
  2. Join the zones (“zonas”) table with the households (“viviendas”) table to match each unit to its location (i.e., household ID 100 is in the “Los Ríos (14th)” region).
  3. Group by p01 and count the number of units.

For example, if we want to look at the 14th region specifically, one way to do this could be:

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == 14) %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  collect() %>% 
  kable()
p01n
1144081
24793
344
4494
52187
635
71875
8467
96
108

We can improve this a little bit with an optional join and a step to extract the percentages:

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == "14") %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  left_join(
    tbl(con, "variables_codificacion") %>% 
      filter(variable == "p01") %>% 
      select(p01 = valor, p01_desc = descripcion)
  ) %>% 
  select(p01, p01_desc, n) %>% 
  
  arrange(-n) %>% 
  ungroup() %>% 
  mutate(p = round(100 * n / sum(n), 2)) %>% 
  
  collect() %>% 
  kable()
p01p01_descnp
1Casa14408193.57
2Departamento en Edificio47933.11
5Mediagua, Mejora, Rancho o Choza21871.42
7Otro Tipo de Vivienda Particular18751.22
4Pieza en Casa Antigua o en Conventillo4940.32
8Vivienda Colectiva4670.30
3Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)440.03
6Móvil (Carpa, Casa Rodante o Similar)350.02
10Operativo Calle (No Es Vivienda)80.01
9Operativo Personas en Tránsito (No Es Vivienda)60.00

For the 14th region, housing quality seems not to be a problem. What if we wonder for the number of people living in each of these unit types? This would need additional steps to match people (“personas”) to homes (“hogares”), and homes to households.

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == "14") %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  inner_join(tbl(con, "hogares")) %>% 
  inner_join(tbl(con, "personas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  left_join(
    tbl(con, "variables_codificacion") %>% 
      filter(variable == "p01") %>% 
      select(p01 = valor, p01_desc = descripcion)
  ) %>% 
  select(p01, p01_desc, n) %>% 
  
  arrange(-n) %>% 
  ungroup() %>% 
  mutate(p = round(100 * n / sum(n), 2)) %>% 
  
  collect() %>% 
  kable()
p01p01_descnp
1Casa35934393.38
2Departamento en Edificio91392.37
8Vivienda Colectiva79052.05
5Mediagua, Mejora, Rancho o Choza39801.03
7Otro Tipo de Vivienda Particular30080.78
4Pieza en Casa Antigua o en Conventillo7270.19
9Operativo Personas en Tránsito (No Es Vivienda)4940.13
3Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)960.02
10Operativo Calle (No Es Vivienda)840.02
6Móvil (Carpa, Casa Rodante o Similar)610.02

How does the previous result compare with the capitol (13th region)?

tbl(con, "zonas") %>% 
  mutate(region = str_sub(geocodigo, 1, 2)) %>% 
  filter(region == "13") %>% 
  
  inner_join(tbl(con, "viviendas")) %>% 
  
  inner_join(tbl(con, "hogares")) %>% 
  inner_join(tbl(con, "personas")) %>% 
  
  group_by(p01) %>% 
  count() %>% 
  
  left_join(
    tbl(con, "variables_codificacion") %>% 
      filter(variable == "p01") %>% 
      select(p01 = valor, p01_desc = descripcion)
  ) %>% 
  select(p01, p01_desc, n) %>% 
  
  arrange(-n) %>% 
  ungroup() %>% 
  mutate(p = round(100 * n / sum(n), 2)) %>% 
  
  collect() %>% 
  kable()
p01p01_descnp
1Casa526541674.03
2Departamento en Edificio161315722.68
8Vivienda Colectiva787631.11
4Pieza en Casa Antigua o en Conventillo721421.01
5Mediagua, Mejora, Rancho o Choza527100.74
7Otro Tipo de Vivienda Particular173350.24
9Operativo Personas en Tránsito (No Es Vivienda)91130.13
10Operativo Calle (No Es Vivienda)27400.04
3Vivienda Tradicional Indígena (Ruka, Pae Pae u Otras)8650.01
6Móvil (Carpa, Casa Rodante o Similar)5670.01

From the previous summarised data we can see that in the capitol, 52,710 individuals are living in substandard housing (Mediagua, Mejora, Rancho o Choza).

We shouldn’t forget to close the database connection.

censo_desconectar()

🔗 Unexpected use II: Water source (in Python)

The motivation for this example is to show that the goal of organizing the census data was not to promote R, but to show some of the benefits of having the data organized in an open format, regardless of the programming language used for the analysis. For me it’s easier to complete this example in R, but I wanted to include those using this census data and Python.

The variable for the water source is p05. The analysis is very similar to the previous one, and we start by looking at its description. This particular example is relevant, as it allows to quantify the number of people that might be exposed to water contamination in a certain area.

import censo2017
import duckdb

con = duckdb.connect(database = censo2017.archivo_sql())

con.execute("SELECT * FROM variables_codificacion WHERE variable = 'p05'").df()

##        tabla variable  valor                                descripcion
## 0  viviendas      p05      1                                Red Pública
## 1  viviendas      p05      2                               Pozo o Noria
## 2  viviendas      p05      3                              Camión Aljibe
## 3  viviendas      p05      4  Río, Vertiente, Estero, Canal, Lago, Etc.
## 4  viviendas      p05     99                              Valor Perdido
## 5  viviendas      p05     98                                  No Aplica
ValueDescription
1Public Network
2Well or Noria
3Cistern Truck
4River, Stream, Estuary, Canal, Lake, Etc.
99Missing Value
98Not Applicable

By using this variable, we can know the number of people for each water source in the same way as the previous case.

con.execute("""
SELECT "p05", "p05_desc", "n",
  ROUND((100.0 * "n") / SUM("n") OVER (), CAST(ROUND(2.0, 0) AS INTEGER)) AS "p"
FROM (
  SELECT "LHS"."p05" AS "p05", "n", "p05_desc"
  FROM (
    SELECT "p05", COUNT(*) AS "n"
    FROM (
      SELECT *, "LHS"."hogar_ref_id" AS "hogar_ref_id"
      FROM (
        SELECT *, "LHS"."vivienda_ref_id" AS "vivienda_ref_id"
        FROM (
          SELECT *, "LHS"."zonaloc_ref_id" AS "zonaloc_ref_id"
          FROM (
            SELECT *
            FROM (
              SELECT *, SUBSTR("geocodigo", 1, 2) AS "region"
              FROM "zonas"
            ) "q01"
            WHERE ("region" = '14')
          ) "LHS"
          INNER JOIN "viviendas" AS "RHS"
            ON ("LHS"."zonaloc_ref_id" = "RHS"."zonaloc_ref_id")
        ) "LHS"
        INNER JOIN "hogares" AS "RHS"
          ON ("LHS"."vivienda_ref_id" = "RHS"."vivienda_ref_id")
      ) "LHS"
      INNER JOIN "personas" AS "RHS"
        ON ("LHS"."hogar_ref_id" = "RHS"."hogar_ref_id")
    ) "q02"
    GROUP BY "p05"
  ) "LHS"
  LEFT JOIN (
    SELECT "valor" AS "p05", "descripcion" AS "p05_desc"
    FROM "variables_codificacion"
    WHERE ("variable" = 'p05')
  ) "RHS"
    ON ("LHS"."p05" = "RHS"."p05")
) "q03"
ORDER BY -"n"
""").df()

##    p05                                   p05_desc       n      p
## 0    1                                Red Pública  299464  77.82
## 1    4  Río, Vertiente, Estero, Canal, Lago, Etc.   38796  10.08
## 2    2                               Pozo o Noria   32150   8.35
## 3   98                                  No Aplica    8483   2.20
## 4    3                              Camión Aljibe    3558   0.92
## 5   99                              Valor Perdido    2386   0.62

Where we see that for the 14th region, 22.18% of the population doesn’t have access to the public tap water network. The advantage of this approach is that here we used pure SQL for querying, which would work in any language where we can pass SQL queries. The disadvantage is not having a tool such as dplyr and the need to learn the SQL syntax.

We shouldn’t forget to close the database connection.

con.close()

🔗 Concluding remarks

The examples shown here were simplified. Most of the inquiries I get are about data summaries at the sub-communal level, for example how to count the occurrences of a categorical variable in a neighborhood, as we did in our example of water access (i.e., well or public network). By using censo2017 we can extract useful information for evidence-based public policy when it comes to, for example, prioritizing budget decisions.