Thursday, June 22, 2017 From rOpenSci (https://ropensci.org/blog/2017/06/22/charlatan/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
charlatan makes fake data.
Excited to announce a new package called charlatan. While perusing
packages from other programming languages, I saw a neat Python library
called faker.
charlatan is inspired by and ports many things from Python’s
faker library (https://github.com/joke2k/faker). In turn, faker was
inspired by PHP’s faker, Perl’s Faker, and Ruby’s faker. It appears
that the PHP library was the original - nice work PHP.
What could you do with this package? Here’s some use cases:
A key feature of charlatan is language support.
Of course for some data types (numbers), languages don’t come into play, but
for many they do. That means you can create fake datasets specific to a
language, or a dataset with a mix of languages, etc. We have not yet ported
every variable that Python’s faker has, and for the variables that are in
this package we have not yet ported every language that faker supports.
We have added some variables to charlatan that are not in faker (e.g.,
taxonomy, gene sequences). Check out the issues to follow progress.
ch_generate: generate a data.frame with fake data
fraudster: single interface to all fake data methods
ch_*: high level functions prefixed with ch_ that wrap low level interfaces, and are meant to be easier to use and provide an easy way to make many instances of a thing
Check out the package vignette to get started.
Install charlatan
install.packages("charlatan")
Or get the development version:
devtools::install_github("ropensci/charlatan")
library(charlatan)
fraudster is an interface for all fake data variables (and locales):
x <- fraudster()
x$job()
#> [1] "Textile designer"
x$name()
#> [1] "Cris Johnston-Tremblay"
x$job()
#> [1] "Database administrator"
x$color_name()
#> [1] "SaddleBrown"
If you want to set locale, do so like fraudster(locale = "{locale}")
The locales that are supported vary by data variable. We’re adding more locales through time, so do check in from time to time - or even better, send a pull request adding support for the locale you want for the variable(s) you want.
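To make the locale setting concrete, here is a minimal sketch of a fraudster client bound to one locale. It assumes the client passes its locale through to each method, and it uses fr_FR only because that locale appears in the job examples in this post; other variables may not support it.

```r
library(charlatan)

# Create a client bound to the French (France) locale
x_fr <- fraudster(locale = "fr_FR")

# Job titles are now drawn from the fr_FR data.
# Output is random, so no expected value is shown here.
x_fr$job()
```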
As an example, you can set locale for job data to any number of supported locales.
For jobs:
ch_job(locale = "en_US", n = 3)
#> [1] "Charity officer" "Financial adviser" "Buyer, industrial"
ch_job(locale = "fr_FR", n = 3)
#> [1] "Illustrateur" "Guichetier"
#> [3] "Responsable d'ordonnancement"
ch_job(locale = "hr_HR", n = 3)
#> [1] "Pomoćnik strojovođe"
#> [2] "Pećar"
#> [3] "Konzervator – restaurator savjetnik"
ch_job(locale = "uk_UA", n = 3)
#> [1] "Фрілансер" "Астрофізик" "Доцент"
ch_job(locale = "zh_TW", n = 3)
#> [1] "包裝設計" "空調冷凍技術人員" "鍋爐操作技術人員"
For colors:
ch_color_name(locale = "en_US", n = 3)
#> [1] "DarkMagenta" "Navy" "LightGray"
ch_color_name(locale = "uk_UA", n = 3)
#> [1] "Синій ВПС" "Темно-зелений хакі" "Берлінська лазур"
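Since each call takes a single locale, one way to build a dataset with a mix of languages is to loop over locales and stack the results. This is only a sketch in base R; the object names (locales, per_locale, mixed_jobs) are made up for illustration and are not part of charlatan.

```r
library(charlatan)

# Locales taken from the job examples above
locales <- c("en_US", "fr_FR", "hr_HR", "uk_UA", "zh_TW")

# Build one small data.frame per locale, then stack them row-wise
per_locale <- lapply(locales, function(loc) {
  data.frame(
    locale = loc,
    job = ch_job(locale = loc, n = 2),
    stringsAsFactors = FALSE
  )
})
mixed_jobs <- do.call(rbind, per_locale)

# 5 locales x 2 jobs each gives 10 rows, with columns `locale` and `job`
dim(mixed_jobs)
```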
charlatan will tell you when a locale is not supported:
ch_job(locale = "cv_MN")
#> Error: cv_MN not in set of available locales
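Because an unsupported locale raises an ordinary R error, you can catch it with tryCatch() and fall back to a locale you know works. The helper below, ch_job_safe(), is hypothetical and not part of charlatan; it is just one possible way to handle this.

```r
library(charlatan)

# Hypothetical helper (not part of charlatan): try the requested locale,
# and fall back to a default locale if charlatan signals an error.
ch_job_safe <- function(locale, n = 1, fallback = "en_US") {
  tryCatch(
    ch_job(locale = locale, n = n),
    error = function(e) {
      message("Falling back to ", fallback, ": ", conditionMessage(e))
      ch_job(locale = fallback, n = n)
    }
  )
}

ch_job_safe("cv_MN", n = 2)  # unsupported here, so en_US jobs are returned
ch_job_safe("fr_FR", n = 2)  # supported, so French jobs are returned
```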
ch_generate() helps you create data.frames with whatever variables
you want that charlatan supports. Then you’re ready to use the
data.frame immediately in whatever your application is.
By default, you get back a certain set of variables. Right now, that is:
name, job, and phone_number.
ch_generate()
#> # A tibble: 10 x 3
#>    name                       job
#>    <chr>                      <chr>
#>  1 Coy Davis                  Geneticist, molecular
#>  2 Artis Senger               Press sub
#>  3 Tal Rogahn                 Town planner
#>  4 Nikolas Carter             Barrister's clerk
#>  5 Sharlene Kemmer            Insurance account manager
#>  6 Babyboy Volkman            Quality manager
#>  7 Dr. Josephus Marquardt DVM Best boy
#>  8 Vernal Dare                Engineer, site
#>  9 Emilia Hessel              Administrator, arts
#> 10 Urijah Beatty              Editor, commissioning
#> # ... with 1 more variables: phone_number <chr>
You can select just the variables you want:
ch_generate('job', 'phone_number', n = 30)
#> # A tibble: 30 x 2
#>    job                        phone_number
#>    <chr>                      <chr>
#>  1 Call centre manager        1-670-715-3079x9104
#>  2 Nurse, learning disability 1-502-781-3386x33524
#>  3 Network engineer           1-692-089-3060
#>  4 Industrial buyer           1-517-855-8517
#>  5 Database administrator     (999)474-9975x89650
#>  6 Operations geologist       06150655769
#>  7 Engineer, land             360-043-3630x592
#>  8 Pension scheme manager     (374)429-6821
#>  9 Personnel officer          1-189-574-3348x338
#> 10 Editor, film/video         1-698-135-1664
#> # ... with 20 more rows
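The result behaves like any other data.frame, so the usual tools apply right away. A small sketch, where the object name people is just for illustration:

```r
library(charlatan)

# Ask for two of the supported variables and 20 rows
people <- ch_generate("name", "job", n = 20)

# Column names match the variables requested
names(people)

# Standard data.frame operations work, e.g. sorting rows by name
people_sorted <- people[order(people$name), ]
head(people_sorted, 3)
```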
A sampling of the data types available in charlatan:
person name
ch_name()
#> [1] "Jefferey West-O'Connell"
ch_name(10)
#> [1] "Dylon Hintz" "Dr. Billy Willms DDS" "Captain Bednar III"
#> [4] "Carli Torp" "Price Strosin III" "Grady Mayert"
#> [7] "Nat Herman-Kuvalis" "Noelle Funk" "Dr. Jaycie Herzog MD"
#> [10] "Ms. Andrea Zemlak"
phone number
ch_phone_number()
#> [1] "643.993.1958"
ch_phone_number(10)
#> [1] "+06(6)6080789632" "05108334280" "447-126-9775"
#> [4] "+96(7)2112213020" "495-425-1506" "1-210-372-3188x514"
#> [7] "(300)951-5115" "680.567.5321" "1-947-805-4758x8167"
#> [10] "888-998-5511x554"
job
ch_job()
#> [1] "Scientist, water quality"
ch_job(10)
#> [1] "Engineer, production"
#> [2] "Architect"
#> [3] "Exhibitions officer, museum/gallery"
#> [4] "Patent attorney"
#> [5] "Surveyor, minerals"
#> [6] "Electronics engineer"
#> [7] "Secondary school teacher"
#> [8] "Intelligence analyst"
#> [9] "Nutritional therapist"
#> [10] "Information officer"
Real data is messy! charlatan makes it easy to create messy data.
This is still in the early stages, so it is not yet available
across most data types and languages, but we’re working on it.
For example, create messy names:
ch_name(50, messy = TRUE)
#> [1] "Mr. Vernell Hoppe Jr." "Annika Considine d.d.s."
#> [3] "Dr. Jose Kunde DDS" "Karol Leuschke-Runte"
#> [5] "Kayleen Kutch-Hintz" "Jahir Green"
#> [7] "Stuart Emmerich" "Hillard Schaden"
#> [9] "Mr. Caden Braun" "Willie Ebert"
#> [11] "Meg Abbott PhD" "Dr Rahn Huel"
#> [13] "Kristina Crooks d.d.s." "Lizbeth Hansen"
#> [15] "Mrs. Peyton Kuhn" "Hayley Bernier"
#> [17] "Dr. Lavon Schimmel d.d.s." "Iridian Murray"
#> [19] "Cary Romaguera" "Tristan Windler"
#> [21] "Marlana Schroeder md" "Mr. Treyton Nitzsche"
#> [23] "Hilmer Nitzsche-Glover" "Marius Dietrich md"
#> [25] "Len Mertz" "Mrs Adyson Wunsch DVM"
#> [27] "Dr. Clytie Feest DDS" "Mr. Wong Lebsack I"
#> [29] "Arland Kessler" "Mrs Billy O'Connell m.d."
#> [31] "Stephen Gerlach" "Jolette Lueilwitz"
#> [33] "Mrs Torie Green d.d.s." "Mona Denesik"
#> [35] "Mitchell Auer" "Miss. Fae Price m.d."
#> [37] "Todd Lehner" "Elva Lesch"
#> [39] "Miss. Gustie Rempel DVM" "Lexie Parisian-Stark"
#> [41] "Beaulah Cremin-Rice" "Parrish Schinner"
#> [43] "Latrell Beier" "Garry Wolff Sr"
#> [45] "Bernhard Vandervort" "Stevie Johnston"
#> [47] "Dawson Gaylord" "Ivie Labadie"
#> [49] "Ronal Parker" "Mr Willy O'Conner Sr."
Right now only suffixes and prefixes for names in the en_US locale
are supported. Notice above some variation in prefixes and suffixes.
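One rough way to see that variation is to pull the leading prefix off each messy name and tabulate the spellings. The regular expression below is my own guess at the common English prefixes, not something charlatan provides, so treat it as a sketch.

```r
library(charlatan)

messy_names <- ch_name(200, messy = TRUE)

# Match a leading prefix with or without a trailing period, followed by a
# space. The list of prefixes is an assumption based on the output above.
prefix_pattern <- "^(Mr|Mrs|Ms|Miss|Dr)\\.?(?= )"
hits <- regexpr(prefix_pattern, messy_names, perl = TRUE)
prefixes <- regmatches(messy_names, hits)

# Counts per spelling, e.g. how often "Mr" appears versus "Mr."
table(prefixes)
```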
We have lots to do still. Some of those things include:
For some of these, faker has the data, but we still need to port it over.
In addition, we may find inspiration from faker libraries in other
programming languages.