Most engineers and scientists understand the need for synthetic data, sometimes also called synthetical data. But what exactly do we mean by synthetic test data? There are many situations in which a scientist or an engineer needs training or test data, but it is hard or impossible to get real data, i.e. a sample from a population obtained by measurement. The challenge of creating synthetic data is to produce data that resembles, or comes quite close to, the intended "real life" data. Python is well suited to producing such data, because it has powerful numerical and text-processing capabilities.
Synthetic data is also necessary to satisfy specific needs or conditions that may not be found in the "real life" data. Another use case for synthetic data is to protect the privacy of the underlying real data.
In our previous chapter "Python, Numpy and Probability", we wrote some functions that we will need in the following:
- find_interval
- weighted_choice
- cartesian_choice
- weighted_cartesian_choice
- weighted_sample
You should be familiar with how these functions work.
We saved the functions in a module named bk_random.
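The module bk_random itself is not reproduced in this chapter. For readers who do not have it at hand, the following is a minimal stand-in built on the standard library's `random` module. The signatures are assumptions inferred from how the functions are called later in this chapter, not the original implementation: `cartesian_choice` is assumed to take any number of sequences, and `weighted_cartesian_choice` any number of (population, weights) pairs.

```python
import random

def weighted_choice(population, weights):
    """Draw one element of population; weights need not sum to 1."""
    return random.choices(population, weights=weights, k=1)[0]

def cartesian_choice(*iterables):
    """Return a list with one uniformly drawn element per sequence."""
    return [random.choice(seq) for seq in iterables]

def weighted_cartesian_choice(*iterables):
    """Each argument is a (population, weights) pair;
    return a list with one weighted draw per pair."""
    return [weighted_choice(population, weights)
            for population, weights in iterables]

print(cartesian_choice(["John", "Eve"], ["Smith", "Baker"]))
print(weighted_cartesian_choice((["John", "Eve"], [0.9, 0.1]),
                                (["Smith", "Baker"], [0.5, 0.5])))
```

Saving these three functions in a file called bk_random.py should be enough to try out the examples below, although the results will of course differ from run to run.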
Definition of the Scope of Synthetic Data Creation
We want to provide solutions to the following task:
We have n finite sets containing data of various types:
D1, D2, ... Dn
The sets Di are the data sets from which we want to derive our synthetic data.
In the actual implementation, the sets will be tuples or lists for practical reasons.
The process of creating synthetic data can be defined by two functions "synthesizer" and "synthesize". Usually, the word synthesizer is used for a computerized electronic device which produces sound. Our synthesizer produces strings or alternatively tuples with data, as we will see later.
The function synthesizer creates the function synthesize:
synthesize = synthesizer( (D1, D2, ... Dn) )
The function synthesize, which may also be a generator as in our implementation, takes no arguments, and the result of a call synthesize() will be
- a list or a tuple t = (d1, d2, ... dn) where di is drawn at random from Di
- or a string which contains the elements str(d1), str(d2), ... str(dn) where di is also drawn at random from Di
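The two result forms described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation developed in this chapter: `make_synthesize` and its parameter `as_string` are hypothetical names, and `random.choice` from the standard library stands in for the bk_random functions.

```python
import random

def make_synthesize(data, as_string=False):
    """Return a function that draws one element from each set in data."""
    def synthesize():
        drawn = tuple(random.choice(d) for d in data)
        if as_string:
            # join str(d1), str(d2), ... into one string
            return " ".join(str(x) for x in drawn)
        return drawn
    return synthesize

D1 = ("red", "green", "blue")
D2 = (1, 2, 3)

synthesize = make_synthesize((D1, D2))
print(synthesize())          # a tuple (d1, d2)

synthesize_str = make_synthesize((D1, D2), as_string=True)
print(synthesize_str())      # a string built from str(d1) and str(d2)
```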
Let us start with a simple example. We have a list of first names and a list of surnames. We want to hire employees for an institute or company. Of course, it is a lot easier to find and hire specialists in our synthetic Python environment than in real life. The function "cartesian_choice" from the bk_random module and the concatenation of the randomly drawn first names and surnames is all it takes.
import bk_random
firstnames = ["John", "Eve", "Jane", "Paul",
              "Frank", "Laura", "Robert",
              "Kathrin", "Roger", "Simone",
              "Bernard", "Sarah", "Yvonne"]
surnames = ["Singer", "Miles", "Moore",
            "Looper", "Rampman", "Chopman",
            "Smiley", "Bychan", "Smith",
            "Baker", "Miller", "Cook"]
number_of_specialists = 15
employees = set()
# a set keeps the names unique; draw until we have enough of them
while len(employees) < number_of_specialists:
    employee = bk_random.cartesian_choice(firstnames, surnames)
    employees.add(" ".join(employee))
print(employees)
This was easy enough, but we want to do it now in a more structured way, using the synthesizer approach we mentioned before. The code for the case in which the parameter "weights" is not None is still missing in the following implementation:
import bk_random
firstnames = ["John", "Eve", "Jane", "Paul",
              "Frank", "Laura", "Robert",
              "Kathrin", "Roger", "Simone",
              "Bernard", "Sarah", "Yvonne"]
surnames = ["Singer", "Miles", "Moore",
            "Looper", "Rampman", "Chopman",
            "Smiley", "Bychan", "Smith",
            "Baker", "Miller", "Cook"]
def synthesizer(data, weights=None, format_func=None, repeats=True):
    """
    "data" is a tuple or list of lists or tuples containing the
    data.
    "weights" is a list or tuple of lists or tuples with the
    corresponding weights of the data lists or tuples
    (not yet used in this version).
    "format_func" is a reference to a function which defines
    how a random result will be formatted.
    If None, the generator "synthesize" will yield the list "res".
    If "repeats" is True, the yielded values may repeat;
    if False, every yielded value is unique.
    """
    def synthesize():
        if not repeats:
            memory = set()
        while True:
            res = bk_random.cartesian_choice(*data)
            if not repeats:
                # redraw until we get a result we have not seen before
                sres = str(res)
                while sres in memory:
                    res = bk_random.cartesian_choice(*data)
                    sres = str(res)
                memory.add(sres)
            if format_func:
                yield format_func(res)
            else:
                yield res
    return synthesize
recruit_employee = synthesizer((firstnames, surnames),
                               format_func=lambda x: " ".join(x),
                               repeats=False)
employee = recruit_employee()
for _ in range(15):
    print(next(employee))
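One caveat of the approach above: with repeats=False, the inner while loop can never terminate once every possible combination has been yielded. A minimal sketch of a guard, assuming that the individual data sets contain no duplicates, is to compute the number of distinct combinations beforehand. The helper name `max_unique` is hypothetical and not part of bk_random.

```python
from math import prod

def max_unique(data):
    """Number of distinct combinations, assuming each set in data
    has no duplicate elements."""
    return prod(len(d) for d in data)

firstnames = ["John", "Eve", "Jane"]
surnames = ["Singer", "Miles"]

limit = max_unique((firstnames, surnames))
print(limit)  # 3 * 2 = 6

requested = 15
if requested > limit:
    print("only", limit, "unique names are possible; asking for",
          requested, "would never finish")
```

With the 13 first names and 12 surnames used above, 13 * 12 = 156 different names are possible, so drawing 15 unique names is safe.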
Every name, i.e. every first name and every surname, had the same likelihood of being drawn in the previous example. This is not very realistic, because in countries like the US or England we expect names like Smith and Miller to occur more often than names like Rampman or Bychan. We will extend our synthesizer function with additional code for the "weighted" case, i.e. the case in which weights is not None. If weights are given, we have to use the function weighted_cartesian_choice from the bk_random module. If "weights" is None, we have to call the function cartesian_choice. We put this decision into a separate subfunction of synthesizer to keep the function synthesize clearer.
We do not want to fiddle around with probabilities between 0 and 1 when defining the weights, so we take a detour via integer weights, which we normalize afterwards.
from bk_random import cartesian_choice, weighted_cartesian_choice
weighted_firstnames = [("John", 80), ("Eve", 70), ("Jane", 2),
                       ("Paul", 8), ("Frank", 20), ("Laura", 6),
                       ("Robert", 17), ("Zoe", 3), ("Roger", 8),
                       ("Edgar", 4), ("Susanne", 11), ("Dorothee", 22),
                       ("Tim", 17), ("Donald", 12), ("Igor", 15),
                       ("Simone", 9), ("Bernard", 8), ("Sarah", 7),
                       ("Yvonne", 11), ("Bill", 12), ("Bernd", 10)]
weighted_surnames = [('Singer', 2), ('Miles', 2), ('Moore', 5),
                     ('Strongman', 5), ('Romero', 3), ("Yiang", 4),
                     ('Looper', 1), ('Rampman', 1), ('Chopman', 1),
                     ('Smiley', 1), ('Bychan', 1), ('Smith', 150),
                     ('Baker', 144), ('Miller', 87), ('Cook', 5),
                     ('Joyce', 1), ('Bush', 5), ('Shorter', 6),
                     ('Wagner', 10), ('Sundigos', 10), ('Firenze', 8),
                     ('Puttner', 20), ('Faulkner', 10), ('Bowman', 11),
                     ('Klein', 1), ('Jungster', 14), ("Warner", 14),
                     ('Tiller', 9), ('Wogner', 10), ('Blumenthal', 16)]
firstnames, weights = zip(*weighted_firstnames)
wsum = sum(weights)
weights_firstnames = [ x / wsum for x in weights]
surnames, weights = zip(*weighted_surnames)
wsum = sum(weights)
weights_surnames = [ x / wsum for x in weights]
weights = (weights_firstnames, weights_surnames)
def synthesizer(data, weights=None, format_func=None, repeats=True):
    """
    "data" is a tuple or list of lists or tuples containing the
    data.
    "weights" is a list or tuple of lists or tuples with the
    corresponding weights of the data lists or tuples.
    "format_func" is a reference to a function which defines
    how a random result will be formatted.
    If None, the generator "synthesize" will yield the list "res".
    If "repeats" is True, the values yielded by "synthesize"
    may repeat; if False, every yielded value is unique.
    """
    def choice(data, weights):
        # pair each data list with its weights in the weighted case
        if weights:
            return weighted_cartesian_choice(*zip(data, weights))
        else:
            return cartesian_choice(*data)
    def synthesize():
        if not repeats:
            memory = set()
        while True:
            res = choice(data, weights)
            if not repeats:
                # redraw until we get a result we have not seen before
                sres = str(res)
                while sres in memory:
                    res = choice(data, weights)
                    sres = str(res)
                memory.add(sres)
            if format_func:
                yield format_func(res)
            else:
                yield res
    return synthesize
recruit_employee = synthesizer((firstnames, surnames),
                               weights=weights,
                               format_func=lambda x: " ".join(x),
                               repeats=False)
employee = recruit_employee()
for _ in range(12):
    print(next(employee))
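To see that the weights really shape the output, one can count how often each name is drawn and compare the observed frequencies with the normalized weights. The following sketch uses `random.choices` from the standard library as a stand-in for the weighted draw in bk_random (an assumption), together with a shortened surname list chosen only for illustration.

```python
import random
from collections import Counter

weighted_surnames = [("Smith", 150), ("Baker", 144), ("Miller", 87),
                     ("Rampman", 1), ("Bychan", 1)]
surnames, weights = zip(*weighted_surnames)

# normalize the integer weights so that they sum to 1
wsum = sum(weights)
probabilities = [w / wsum for w in weights]

draws = random.choices(surnames, weights=probabilities, k=10_000)
counts = Counter(draws)
for name, p in zip(surnames, probabilities):
    print(f"{name:8s} expected {p:.3f}  observed {counts[name] / 10_000:.3f}")
```

The observed frequencies vary from run to run, but for a sample of this size they should lie close to the expected probabilities.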
Wine Example
Let's imagine that you have to describe a dozen wines. Probably a pleasant thought for many, but I have to admit that it is not for me. The main reason is that I am not a wine drinker!
We can write a little Python program that uses our synthesize function to automatically create "sophisticated criticisms" like this one:
This wine is light-bodied with a conveniently juicy bouquet leading to a lingering flamboyant finish!
Try to find some adverbs, like "seamlessly", "assertively", and some adjectives, like "fruity" and "refined", to describe the aroma.
Once you have defined your lists, you can use the synthesize function.
Here is our solution, in case you don't want to do it on your own:
import bk_random
body = ['light-bodied', 'medium-bodied', 'full-bodied']
adverbs = ['appropriately', 'assertively', 'authoritatively',
           'compellingly', 'completely', 'continually',
           'conveniently', 'credibly', 'distinctively',
           'dramatically', 'dynamically', 'efficiently',
           'energistically', 'enthusiastically', 'fungibly',
           'globally', 'holisticly', 'interactively',
           'intrinsically', 'monotonectally', 'objectively',
           'phosfluorescently', 'proactively', 'professionally',
           'progressively', 'quickly', 'rapidiously',
           'seamlessly', 'synergistically', 'uniquely']
noun = ['aroma', 'bouquet', 'flavour']
aromas = ['angular', 'bright', 'lingering', 'butterscotch',
          'buttery', 'chocolate', 'complex', 'earth', 'flabby',
          'flamboyant', 'fleshy', 'flowers', 'food friendly',
          'fruits', 'grass', 'herbs', 'jammy', 'juicy', 'mocha',
          'oaked', 'refined', 'structured', 'tight', 'toast',
          'toasty', 'tobacco', 'unctuous', 'unoaked', 'vanilla',
          'velvetly']
example = """This wine is light-bodied with a completely buttery
bouquet leading to a lingering fruity finish!"""
def describe(data):
    # data holds one random draw from each of the five lists
    body, adv, adj, noun, adj2 = data
    format_str = "This wine is %s with a %s %s %s\nleading to"
    format_str += " a lingering %s finish!"
    return format_str % (body, adv, adj, noun, adj2)
t = bk_random.cartesian_choice(body, adverbs, aromas, noun, aromas)
data = (body, adverbs, aromas, noun, aromas)
synthesize = synthesizer(data, weights=None, format_func=describe, repeats=True)
criticism = synthesize()
for i in range(1, 13):
    print("{0:d}. wine:".format(i))
    print(next(criticism))
    print()
Exercise: International Disaster Operation
It would be wonderful if the problem described in this exercise were purely synthetic, i.e. if there were no further catastrophes in the world. Completely unrealistic, but a nice daydream. So, the task of this exercise is to provide synthetic test data for an international disaster operation. The countries taking part in this mission might be, e.g., France, Switzerland, Germany, Canada, The Netherlands, The United States, Austria, Belgium and Luxembourg.
We want to create a file with random entries of aid workers. Each line should consist of:
UniqueIdentifier, FirstName, LastName, Country, Field
For example:
001, Jean-Paul, Rennier, France, Medical Aid
002, Nathan, Bloomfield, Canada, Security Aid
003, Michael, Mayer, Germany, Social Worker
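A single line in this format can be built by zero-padding the identifier and joining the fields. The following is only a sketch of the formatting step; the function name `format_entry` is hypothetical.

```python
def format_entry(identifier, firstname, lastname, country, field):
    """Build one line: zero-padded id followed by the other fields."""
    uid = "{:03d}".format(identifier)
    return ", ".join((uid, firstname, lastname, country, field))

print(format_entry(1, "Jean-Paul", "Rennier", "France", "Medical Aid"))
# 001, Jean-Paul, Rennier, France, Medical Aid
```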
For practical reasons, we will reduce the countries to France, Italy, Switzerland and Germany in the following example implementation:
from bk_random import cartesian_choice, weighted_cartesian_choice
countries = ["France", "Italy", "Switzerland", "Germany"]
w_firstnames = {
    "France": [("Marie", 10), ("Thomas", 10),
               ("Camille", 10), ("Nicolas", 9),
               ("Léa", 10), ("Julien", 9),
               ("Manon", 9), ("Quentin", 9),
               ("Chloé", 8), ("Maxime", 9),
               ("Laura", 7), ("Alexandre", 6),
               ("Clementine", 2), ("Grégory", 2),
               ("Sandra", 1), ("Philippe", 1)],
    "Switzerland": [("Sarah", 10), ("Hans", 10),
                    ("Laura", 9), ("Peter", 8),
                    ("Mélissa", 9), ("Walter", 7),
                    ("Océane", 7), ("Daniel", 7),
                    ("Noémie", 6), ("Reto", 7),
                    ("Laura", 7), ("Bruno", 6),
                    ("Eva", 2), ("Urli", 4),
                    ("Sandra", 1), ("Marcel", 1)],
    "Germany": [("Ursula", 10), ("Peter", 10),
                ("Monika", 9), ("Michael", 8),
                ("Brigitte", 9), ("Thomas", 7),
                ("Stefanie", 7), ("Andreas", 7),
                ("Maria", 6), ("Wolfgang", 7),
                ("Gabriele", 7), ("Manfred", 6),
                ("Nicole", 2), ("Matthias", 4),
                ("Christine", 1), ("Dirk", 1)],
    "Italy": [("Francesco", 20), ("Alessandro", 19),
              ("Mattia", 19), ("Lorenzo", 18),
              ("Leonardo", 16), ("Andrea", 15),
              ("Gabriele", 14), ("Matteo", 14),
              ("Tommaso", 12), ("Riccardo", 11),
              ("Sofia", 20), ("Aurora", 18),
              ("Giulia", 16), ("Giorgia", 15),
              ("Alice", 14), ("Martina", 13)]}
w_surnames = {
    "France": [("Matin", 10), ("Bernard", 10),
               ("Camille", 10), ("Nicolas", 9),
               ("Dubois", 10), ("Petit", 9),
               ("Durand", 8), ("Leroy", 8),
               ("Fournier", 7), ("Lambert", 6),
               ("Mercier", 5), ("Rousseau", 4),
               ("Mathieu", 2), ("Fontaine", 2),
               ("Muller", 1), ("Robin", 1)],
    "Switzerland": [("Müller", 10), ("Meier", 10),
                    ("Schmid", 9), ("Keller", 8),
                    ("Weber", 9), ("Huber", 7),
                    ("Schneider", 7), ("Meyer", 7),
                    ("Steiner", 6), ("Fischer", 7),
                    ("Gerber", 7), ("Brunner", 6),
                    ("Baumann", 2), ("Frei", 4),
                    ("Zimmermann", 1), ("Moser", 1)],
    "Germany": [("Müller", 10), ("Schmidt", 10),
                ("Schneider", 9), ("Fischer", 8),
                ("Weber", 9), ("Meyer", 7),
                ("Wagner", 7), ("Becker", 7),
                ("Schulz", 6), ("Hoffmann", 7),
                ("Schäfer", 7), ("Koch", 6),
                ("Bauer", 2), ("Richter", 4),
                ("Klein", 2), ("Schröder", 1)],
    "Italy": [("Rossi", 20), ("Russo", 19),
              ("Ferrari", 19), ("Esposito", 18),
              ("Bianchi", 16), ("Romano", 15),
              ("Colombo", 14), ("Ricci", 14),
              ("Marino", 12), ("Grecco", 11),
              ("Bruno", 10), ("Gallo", 12),
              ("Conti", 16), ("De Luca", 15),
              ("Costa", 14), ("Giordano", 13),
              ("Mancini", 14), ("Rizzo", 13),
              ("Lombardi", 11), ("Moretto", 9)]}
# separate names and weights
synthesize = {}
identifier = 1
for country in w_firstnames:
    firstnames, weights = zip(*w_firstnames[country])
    wsum = sum(weights)
    weights_firstnames = [x / wsum for x in weights]
    w_firstnames[country] = [firstnames, weights_firstnames]
    surnames, weights = zip(*w_surnames[country])
    wsum = sum(weights)
    weights_surnames = [x / wsum for x in weights]
    # store the surname weights alongside the surnames
    w_surnames[country] = [surnames, weights_surnames]
    synthesize[country] = synthesizer((firstnames, surnames),
                                      (weights_firstnames,
                                       weights_surnames),
                                      format_func=lambda x: " ".join(x),
                                      repeats=False)
nation_prob = [("Germany", 0.3),
               ("France", 0.4),
               ("Switzerland", 0.2),
               ("Italy", 0.1)]
profession_prob = [("Medical Aid", 0.3),
                   ("Social Worker", 0.6),
                   ("Security Aid", 0.1)]
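helpers = []
# create one generator per country up front, so that repeats=False
# keeps names unique across all draws for that country
name_generators = {country: synthesize[country]() for country in synthesize}
for _ in range(200):
    country = weighted_cartesian_choice(zip(*nation_prob))
    profession = weighted_cartesian_choice(zip(*profession_prob))
    country, profession = country[0], profession[0]
    s = name_generators[country]
    uid = "{id:05d}".format(id=identifier)
    helpers.append((uid, country, next(s), profession))
    identifier += 1
print(helpers)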
with open("disaster_mission.txt", "w") as fh:
    fh.write("Reference number,Country,Name,Function\n")
    for el in helpers:
        fh.write(",".join(el) + "\n")
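Joining the fields with "," works as long as no field contains a comma itself. As a more robust alternative, the standard library's `csv` module takes care of quoting. The sketch below writes to an in-memory buffer instead of a real file and reads the result back; the two sample rows are made up for illustration.

```python
import csv
import io

helpers = [("00001", "France", "Marie Dubois", "Medical Aid"),
           ("00002", "Italy", "Sofia De Luca", "Social Worker")]

# stands in for open("disaster_mission.txt", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(("Reference number", "Country", "Name", "Function"))
writer.writerows(helpers)

# read the data back to check that it round-trips
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[0])
print(rows[1:])
```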