33. Binning in Python and Pandas
By Bernd Klein. Last modified: 26 Apr 2023.
Introduction
Data binning, also known as bucketing or discretization, is a technique used in data processing and statistics. Binning is useful, for example, when there are more possible values than observed data points. An example is binning the body heights of people into intervals or categories. Let us assume we measure the heights of 30 people. The heights could lie, roughly guessing, between 1.30 metres and 2.50 metres. Theoretically, about 120 different centimetre values are possible, but our sample group can contain at most 30 different values. One way to group them is to put the measured values into bins ranging from 1.30 to 1.50 metres, 1.50 to 1.70 metres, 1.70 to 1.90 metres and so on. Each original value is assigned to the bin into which it fits according to its size, and is then replaced by a value representing the corresponding interval. Binning is a form of quantization.
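The height example can be sketched in a few lines of plain Python. The sample values below are made up for illustration, and the names LOWER_BOUND, WIDTH and height_bin are hypothetical; the sketch assumes heights given in whole centimetres and bins that are closed on the left and open on the right.

```python
# Hypothetical sample: body heights in centimetres (made-up values)
heights_cm = [152, 168, 171, 189, 175, 134, 201, 169]

LOWER_BOUND = 130   # 1.30 m, the lower end of the assumed range
WIDTH = 20          # each bin covers 20 cm

def height_bin(height):
    """Return the interval (low, high) with low <= height < high."""
    index = (height - LOWER_BOUND) // WIDTH
    low = LOWER_BOUND + index * WIDTH
    return (low, low + WIDTH)

# Replace every measured value by the interval it falls into
binned_heights = [height_bin(h) for h in heights_cm]
print(binned_heights)
```

Every height is replaced by its interval, so 152 and 168 both end up as (150, 170).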
Bins do not have to be numerical; they can be categorical values of any kind, like "dogs", "cats", "hamsters", and so on.
Binning is also used in image processing, where it reduces the amount of data by combining neighboring pixels into single pixels. k x k binning reduces each area of k x k pixels to a single pixel.
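A minimal sketch of k x k pixel binning on a nested list might look as follows. The function name bin_pixels is hypothetical, the image dimensions are assumed to be divisible by k, and summing the pixel values of each block is only one possible choice (averaging would be another).

```python
def bin_pixels(image, k):
    """Combine each k x k block of a 2D list into one value (the sum).

    Assumes that the number of rows and columns is divisible by k.
    """
    rows, cols = len(image), len(image[0])
    return [
        [sum(image[r + dr][c + dc] for dr in range(k) for dc in range(k))
         for c in range(0, cols, k)]
        for r in range(0, rows, k)
    ]

# Made-up 4 x 4 "image" with the pixel values 1 to 16
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

binned_image = bin_pixels(image, 2)
print(binned_image)
```

With k = 2 the 4 x 4 input shrinks to a 2 x 2 result, i.e. a quarter of the original amount of data.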
Pandas provides easy ways to create bins and to bin data. Before we describe these Pandas functionalities, we will introduce basic Python functions, working on Python lists and tuples.
Binning in Python
The following Python function can be used to create bins.
def create_bins(lower_bound, width, quantity):
    """ create_bins returns an equal-width (distance) partitioning.
        It returns an ascending list of tuples, representing the intervals.
        A tuple bins[i], i.e. (bins[i][0], bins[i][1]) with i > 0
        and i < quantity, satisfies the following conditions:
            (1) bins[i][0] + width == bins[i][1]
            (2) bins[i-1][0] + width == bins[i][0] and
                bins[i-1][1] + width == bins[i][1]
    """
    bins = []
    # Note: the stop value lies past the last lower edge, so this loop
    # yields quantity + 1 intervals.
    for low in range(lower_bound,
                     lower_bound + quantity*width + 1, width):
        bins.append((low, low+width))
    return bins
We will now call the function with quantity=5, a width of 10 (width=10) and a lower bound of 10 (lower_bound=10). As written, create_bins returns quantity + 1 intervals, so we get six bins:
bins = create_bins(lower_bound=10,
                   width=10,
                   quantity=5)
bins
OUTPUT:
[(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
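The output above contains six intervals although quantity was set to 5, because the stop value of the range lies one step past the last wanted lower edge. If exactly quantity intervals are wanted, a variant could look like the following sketch. The name create_exact_bins is hypothetical and not part of the original code; the later examples on this page keep using create_bins.

```python
def create_exact_bins(lower_bound, width, quantity):
    """Return exactly `quantity` adjacent intervals of equal width."""
    return [(lower_bound + i * width, lower_bound + (i + 1) * width)
            for i in range(quantity)]

exact_bins = create_exact_bins(lower_bound=10, width=10, quantity=5)
print(exact_bins)
```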
The next function, 'find_bin', is called with a value and a list or tuple of bins, each of which has to be a two-tuple or a list of two elements. The function finds the index of the interval that contains the value 'value':
def find_bin(value, bins):
    """ bins is a list of tuples, like [(0,20), (20, 40), (40, 60)],
        find_bin returns the smallest index i of bins so that
        bins[i][0] <= value < bins[i][1],
        or -1 if no bin contains the value
    """
    for i in range(0, len(bins)):
        if bins[i][0] <= value < bins[i][1]:
            return i
    return -1
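find_bin checks the bins one after the other. Since the bins are in ascending order, the lookup can also be done with a binary search. The following sketch uses the bisect module of the standard library; the name find_bin_bisect is hypothetical, and the function assumes that the bins are sorted and do not overlap.

```python
from bisect import bisect_right

def find_bin_bisect(value, bins):
    """Return the index i with bins[i][0] <= value < bins[i][1], else -1.

    Assumes that bins is sorted in ascending order without overlaps.
    """
    lower_edges = [low for low, _ in bins]
    # Index of the last bin whose lower edge is <= value
    i = bisect_right(lower_edges, value) - 1
    if i >= 0 and bins[i][0] <= value < bins[i][1]:
        return i
    return -1

example_bins = [(0, 20), (20, 40), (40, 60)]
for value in (-5, 0, 20, 59.9, 60):
    print(value, find_bin_bisect(value, example_bins))
```

For a handful of bins the linear search is perfectly fine; the binary search mainly pays off when there are many bins and many values. In that case the list of lower edges should be built once rather than on every call.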
from collections import Counter

bins = create_bins(lower_bound=50,
                   width=4,
                   quantity=10)
print(bins)

weights_of_persons = [73.4, 69.3, 64.9, 75.6, 74.9, 80.3,
                      78.6, 84.1, 88.9, 90.3, 83.4, 69.3,
                      52.4, 58.3, 67.4, 74.0, 89.3, 63.4]

binned_weights = []
for value in weights_of_persons:
    bin_index = find_bin(value, bins)
    print(value, bin_index, bins[bin_index])
    binned_weights.append(bin_index)

frequencies = Counter(binned_weights)
print(frequencies)
OUTPUT:
[(50, 54), (54, 58), (58, 62), (62, 66), (66, 70), (70, 74), (74, 78), (78, 82), (82, 86), (86, 90), (90, 94)]
73.4 5 (70, 74)
69.3 4 (66, 70)
64.9 3 (62, 66)
75.6 6 (74, 78)
74.9 6 (74, 78)
80.3 7 (78, 82)
78.6 7 (78, 82)
84.1 8 (82, 86)
88.9 9 (86, 90)
90.3 10 (90, 94)
83.4 8 (82, 86)
69.3 4 (66, 70)
52.4 0 (50, 54)
58.3 2 (58, 62)
67.4 4 (66, 70)
74.0 6 (74, 78)
89.3 9 (86, 90)
63.4 3 (62, 66)
Counter({4: 3, 6: 3, 3: 2, 7: 2, 8: 2, 9: 2, 5: 1, 10: 1, 0: 1, 2: 1})
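All weights in the example above fall into one of the bins. For a value outside of all bins, find_bin returns -1, and bins[-1] would then silently pick the last interval. One possible way to guard against this is sketched below. The shortened bin list and the sample values are made up, find_bin is repeated to keep the sketch self-contained, and using None as the key for unmatched values is just one possible convention.

```python
from collections import Counter

def find_bin(value, bins):
    """Same logic as find_bin above: index of the matching bin or -1."""
    for i, (low, high) in enumerate(bins):
        if low <= value < high:
            return i
    return -1

bins = [(50, 54), (54, 58), (58, 62)]   # shortened bin list for illustration
values = [52.4, 58.3, 47.0, 63.4]       # 47.0 and 63.4 lie outside all bins

frequencies = Counter()
for value in values:
    bin_index = find_bin(value, bins)
    # Map the sentinel -1 to None instead of indexing bins[-1]
    interval = bins[bin_index] if bin_index != -1 else None
    frequencies[interval] += 1

print(frequencies)
```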
Binning with Pandas
The Python module Pandas provides powerful functionality for binning data. We will demonstrate this by using our previous data.
Bins used by Pandas
We used a list of tuples as bins in our previous example. We have to turn this list into a usable data structure for the pandas function "cut". This data structure is an IntervalIndex. We can do this with pd.IntervalIndex.from_tuples:
import pandas as pd
bins2 = pd.IntervalIndex.from_tuples(bins)
"cut" is the name of the Pandas function which is needed to bin values into bins. "cut" takes many parameters, but the most important ones are "x" for the actual values and "bins", which defines the bins, here as an IntervalIndex. "x" can be any 1-dimensional array-like structure, e.g. tuples, lists, NumPy ndarrays and so on:
categorical_object = pd.cut(weights_of_persons, bins2)
print(categorical_object)
OUTPUT:
[(70, 74], (66, 70], (62, 66], (74, 78], (74, 78], ..., (58, 62], (66, 70], (70, 74], (86, 90], (62, 66]]
Length: 18
Categories (11, interval[int64, right]): [(50, 54] < (54, 58] < (58, 62] < (62, 66] ... (78, 82] < (82, 86] < (86, 90] < (90, 94]]
The result of the Pandas function "cut" is a so-called "Categorical object". Each bin is a category. The categories are described in mathematical notation. "(70, 74]" means that this bin contains the values from 70 to 74, where 70 is not included but 74 is included. Mathematically, this is a half-open interval, i.e. an interval in which one endpoint is included but not the other. Sometimes it is also called a half-closed interval.
We had also defined the bins in our previous section as half-open intervals, but the other way round, i.e. closed on the left side and open on the right side. When we used pd.IntervalIndex.from_tuples, we could have defined the "openness" of these bins by setting the parameter "closed" to one of the values:
- 'left': closed on the left side and open on the right
- 'right': (The default) open on the left side and closed on the right
- 'both': closed on both sides
- 'neither': open on both sides
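The effect of the four settings can be checked on a single interval. The following sketch uses pd.Interval with the endpoints 70 and 74 from the example above and tests whether each endpoint belongs to the interval; the dictionary name membership is only chosen for illustration.

```python
import pandas as pd

membership = {}
for closed in ("left", "right", "both", "neither"):
    interval = pd.Interval(70, 74, closed=closed)
    # (is the left endpoint contained?, is the right endpoint contained?)
    membership[closed] = (70 in interval, 74 in interval)
    print(interval, membership[closed])
```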
To have the same behaviour as in our previous section, we will set the parameter closed to "left":
bins2 = pd.IntervalIndex.from_tuples(bins, closed="left")
categorical_object = pd.cut(weights_of_persons, bins2)
print(categorical_object)
OUTPUT:
[[70, 74), [66, 70), [62, 66), [74, 78), [74, 78), ..., [58, 62), [66, 70), [74, 78), [86, 90), [62, 66)]
Length: 18
Categories (11, interval[int64, left]): [[50, 54) < [54, 58) < [58, 62) < [62, 66) ... [78, 82) < [82, 86) < [86, 90) < [90, 94)]
Other Ways to Define Bins
We used an IntervalIndex to define the bins for the weight data. The function "cut" can also cope with two other kinds of bin representations:
- an integer: defines the number of equal-width bins in the range of the values "x". The range of "x" is extended by 0.1% on each side to include the minimum and maximum values of "x".
- a sequence of scalars: defines the bin edges, allowing for non-uniform width. No extension of the range of "x" is done.
categorical_object = pd.cut(weights_of_persons, 18)
print(categorical_object)
OUTPUT:
[(71.35, 73.456], (69.244, 71.35], (62.928, 65.033], (75.561, 77.667], (73.456, 75.561], ..., (56.611, 58.717], (67.139, 69.244], (73.456, 75.561], (88.194, 90.3], (62.928, 65.033]]
Length: 18
Categories (18, interval[float64, right]): [(52.362, 54.506] < (54.506, 56.611] < (56.611, 58.717] < (58.717, 60.822] ... (81.878, 83.983] < (83.983, 86.089] < (86.089, 88.194] < (88.194, 90.3]]
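When "cut" is called with an integer, the computed bin edges can be returned as well by setting the parameter "retbins" to True. The sketch below uses a subset of the weights and four bins, both chosen only for illustration. In the output above the lowest edge (52.362) lies slightly below the smallest weight (52.4), so that the minimum is included in the first right-closed interval; the same can be observed here.

```python
import pandas as pd

weights = [52.4, 58.3, 67.4, 74.0, 89.3, 63.4]   # subset of the weights above

# retbins=True additionally returns the array of bin edges
categorical_object, edges = pd.cut(weights, 4, retbins=True)
print(edges)
print(categorical_object.categories)
```

Four bins need five edges, so "edges" has one element more than there are categories.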
sequence_of_scalars = [x[0] for x in bins]
sequence_of_scalars.append(bins[-1][1])
print(sequence_of_scalars)

categorical_object = pd.cut(weights_of_persons,
                            sequence_of_scalars,
                            right=False)
print(categorical_object)
OUTPUT:
[50, 54, 58, 62, 66, 70, 74, 78, 82, 86, 90, 94]
[[70, 74), [66, 70), [62, 66), [74, 78), [74, 78), ..., [58, 62), [66, 70), [74, 78), [86, 90), [62, 66)]
Length: 18
Categories (11, interval[int64, left]): [[50, 54) < [54, 58) < [58, 62) < [62, 66) ... [78, 82) < [82, 86) < [86, 90) < [90, 94)]
Bin counts and value counts
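The next and most interesting question is how we can see the actual bin counts. This can be accomplished with the function "value_counts". Note that recent Pandas versions deprecate the top-level function pd.value_counts in favour of pd.Series(categorical_object).value_counts():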
pd.value_counts(categorical_object)
OUTPUT:
[66, 70)    3
[74, 78)    3
[62, 66)    2
[78, 82)    2
[82, 86)    2
[86, 90)    2
[50, 54)    1
[58, 62)    1
[70, 74)    1
[90, 94)    1
[54, 58)    0
dtype: int64
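The counts above are sorted by frequency. If the counts are wanted in the order of the bins instead, the result can be sorted by its index. The following sketch assumes a small made-up data set and uses the Series method "value_counts" followed by "sort_index":

```python
import pandas as pd

values = [1, 5, 6, 7, 12]                       # made-up sample values
binned = pd.cut(values, [0, 4, 8, 12, 16], right=False)

counts = pd.Series(binned).value_counts()       # sorted by frequency
counts_in_bin_order = counts.sort_index()       # sorted by interval
print(counts_in_bin_order)
```

Empty bins like [8, 12) are kept with a count of 0, just as [54, 58) in the output above.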
"categorical_object.codes" provides you with a labelling of the input values into the binning categories:
labels = categorical_object.codes
labels
OUTPUT:
array([ 5, 4, 3, 6, 6, 7, 7, 8, 9, 10, 8, 4, 0, 2, 4, 6, 9, 3], dtype=int8)
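All weights in our example fall into one of the bins. A value outside of all bins is turned into NaN by "cut", and its code is -1. A small sketch with made-up values:

```python
import pandas as pd

values = [3, 12, 7, 25]                    # 25 lies outside of all bins
binned = pd.cut(values, [0, 5, 10, 15])

print(binned)
print(binned.codes)
```

A code of -1 should therefore not be used as an index into the categories, because it would select the last category.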
"categories" is the IntervalIndex of the categories to which the label indices refer:
categories = categorical_object.categories
categories
OUTPUT:
IntervalIndex([[50, 54), [54, 58), [58, 62), [62, 66), [66, 70) ... [74, 78), [78, 82), [82, 86), [86, 90), [90, 94)], dtype='interval[int64, left]')
Correspondence from weights data to bins:
for index in range(len(weights_of_persons)):
    label_index = labels[index]
    print(weights_of_persons[index], label_index, categories[label_index])
OUTPUT:
73.4 5 [70, 74)
69.3 4 [66, 70)
64.9 3 [62, 66)
75.6 6 [74, 78)
74.9 6 [74, 78)
80.3 7 [78, 82)
78.6 7 [78, 82)
84.1 8 [82, 86)
88.9 9 [86, 90)
90.3 10 [90, 94)
83.4 8 [82, 86)
69.3 4 [66, 70)
52.4 0 [50, 54)
58.3 2 [58, 62)
67.4 4 [66, 70)
74.0 6 [74, 78)
89.3 9 [86, 90)
63.4 3 [62, 66)
Naming bins
Let's imagine we have a university which confers three levels of Latin honors depending on the grade point average (GPA):
- "summa cum laude" requires a GPA above 3.9
- "magna cum laude" requires a GPA above 3.8
- "cum laude" requires a GPA above 3.6
degrees = ["none", "cum laude", "magna cum laude", "summa cum laude"]
student_results = [3.93, 3.24, 2.80, 2.83, 3.91, 3.698, 3.731, 3.25, 3.24, 3.82, 3.22]
student_results_degrees = pd.cut(student_results, [0, 3.6, 3.8, 3.9, 4.0], labels=degrees)
pd.value_counts(student_results_degrees)
OUTPUT:
none               6
cum laude          2
summa cum laude    2
magna cum laude    1
dtype: int64
Let's have a look at the individual degrees of each student:
labels = student_results_degrees.codes
categories = student_results_degrees.categories
for index in range(len(student_results)):
    label_index = labels[index]
    print(student_results[index], label_index, categories[label_index])
OUTPUT:
3.93 3 summa cum laude
3.24 0 none
2.8 0 none
2.83 0 none
3.91 3 summa cum laude
3.698 1 cum laude
3.731 1 cum laude
3.25 0 none
3.24 0 none
3.82 2 magna cum laude
3.22 0 none
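The bins used in the "cut" call above are closed on the right side, so a GPA of exactly 3.6 still counts as "none" and a GPA of exactly 3.9 as "magna cum laude". The same assignment can be sketched without Pandas by using the bisect module of the standard library. The function name degree_for is hypothetical, and the sketch assumes that every GPA lies above 0 and is at most 4.0.

```python
from bisect import bisect_left

degrees = ["none", "cum laude", "magna cum laude", "summa cum laude"]
thresholds = [3.6, 3.8, 3.9]    # inner bin edges of the pd.cut call above

def degree_for(gpa):
    """Return the label of the right-closed bin which contains gpa.

    Assumes 0 < gpa <= 4.0; values outside this range are not checked.
    """
    return degrees[bisect_left(thresholds, gpa)]

for gpa in (3.93, 3.82, 3.698, 3.6, 3.24):
    print(gpa, degree_for(gpa))
```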