Binning in Python and Pandas



Introduction

Binning

Data binning, also known as bucketing or discretization, is a technique used in data processing and statistics. Binning can be used, for example, if there are more possible data points than observed data points. An example is to bin the body heights of people into intervals or categories. Let us assume we take the heights of 30 people. The height values can lie, roughly guessing, between 1.30 metres and 2.50 metres. Theoretically, there are about 120 different centimetre values possible, but we can have at most 30 different values in our sample group. One way to group them is to put the measured values into bins ranging from 1.30 - 1.50 metres, 1.50 - 1.70 metres, 1.70 - 1.90 metres and so on. This means that each original data value is assigned to the bin into which it fits according to its size. The original values are replaced by values representing the corresponding intervals. Binning is a form of quantization.
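
The height example can be sketched in a few lines of plain Python. The sample heights, the constant names and the helper height_bin are assumptions made for this illustration; the bins are 20 cm wide and start at 130 cm, as in the text above.

```python
# Assumed sample data: body heights in centimetres (illustrative, not measured).
heights_cm = [152, 168, 171, 189, 195, 170]

LOWER_BOUND = 130  # lowest height we expect, in cm
WIDTH = 20         # width of each bin, in cm

def height_bin(height):
    """Return the (low, high) interval with low <= height < high."""
    low = LOWER_BOUND + ((height - LOWER_BOUND) // WIDTH) * WIDTH
    return (low, low + WIDTH)

# Each original value is replaced by the interval it falls into.
binned_heights = [height_bin(h) for h in heights_cm]
print(binned_heights)
```

Working in whole centimetres avoids floating-point surprises at the bin borders; a height of exactly 170 cm lands in the bin (170, 190).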

Bins do not necessarily have to be numerical; they can be categorical values of any kind, like "dogs", "cats", "hamsters", and so on.

Binning is also used in image processing, where it can reduce the amount of data by combining neighboring pixels into single pixels. k x k binning reduces areas of k x k pixels into a single pixel.
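
As a rough sketch of k x k binning, the following function works on an image stored as a list of lists. The name bin_pixels is hypothetical, and we assume here that the pixels of a block are combined by summing them (averaging would be another option) and that the image dimensions are multiples of k.

```python
def bin_pixels(image, k):
    """Sum each k x k block of a 2-D list of pixel values into one pixel.

    Assumes the image height and width are multiples of k.
    """
    rows, cols = len(image), len(image[0])
    binned = []
    for r in range(0, rows, k):
        binned_row = []
        for c in range(0, cols, k):
            # Combine all pixels of the k x k block starting at (r, c).
            block_sum = sum(image[r + dr][c + dc]
                            for dr in range(k) for dc in range(k))
            binned_row.append(block_sum)
        binned.append(binned_row)
    return binned

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
small = bin_pixels(image, 2)
print(small)
```

With k = 2, the 4 x 4 image shrinks to 2 x 2, i.e. to a quarter of the original number of pixels.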



Pandas provides easy ways to create bins and to bin data. Before we describe these Pandas functionalities, we will introduce basic Python functions, working on Python lists and tuples.

Binning in Python

The following Python function can be used to create bins.

def create_bins(lower_bound, width, quantity):
    """ create_bins returns an equal-width (distance) partitioning. 
        It returns an ascending list of tuples, representing the intervals.
        A tuple bins[i], i.e. (bins[i][0], bins[i][1])  with i > 0 
        and i < quantity, satisfies the following conditions:
            (1) bins[i][0] + width == bins[i][1]
            (2) bins[i-1][0] + width == bins[i][0] and
                bins[i-1][1] + width == bins[i][1]
    """
    
    bins = []
    for low in range(lower_bound, 
                     lower_bound + quantity*width + 1, width):
        bins.append((low, low+width))
    return bins

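We will now call the function with a width of 10 (width=10), starting from 10 (lower_bound=10) and with quantity=5. Note that the upper limit of the range in create_bins still includes the bin starting at lower_bound + quantity*width, so the call returns quantity + 1 intervals, six in this case: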

bins = create_bins(lower_bound=10,
                   width=10,
                   quantity=5)
bins
After having executed the Python code above we received the following:
[(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
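
If exactly quantity bins are wanted, one possible variant looks like the following sketch. The name create_bins_exact is our own and not part of the code above; the remaining examples of this text keep using create_bins as defined.

```python
def create_bins_exact(lower_bound, width, quantity):
    """Return exactly `quantity` adjacent (low, low + width) tuples."""
    return [(lower_bound + i * width, lower_bound + (i + 1) * width)
            for i in range(quantity)]

exact_bins = create_bins_exact(lower_bound=10, width=10, quantity=5)
print(exact_bins)
```

Because the bounds are computed by multiplication rather than with range, this variant also accepts a float width.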

The next function, 'find_bin', is called with a value 'value' and a list or tuple of bins 'bins'; each bin has to be a two-tuple or a list of two elements. The function returns the index of the interval that contains 'value', or -1 if no interval contains it:

def find_bin(value, bins):
    """ bins is a list of tuples, like [(0,20), (20, 40), (40, 60)],
        find_bin returns the smallest index i of bins so that
        bins[i][0] <= value < bins[i][1], or -1 if there is no such index
    """
    
    for i in range(0, len(bins)):
        if bins[i][0] <= value < bins[i][1]:
            return i
    return -1

In the following example, we bin the weights of 18 persons and count how many weights fall into each bin:

from collections import Counter
bins = create_bins(lower_bound=50,
                   width=4,
                   quantity=10)
print(bins)
weights_of_persons = [73.4, 69.3, 64.9, 75.6, 74.9, 80.3, 
                      78.6, 84.1, 88.9, 90.3, 83.4, 69.3, 
                      52.4, 58.3, 67.4, 74.0, 89.3, 63.4]
binned_weights = []
for value in weights_of_persons:
    bin_index = find_bin(value, bins)
    print(value, bin_index, bins[bin_index])
    binned_weights.append(bin_index)
    
frequencies = Counter(binned_weights)
print(frequencies)
[(50, 54), (54, 58), (58, 62), (62, 66), (66, 70), (70, 74), (74, 78), (78, 82), (82, 86), (86, 90), (90, 94)]
73.4 5 (70, 74)
69.3 4 (66, 70)
64.9 3 (62, 66)
75.6 6 (74, 78)
74.9 6 (74, 78)
80.3 7 (78, 82)
78.6 7 (78, 82)
84.1 8 (82, 86)
88.9 9 (86, 90)
90.3 10 (90, 94)
83.4 8 (82, 86)
69.3 4 (66, 70)
52.4 0 (50, 54)
58.3 2 (58, 62)
67.4 4 (66, 70)
74.0 6 (74, 78)
89.3 9 (86, 90)
63.4 3 (62, 66)
Counter({4: 3, 6: 3, 3: 2, 7: 2, 8: 2, 9: 2, 5: 1, 10: 1, 0: 1, 2: 1})
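
find_bin checks the bins one after the other. For sorted, contiguous bins, a binary search over the lower bounds is a possible alternative. The following sketch uses bisect_right from the standard library; the name find_bin_bisect is hypothetical.

```python
from bisect import bisect_right

def find_bin_bisect(value, bins):
    """Return the index i with bins[i][0] <= value < bins[i][1], or -1.

    Assumes bins is sorted and contiguous, e.g. [(50, 54), (54, 58), ...].
    """
    if not bins or value < bins[0][0] or value >= bins[-1][1]:
        return -1
    lower_edges = [low for low, _ in bins]
    # bisect_right counts the lower edges that are <= value.
    return bisect_right(lower_edges, value) - 1

example_bins = [(50, 54), (54, 58), (58, 62)]
print(find_bin_bisect(52.4, example_bins))
print(find_bin_bisect(58, example_bins))
print(find_bin_bisect(62, example_bins))
```

For many lookups against the same bins, it would make sense to build lower_edges once outside the function instead of on every call.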

Binning with Pandas

The Python module pandas provides powerful functionality for binning data. We will demonstrate this by using our previous data.

Bins used by Pandas

We used a list of tuples as bins in our previous example. We have to turn this list into a data structure that the pandas function "cut" can use. This data structure is an IntervalIndex, which we can create with pd.IntervalIndex.from_tuples:

import pandas as pd
bins2 = pd.IntervalIndex.from_tuples(bins)

"cut" is the name of the Pandas function that is needed to bin values into bins. "cut" takes many parameters, but the most important ones are "x" for the actual values and "bins", which here is the IntervalIndex. "x" can be any 1-dimensional array-like structure, e.g. tuples, lists, NumPy ndarrays and so on:

categorical_object = pd.cut(weights_of_persons, bins2)
print(categorical_object)
[(70, 74], (66, 70], (62, 66], (74, 78], (74, 78], ..., (58, 62], (66, 70], (70, 74], (86, 90], (62, 66]]
Length: 18
Categories (11, interval[int64]): [(50, 54] < (54, 58] < (58, 62] < (62, 66] ... (78, 82] < (82, 86] < (86, 90] < (90, 94]]

The result of the Pandas function "cut" is a so-called "Categorical object". Each bin is a category. The categories are described in mathematical notation. "(70, 74]" means that this bin contains the values from 70 to 74, where 70 is not included but 74 is included. Mathematically, this is a half-open interval, i.e. an interval in which one endpoint is included but not the other. Sometimes it is also called a half-closed interval.
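
The interval notation can be checked directly with a single pd.Interval object. This is only a small sketch; the variable name right_closed is our own.

```python
import pandas as pd

# closed="right" is the default, so this is the interval (70, 74].
right_closed = pd.Interval(70, 74)
print(right_closed)
print(70 in right_closed)   # left endpoint is excluded
print(74 in right_closed)   # right endpoint is included
```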

We had also defined the bins in our previous section as half-open intervals, but the other way round, i.e. with the left side closed and the right side open. When we used pd.IntervalIndex.from_tuples, we could have defined the "openness" of the bins by setting the parameter "closed" to one of the values "left", "right", "both" or "neither". The default is "right".

To have the same behaviour as in our previous section, we will set the parameter closed to "left":

bins2 = pd.IntervalIndex.from_tuples(bins, closed="left")
categorical_object = pd.cut(weights_of_persons, bins2)
print(categorical_object)
[[70, 74), [66, 70), [62, 66), [74, 78), [74, 78), ..., [58, 62), [66, 70), [74, 78), [86, 90), [62, 66)]
Length: 18
Categories (11, interval[int64]): [[50, 54) < [54, 58) < [58, 62) < [62, 66) ... [78, 82) < [82, 86) < [86, 90) < [90, 94)]
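
The difference between the two settings only shows for values that lie exactly on a bin border, such as the weight 74.0. The following sketch uses two bins only; the names edges, right_bins and left_bins are chosen for this illustration.

```python
import pandas as pd

edges = [(70, 74), (74, 78)]
right_bins = pd.IntervalIndex.from_tuples(edges)                # (70, 74], (74, 78]
left_bins = pd.IntervalIndex.from_tuples(edges, closed="left")  # [70, 74), [74, 78)

# codes holds the index of the bin each value was assigned to.
right_codes = pd.cut([74.0], right_bins).codes
left_codes = pd.cut([74.0], left_bins).codes
print(right_codes[0], left_codes[0])
```

With right-closed bins, 74.0 belongs to the first bin; with left-closed bins, it belongs to the second one.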

Other Ways to Define Bins

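We used an IntervalIndex to define the bins for the weight data. The function "cut" can also cope with two other kinds of bin representations: an integer, which defines the number of equal-width bins in the range of the values "x", and a sequence of scalars, which defines the bin edges and thereby allows non-uniform bin widths. With an integer, the lowest edge is moved slightly below the minimum of "x" so that the minimum is included in the first bin: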

categorical_object = pd.cut(weights_of_persons, 17)
print(categorical_object)
[(72.465, 74.694], (68.006, 70.235], (63.547, 65.776], (74.694, 76.924], (74.694, 76.924], ..., (56.859, 59.088], (65.776, 68.006], (72.465, 74.694], (88.071, 90.3], (61.318, 63.547]]
Length: 18
Categories (17, interval[float64]): [(52.362, 54.629] < (54.629, 56.859] < (56.859, 59.088] < (59.088, 61.318] ... (81.382, 83.612] < (83.612, 85.841] < (85.841, 88.071] < (88.071, 90.3]]

A sequence of scalars, here the lower bounds of our bins plus the last upper bound, is interpreted as the bin edges:

sequence_of_scalars = [ x[0] for x in bins]
sequence_of_scalars.append(bins[-1][1])
print(sequence_of_scalars)
categorical_object = pd.cut(weights_of_persons, 
                            sequence_of_scalars,
                            right=False)
print(categorical_object)
[50, 54, 58, 62, 66, 70, 74, 78, 82, 86, 90, 94]
[[70, 74), [66, 70), [62, 66), [74, 78), [74, 78), ..., [58, 62), [66, 70), [74, 78), [86, 90), [62, 66)]
Length: 18
Categories (11, interval[int64]): [[50, 54) < [54, 58) < [58, 62) < [62, 66) ... [78, 82) < [82, 86) < [86, 90) < [90, 94)]
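
When "bins" is an integer, it can be helpful to see which edges were actually computed. The parameter "retbins" of "cut" returns them together with the result. The sample values below are a small subset chosen for this sketch.

```python
import pandas as pd

values = [52.4, 58.3, 67.4, 74.0, 90.3]

# retbins=True additionally returns the computed bin edges as an array.
binned, edges = pd.cut(values, 4, retbins=True)
print(edges)
print(len(binned.categories))
```

Four bins need five edges. The first edge lies slightly below the smallest value, because the bins are closed on the right by default and the minimum would otherwise fall outside the first bin.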

Bin counts and value counts

The next question is how we can see the actual bin counts. This can be accomplished with the function "value_counts":

pd.value_counts(categorical_object)
The previous Python code returned the following result:
[74, 78)    3
[66, 70)    3
[86, 90)    2
[82, 86)    2
[78, 82)    2
[62, 66)    2
[90, 94)    1
[70, 74)    1
[58, 62)    1
[50, 54)    1
[54, 58)    0
dtype: int64
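
The result above is sorted by count. To list the counts in the order of the bins instead, one option is to wrap the Categorical in a Series and sort by the index, as in the following sketch with a small assumed data set. As far as we are aware, newer pandas releases deprecate the top-level function pd.value_counts in favour of this method form.

```python
import pandas as pd

values = [51, 52, 59, 63]
binned = pd.cut(values, [50, 54, 58, 62, 66], right=False)

# value_counts also reports empty bins; sort_index restores the bin order.
counts = pd.Series(binned).value_counts().sort_index()
print(counts)
```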

"categorical_object.codes" provides you with a labelling of the input values into the binning categories:

labels = categorical_object.codes
labels
This gets us the following:
array([ 5,  4,  3,  6,  6,  7,  7,  8,  9, 10,  8,  4,  0,  2,  4,  6,  9,
        3], dtype=int8)

"categorical_object.categories" is the IntervalIndex of the categories to which the label indices refer:

categories = categorical_object.categories
categories
The previous Python code returned the following output:
IntervalIndex([[50, 54), [54, 58), [58, 62), [62, 66), [66, 70) ... [74, 78), [78, 82), [82, 86), [86, 90), [90, 94)]
              closed='left',
              dtype='interval[int64]')

Correspondence from weights data to bins:

for index in range(len(weights_of_persons)):
    label_index = labels[index]
    print(weights_of_persons[index], label_index, categories[label_index] )
73.4 5 [70, 74)
69.3 4 [66, 70)
64.9 3 [62, 66)
75.6 6 [74, 78)
74.9 6 [74, 78)
80.3 7 [78, 82)
78.6 7 [78, 82)
84.1 8 [82, 86)
88.9 9 [86, 90)
90.3 10 [90, 94)
83.4 8 [82, 86)
69.3 4 [66, 70)
52.4 0 [50, 54)
58.3 2 [58, 62)
67.4 4 [66, 70)
74.0 6 [74, 78)
89.3 9 [86, 90)
63.4 3 [62, 66)
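
One pitfall of this loop is worth keeping in mind: a value that lies outside all bins gets the code -1, and indexing the categories with -1 would silently return the last bin. A sketch of a loop that guards against this, with two assumed sample values:

```python
import pandas as pd

values = [45.0, 60.3]
binned = pd.cut(values, [50, 54, 58, 62], right=False)

for value, code in zip(values, binned.codes):
    # A code of -1 marks a value outside all bins (pandas shows it as NaN).
    interval = binned.categories[code] if code >= 0 else None
    print(value, code, interval)
```
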

Naming bins

Let's imagine we have a university that confers three levels of Latin honors depending on the grade point average (GPA):

degrees = ["none", "cum laude", "magna cum laude", "summa cum laude"]
student_results = [3.93, 3.24, 2.80, 2.83, 3.91, 3.698, 3.731, 3.25, 3.24, 3.82, 3.22]
student_results_degrees = pd.cut(student_results, [0, 3.6, 3.8, 3.9, 4.0], labels=degrees)
pd.value_counts(student_results_degrees)
The previous code returned the following result:
none               6
summa cum laude    2
cum laude          2
magna cum laude    1
dtype: int64

Let's have a look at the individual degrees of each student:

labels = student_results_degrees.codes
categories = student_results_degrees.categories
for index in range(len(student_results)):
    label_index = labels[index]
    print(student_results[index], label_index, categories[label_index] )
3.93 3 summa cum laude
3.24 0 none
2.8 0 none
2.83 0 none
3.91 3 summa cum laude
3.698 1 cum laude
3.731 1 cum laude
3.25 0 none
3.24 0 none
3.82 2 magna cum laude
3.22 0 none
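
If only the integer indices of the bins are needed and not the names, "cut" can be called with labels=False. The following sketch uses a few of the GPA values from above; the variable names gpas and codes are our own.

```python
import pandas as pd

gpas = [3.93, 3.24, 3.698, 3.82]

# labels=False returns the bin index of each value instead of a Categorical.
codes = pd.cut(gpas, [0, 3.6, 3.8, 3.9, 4.0], labels=False)
print(codes)
```

The indices correspond to the positions in the list degrees, so degrees[codes[0]] is the degree of the first student.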