Introduction

During my PhD, I often had to create CDF (cumulative distribution function) plots. For example, I use CDF plots in my paper Managing Latency and Excess Data Dissemination in Fog-Based Publish/Subscribe Systems (DOI/Website) to report latency measurements that were collected by multiple end devices for different data distribution strategies.

In this blog post, I will showcase why CDFs are a particularly good fit for such a use case and how easy it is to generate them with seaborn.

Exploring the Sample Data

For the purpose of this blog post, I created an artificial sample dataset with latency measurements for three communication paths.
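
The cell that generated the data is not part of this post. As a minimal sketch, a dataset with the same columns could be built as shown below, assuming normally distributed latencies. The helper name `make_sample_data`, the seed, and all distribution parameters are illustrative assumptions and not necessarily the values behind the plots in this post.

```python
import numpy as np
import pandas as pd


def make_sample_data(n=10_000, seed=42):
    """Build a synthetic latency dataset (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    # assumed distributions: a is fast, b is slower with a wide spread,
    # c randomly mixes two latency levels (about 40 and about 60)
    latencies = {
        "a": rng.normal(30, 3, n),
        "b": rng.normal(48, 8, n),
        "c": np.where(
            rng.random(n) < 0.5, rng.normal(40, 1, n), rng.normal(60, 1, n)
        ),
    }
    frames = [
        pd.DataFrame(
            {"timestamp": np.arange(1, n + 1), "path": path, "latency": values}
        )
        for path, values in latencies.items()
    ]
    return pd.concat(frames, ignore_index=True)


df = make_sample_data()
```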

df.head()
timestamp path latency
0 1 a 30.735979
1 2 a 28.509467
2 3 a 33.764358
3 4 a 29.585823
4 5 a 27.072539
sns.boxplot(data=df, x='path', y='latency')
<AxesSubplot:xlabel='path', ylabel='latency'>

Plotting the data in a boxplot already tells us that communication path a experiences the smallest median latency. Communication path b has a slightly lower median latency than c, but larger min and max values.

sns.relplot(data=df, kind='line', x='timestamp', y='latency', col='path')
<seaborn.axisgrid.FacetGrid at 0x7ff397bef580>

Plotting the latencies as line plots tells us that the latency pattern does not change throughout the experiment.

sns.relplot(data=df, kind='scatter', x='timestamp', y='latency', col='path')
<seaborn.axisgrid.FacetGrid at 0x7ff397cdca90>

The scatter plot reveals something very interesting: There are two groups of measurements on communication path c: one with a latency of about 60, and one with a latency of about 40. This information is not available from the box or line plot.

Calculating the Cumulative Distribution

For a research paper, you typically want as few plots as possible since space is limited. Thus, a CDF plot is often a good option since it has a high information density. The first step for creating such a plot is to calculate the cumulative distribution of your input data. In our case, we want to plot the cumulative distribution of latency measurements for each path.

from collections import Counter

import pandas as pd

frames = []
paths = df['path'].unique()

for path in paths:
    # create dataframe for each path
    path_df = df[df['path'] == path]

    # create dataframe with count per latency for the chosen path
    df_tmp = pd.DataFrame(dict(Counter(path_df['latency'])), index=[0]).T
    # transform index into column
    df_tmp.reset_index(inplace=True)
    # set correct column names
    df_tmp.columns = ['latency', 'count']
    # add a path column
    df_tmp.insert(0, 'path', path)

    # calculate distribution
    df_tmp.sort_values(by='latency', inplace=True)
    df_tmp["cumsum"] = df_tmp["count"].cumsum()
    total = df_tmp["count"].sum()
    df_tmp["cumulative_distribution"] = df_tmp["cumsum"] / total

    # collect the per-path result
    frames.append(df_tmp)

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
df_cdf = pd.concat(frames)
    
# sort and reset the index just for the aesthetics
df_cdf.reset_index(inplace=True, drop=True)
df_cdf.sort_values(by="path", inplace=True)

# let's check how it looks
df_cdf
path latency count cumsum cumulative_distribution
0 a 18.794713 1 1 0.0001
6663 a 31.286361 1 6664 0.6664
6664 a 31.286763 1 6665 0.6665
6665 a 31.286795 1 6666 0.6666
6666 a 31.286961 1 6667 0.6667
... ... ... ... ... ...
23333 c 40.414325 1 3334 0.3334
23334 c 40.414616 1 3335 0.3335
23335 c 40.414784 1 3336 0.3336
23328 c 40.411314 1 3329 0.3329
29999 c 63.566645 1 10000 1.0000

30000 rows × 5 columns

As you can see, the resulting dataframe contains information on how often each latency occurs for each path, as well as the corresponding cumulative distribution. Plotting this data is then straightforward.

sns.lineplot(data=df_cdf, x="latency", y="cumulative_distribution", hue="path")
<AxesSubplot:xlabel='latency', ylabel='cumulative_distribution'>
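
The loop above can also be written more compactly. As a sketch, assuming the same `path` and `latency` columns, a grouped rank yields the same cumulative distribution without counting values by hand. The tiny `df` below is a made-up stand-in for the real measurements.

```python
import pandas as pd

# made-up stand-in for the real measurements
df = pd.DataFrame({
    "path": ["a", "a", "a", "a", "b", "b"],
    "latency": [30.0, 28.0, 30.0, 35.0, 50.0, 45.0],
})

df_cdf = df.sort_values(["path", "latency"]).reset_index(drop=True)
# share of measurements per path that are <= the current latency
df_cdf["cumulative_distribution"] = (
    df_cdf.groupby("path")["latency"].rank(method="max", pct=True)
)
# keep one row per distinct latency, as in the loop-based version
df_cdf = df_cdf.drop_duplicates(["path", "latency"]).reset_index(drop=True)
```

The result can be passed to the same `sns.lineplot` call. If you only need the plot and not the intermediate dataframe, seaborn 0.11 and later also offer `sns.ecdfplot`, which computes and draws the distribution in one step.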

This plot tells us quite a lot. By looking at the cumulative distribution, we can easily read off the min latency (where the curve leaves 0.0) and the max latency (where it reaches 1.0), as well as latency ranges. For example, for path a, 60% of all measurements are between 27 and 33. Furthermore, we can retrieve information on the distribution of values, e.g., b has a higher variance than a. We can also identify distinct groups of measurements by looking for steps in the distribution function, as for path c. In this case, we can confirm that there are two groups, but also learn that both groups have the same size, i.e., each group contains 50% of the measurements.
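
The values read off the plot can be cross-checked numerically. A minimal sketch, assuming the `path` and `latency` columns from above and using made-up numbers in place of the real data: the 0.0 and 1.0 quantiles are the min and max, and the 0.2 and 0.8 quantiles bound the central 60% of the measurements.

```python
import pandas as pd

# made-up stand-in for the real measurements
df = pd.DataFrame({
    "path": ["a"] * 5 + ["b"] * 5,
    "latency": [10.0, 20.0, 30.0, 40.0, 50.0, 0.0, 25.0, 50.0, 75.0, 100.0],
})

# one row per path, one column per quantile
quantiles = (
    df.groupby("path")["latency"]
    .quantile([0.0, 0.2, 0.8, 1.0])
    .unstack()
)
print(quantiles)
```

Note that pandas interpolates linearly between neighboring values by default, so the numbers can differ slightly from what the step function in the plot shows.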

If your data has more dimensions, e.g., there could also be a client column that indicates which client sent a request, you only need to make minor modifications. To avoid collapsing all dimensions into the per-path lines, you have to add one nested for-loop for each additional dimension. This could then look like this:

```python
paths = df['path'].unique()

for path in paths:
    for client in df['client'].unique():
        # create dataframe for each path and client
        path_client_df = df[(df['path'] == path) & (df['client'] == client)]

        # continue as normal, but do not forget to also add a client column
```

Closing Remarks

In this blog post, I quickly showcased how one can create CDF plots with seaborn, and why they are a particularly good fit for latency measurements. The presented approach can also be applied to datasets with more dimensions by adding nested for-loops.