Chapter 3: Modeling¶
Types of modeling¶
In this chapter we will focus on modeling our datasets. In our case the term “modeling” refers to any activity that takes in our data and produces a result. We’ll be doing three different types of modeling:
Creating summaries of our data by calculating statistics (e.g. mean, variance).
Fitting a function to our data.
Using bootstrapping methods to calculate statistical moments.
To accomplish these tasks, we’ll be using various features of our frameworks:
Groupings - Grouping data by a common variable for efficient summaries and plotting.
Apply- and map-functions - Functions that run another function on each row or group of our data structure.
Nesting/stacking - Storing data frames in our data frames.
Grouping data by a common variable¶
Description of data - Triton cluster file statistics¶
In grouping, one chooses one or more variables to use as a grouping index, and these indices are then used to apply some function to each subgroup. Groupings are especially useful when calculating statistics for each subgroup, creating multiple plots, or determining plot colors.
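As a minimal sketch with a small made-up table (not the Triton data described below), grouping by one column and applying a function to each subgroup looks like this in pandas:
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'b'],
    'value': [1, 2, 3, 4, 5]
})
# Iterate over the subgroups produced by groupby and run a function on each
for name, subgroup in toy.groupby('group'):
    print(name, subgroup['value'].sum())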
This can be visualized in the following way:
Let’s consider the dataset presented below. It contains the number of files stored on Aalto University’s Triton cluster’s file system. The columns are:
File size rounded down to the nearest power of 2 (e.g. a 3 MB file is counted towards the row with size 2097152, since 2^21 = 2097152; see the quick check after this list)
File modification date in months before 31.12.2020 (e.g. a file written on 3.9.2020 has the value 3)
Number of files
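As a quick stand-alone check of the size bucketing described above (illustrative only, not part of the data loader that follows):
import numpy as np

size_bytes = 3 * 2**20                    # a 3 MB file
bucket = 2 ** int(np.log2(size_bytes))    # round down to the nearest power of two
print(bucket)                             # prints 2097152, i.e. 2**21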
def load_filesizes(filesizes_file):
    filesizes = pd.read_table(filesizes_file, sep=r'\s+', names=['Bytes', 'MonthsTo2021', 'Files'])
    # Remove empty files
    filesizes = filesizes[filesizes.loc[:, 'Bytes'] != 0]
    # Create a column for log2 of bytes
    filesizes['BytesLog2'] = np.log2(filesizes.loc[:, 'Bytes'])
    filesizes.loc[:, 'BytesLog2'] = filesizes.loc[:, 'BytesLog2'].astype(np.int64)
    # Determine total space S used by N files of size X during date D: S=N*X
    filesizes['SpaceUsage'] = filesizes.loc[:, 'Bytes']*filesizes.loc[:, 'Files']
    # Determine file year and month from the MonthsTo2021-column
    filesizes['TotalMonths'] = 2021*12 - filesizes['MonthsTo2021'] - 1
    filesizes['Year'] = filesizes['TotalMonths'] // 12
    filesizes['Month'] = filesizes['TotalMonths'] % 12 + 1
    filesizes['Day'] = 1
    # Set year for really old files and files with incorrect timestamps
    invalid_years = (filesizes['Year'] < 2010) | (filesizes['Year'] > 2020)
    filesizes.loc[invalid_years, ['Year', 'Month']] = np.NaN
    # Get month names for the correct ordering of Month categories
    month_names = pd.date_range(start='2000-01', freq='M', periods=12).month_name()
    # Create Date
    filesizes['Date'] = pd.to_datetime(filesizes[['Year', 'Month', 'Day']])
    # Set Month
    filesizes['Month'] = pd.Categorical(filesizes['Date'].dt.month_name(), categories=month_names, ordered=True)
    # Set Month to be an ordered categorical with predefined levels
    filesizes['Month'] = pd.Categorical(filesizes['Month'], categories=month_names, ordered=True)
    # Sort data based on Date and BytesLog2
    filesizes.sort_values(['Date', 'BytesLog2'], inplace=True)
    # Remove old columns
    filesizes.drop(['MonthsTo2021', 'TotalMonths', 'Day'], axis=1, inplace=True)
    return filesizes
filesizes = load_filesizes('../data/filesizes_timestamps.txt')
filesizes.head()
Bytes Files BytesLog2 SpaceUsage Year Month Date
287 1 5 0 5 2010.0 January 2010-01-01
451 2 3 1 6 2010.0 January 2010-01-01
627 4 27 2 108 2010.0 January 2010-01-01
822 8 136 3 1088 2010.0 January 2010-01-01
1057 16 208 4 3328 2010.0 January 2010-01-01
load_filesizes <- function(filesizes_file){
filesizes <- read_table2(filesizes_file, col_names=c('Bytes','MonthsTo2021', 'Files'))
filesizes <- filesizes %>%
# Remove empty files
filter(Bytes != 0) %>%
# Create a column for log2 of bytes
mutate(BytesLog2 = log2(Bytes)) %>%
# Determine total space S used by N files of size X during date D: S=N*X
mutate(SpaceUsage = Bytes*Files) %>%
# Determine file year and month from the MonthsTo2021-column
mutate(
TotalMonths = 2021*12 - MonthsTo2021 - 1,
Year = TotalMonths %/% 12,
Month = TotalMonths %% 12 +1,
Day = 1
)
# Set year for really old files and files with incorrect timestamps
invalid_years = c((filesizes['Year'] < 2010) | (filesizes['Year'] > 2020))
filesizes[invalid_years, c('Year','Month')] <- NaN
# Get month names for the correct ordering of Month categories
month_names <- month(seq(1,12), label=TRUE, locale='C')
filesizes <- filesizes %>%
mutate(
# Create Date and get the name for the month
Date = make_datetime(Year, Month, Day),
# Set Month
Month=month(Month, label=TRUE, locale='C'),
# Set Month to be an ordered categorical with predefined levels
Month=factor(Month, ordered=TRUE, levels=month_names))
filesizes <- filesizes %>%
# Sort data based on Date and BytesLog2
arrange(Date, BytesLog2) %>%
# Remove old columns
select(-MonthsTo2021,-TotalMonths,-Day)
return(filesizes)
}
filesizes <- load_filesizes('../data/filesizes_timestamps.txt')
head(filesizes)
Parsed with column specification:
cols(
Bytes = col_double(),
MonthsTo2021 = col_double(),
Files = col_double()
)
A tibble: 6 × 7 Bytes Files BytesLog2 SpaceUsage Year Month Date
<dbl> <dbl> <dbl> <dbl> <dbl> <ord> <dttm>
1 5 0 5 2010 Jan 2010-01-01
2 3 1 6 2010 Jan 2010-01-01
4 27 2 108 2010 Jan 2010-01-01
8 136 3 1088 2010 Jan 2010-01-01
16 208 4 3328 2010 Jan 2010-01-01
32 653 5 20896 2010 Jan 2010-01-01
Simple groupings and summaries - Calculating new files per year¶
Summaries are usually statistics that are calculated on some (or all) columns, based on some grouping. Typical summaries include the mean, variance and covariance. It is important to note that summaries are calculated along some axis of our dataset. Typically they are done column-wise, as rows are the recommended direction for observations (each column then holds one variable). Summaries also reduce the size of our dataset.
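As a minimal sketch with made-up numbers (not our file data), a grouped summary collapses many rows into one row of statistics per group:
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 5.0]
})
# Four rows in, two rows out: mean and variance of 'value' per group
print(toy.groupby('group')['value'].agg(['mean', 'var']))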
Summaries can be visualized in the following way:
Our parsed data contains columns for the date, year and month, the size of the files in two different formats, the number of files and the total space used by those files. Let’s say we’re interested in how the number of files has increased each year. To do this, we’ll first limit our focus to the relevant columns.
# Drop rows with NaNs (invalid years)
newfiles_relevant = filesizes.dropna(axis=0)
# Pick relevant columns
newfiles_relevant = newfiles_relevant.loc[:,['Year','Files']]
newfiles_relevant.head()
Year Files
287 2010.0 5
451 2010.0 3
627 2010.0 27
822 2010.0 136
1057 2010.0 208
newfiles_relevant <- filesizes %>%
# Drop rows with NaNs (invalid years)
drop_na() %>%
# Pick relevant columns
select(Year, Files) %>%
# Change year to category for prettier plotting
mutate(Year=as.factor(Year))
head(newfiles_relevant)
A tibble: 6 × 2 Year Files
<fct> <dbl>
2010 5
2010 3
2010 27
2010 136
2010 208
2010 653
Now we’ll want to group our data based on the year column (Year) and calculate the total number of files (Files) across all rows (all dates and file sizes).
print(newfiles_relevant.shape)
newfiles_yearly_sum = newfiles_relevant.groupby('Year').agg('sum')
print(newfiles_yearly_sum.shape)
newfiles_yearly_sum.head()
(4698, 2)
(11, 1)
Files
Year
2010.0 5590287
2011.0 13197038
2012.0 17099900
2013.0 14755151
2014.0 26329321
glimpse(newfiles_relevant)
newfiles_yearly_sum <- newfiles_relevant %>%
group_by(Year) %>%
summarize(Files=sum(Files))
glimpse(newfiles_yearly_sum)
head(newfiles_yearly_sum)
Year Files
2010 5590287
2011 13197038
2012 17099900
2013 14755151
2014 26329321
2015 24896331
In Python we see that the output of agg still uses the grouping as its index, and for plotting we’ll want to reset that index. In R, summarise removes the last layer of grouping, but let’s verify that the data is ungrouped.
newfiles_yearly_sum.reset_index(inplace=True)
newfiles_yearly_sum.head()
Year Files
0 2010.0 5590287
1 2011.0 13197038
2 2012.0 17099900
3 2013.0 14755151
4 2014.0 26329321
newfiles_yearly_sum <- newfiles_yearly_sum %>%
ungroup()
head(newfiles_yearly_sum)
Year Files
2010 5590287
2011 13197038
2012 17099900
2013 14755151
2014 26329321
2015 24896331
Let’s plot this data in a bar plot:
newfiles_yearly_sum['Year'] = newfiles_yearly_sum['Year'].astype(int).astype('category')
sb.barplot(x='Year', y='Files', data=newfiles_yearly_sum, ci=None)
options(repr.plot.width=6, repr.plot.height=4)
newfiles_yearly_sum %>%
ggplot(aes(x=Year, y=Files, fill=Year)) +
geom_col()
Creating a function for many different summaries¶
Let’s create a function for this workflow so that we can easily do similar calculations with various different groups.
def aggregate_filesize_data(data, groupings, targets, agg_function):
    # Drop rows with NaNs (invalid years)
    data_relevant = data.dropna(axis=0)
    # Pick relevant columns
    data_relevant = data_relevant.loc[:, groupings + targets]
    # Change grouping to category for prettier plotting
    data_relevant[groupings] = data_relevant[groupings].astype('category')
    # Aggregate data
    data_aggregated = data_relevant.groupby(groupings).agg(agg_function).reset_index()
    return data_aggregated
newfiles_yearly_sum = aggregate_filesize_data(filesizes, ['Year'], ['Files'], 'sum')
newfiles_yearly_sum.head()
Year Files
0 2010.0 5590287
1 2011.0 13197038
2 2012.0 17099900
3 2013.0 14755151
4 2014.0 26329321
aggregate_filesize_data <- function(data, grouping, target, agg_function) {
data_relevant <- data %>%
# Drop rows with NaNs (invalid years)
drop_na() %>%
# Pick relevant columns
select_at(vars(c(grouping, target))) %>%
# Change grouping to category for prettier plotting
mutate_at(vars(grouping), as.factor)
# Aggregate data
data_aggregated <- data_relevant %>%
group_by_at((grouping)) %>%
summarize_at(vars(target), agg_function) %>%
ungroup()
return(data_aggregated)
}
newfiles_yearly_sum <- aggregate_filesize_data(filesizes, c('Year'), c('Files'), sum)
head(newfiles_yearly_sum)
Year Files
2010 5590287
2011 13197038
2012 17099900
2013 14755151
2014 26329321
2015 24896331
Now we can use this function to create the following plots:
Yearly new files
Yearly new file space usage
Monthly new files
Monthly new file space usage
From these we can see the following:
The total number of files and the total space usage are growing non-linearly, since the number of new files and the amount of new space used each year are themselves growing roughly linearly.
July seems to be the month when a lot of new files are created, but it is not the month when the largest files are created. Something strange is definitely happening there.
yearly_sum = aggregate_filesize_data(filesizes, ['Year'], ['Files', 'SpaceUsage'], 'sum')
monthly_sum = aggregate_filesize_data(filesizes, ['Month'], ['Files', 'SpaceUsage'], 'sum')
yearly_sum['Year'] = yearly_sum['Year'].astype(int).astype('category')
print(yearly_sum.head())
print(monthly_sum.head())
fig, ((ax1, ax2, ax3, ax4))=plt.subplots(nrows=4, figsize=(8,16))
sb.barplot(x='Year', y='Files', data=yearly_sum, ci=None, ax=ax1)
sb.barplot(x='Year', y='SpaceUsage', data=yearly_sum, ci=None, ax=ax2)
sb.barplot(x='Month', y='Files', data=monthly_sum, ci=None, ax=ax3)
sb.barplot(x='Month', y='SpaceUsage', data=monthly_sum, ci=None, ax=ax4)
plt.tight_layout()
Year Files SpaceUsage
0 2010 5590287 2260716407068
1 2011 13197038 7000732111463
2 2012 17099900 15475575370580
3 2013 14755151 15445375302767
4 2014 26329321 42530364324322
Month Files SpaceUsage
0 January 34921070 43131219269056
1 February 35707864 71022501061692
2 March 25494722 56516865081262
3 April 31224476 75382094990077
4 May 37816173 75338621861676
yearly_sum <- aggregate_filesize_data(filesizes, c('Year'), c('Files', 'SpaceUsage'), sum)
monthly_sum <- aggregate_filesize_data(filesizes, c('Month'), c('Files', 'SpaceUsage'), sum)
head(yearly_sum)
head(monthly_sum)
print(yearly_sum %>%
ggplot(aes(x=Year, y=Files, fill=Year)) +
geom_col())
print(yearly_sum %>%
ggplot(aes(x=Year, y=SpaceUsage, fill=Year)) +
geom_col())
print(monthly_sum %>%
ggplot(aes(x=Month, y=Files, fill=Month)) +
geom_col())
print(monthly_sum %>%
ggplot(aes(x=Month, y=SpaceUsage, fill=Month)) +
geom_col())
Year Files SpaceUsage
2010 5590287 2,260716e+12
2011 13197038 7,000732e+12
2012 17099900 1,547558e+13
2013 14755151 1,544538e+13
2014 26329321 4,253036e+13
2015 24896331 3,096538e+13
Month Files SpaceUsage
Jan 34921070 4,313122e+13
Feb 35707864 7,102250e+13
Mar 25494722 5,651687e+13
Apr 31224476 7,538209e+13
May 37816173 7,533862e+13
Jun 33804495 7,010947e+13
Curve fitting - Fitting functions to data¶
Determining the growth rate of file space usage¶
A classic data analysis task is fitting a function to data. This is very common in situations where the relationship between two variables can be derived from a mathematical model (e.g. physics) or when the data seems to come from a distribution whose density or cumulative distribution function is easy to write down (e.g. Gaussian).
Let’s consider the cumulative space usage of the file system. We can calculate these values from our previously calculated summaries:
yearly_cumsum = yearly_sum.copy()
yearly_cumsum.loc[:,['Files','SpaceUsage']] = yearly_cumsum[['Files','SpaceUsage']].cumsum()
yearly_cumsum.head()
Year Files SpaceUsage
0 2010 5590287 2260716407068
1 2011 18787325 9261448518531
2 2012 35887225 24737023889111
3 2013 50642376 40182399191878
4 2014 76971697 82712763516200
yearly_cumsum <- yearly_sum %>%
mutate(
Files=cumsum(Files),
SpaceUsage=cumsum(SpaceUsage)
)
head(yearly_cumsum)
Year Files SpaceUsage
2010 5590287 2,260716e+12
2011 18787325 9,261449e+12
2012 35887225 2,473702e+13
2013 50642376 4,018240e+13
2014 76971697 8,271276e+13
2015 101868028 1,136781e+14
Let’s plot the space usage as a bar chart.
fig, ax = plt.subplots(figsize=(8,8))
sb.barplot(x='Year', y='SpaceUsage', data=yearly_cumsum, ci=None, ax=ax)
options(repr.plot.width=6, repr.plot.height=4)
yearly_cumsum %>%
ggplot(aes(x=Year, y=SpaceUsage, fill=Year)) +
geom_col()
Based on this graph the growth looks exponential. If this is the case, taking a logarithm of our data should result in a linear plot. Let’s do just that:
yearly_cumsum['SpaceUsageLog2'] = yearly_cumsum['SpaceUsage'].apply(np.log2)
fig, ax = plt.subplots(figsize=(6,4))
sb.lineplot(x='Year', y='SpaceUsageLog2', data=yearly_cumsum, ci=None, ax=ax)
options(repr.plot.width=6, repr.plot.height=4)
yearly_cumsum <- yearly_cumsum %>%
mutate(SpaceUsageLog2=log2(SpaceUsage))
yearly_cumsum %>%
mutate(Year=as.numeric(as.character(Year))) %>%
ggplot(aes(x=Year, y=SpaceUsageLog2)) +
geom_line()
Here we have used a base-2 logarithm, as it gives us a simple way of interpreting our y-axis: each time the y-axis grows by one, our data doubles. Looking at the plot, the growth in the last seven years (2014-2020) has been steady. Let’s use those years for our fit.
fit_dataset = yearly_cumsum.copy()
fit_dataset['Year'] = fit_dataset['Year'].astype(int)
fit_dataset = fit_dataset[fit_dataset['Year'] > 2013]
fig, ax = plt.subplots(figsize=(6,4))
sb.scatterplot(x='Year', y='SpaceUsageLog2', data=fit_dataset, ci=None, ax=ax)
fit_dataset <- yearly_cumsum %>%
mutate(Year=as.numeric(as.character(Year))) %>%
filter(Year > 2013)
options(repr.plot.width=6, repr.plot.height=4)
fit_dataset %>%
ggplot(aes(x=Year, y=SpaceUsageLog2, color=Year)) +
geom_point()
Fitting a linear model¶
Linear models (or linear regressions) are commonly used in statistics to explain relationships between numerical variables. In our case, since the log-transformed data looks linear, we can also use a linear model as the function we fit to our data.
from sklearn import linear_model
# Set up a linear model
model1 = linear_model.LinearRegression()
# Fit to the linear model. Notice that first argument needs to be 2D.
model1.fit(fit_dataset[['Year']],fit_dataset['SpaceUsageLog2'])
# Print model coefs
print(model1.coef_, model1.intercept_)
[0.52740083] -1015.9746062919537
library(modelr)
library(broom)
options(repr.plot.width=12, repr.plot.height=4)
# Set up a linear model
lm_formula <- formula(SpaceUsageLog2 ~ Year)
# Fit the linear model using the formula interface
model1 <- lm(lm_formula, data=fit_dataset)
# Print model coefs
print(model1)
# Get a tidy version of the coefficients
coefs <- model1 %>%
tidy() %>%
print()
Call:
lm(formula = lm_formula, data = fit_dataset)
Coefficients:
(Intercept) Year
-1015,9746 0,5274
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1016. 34.5 -29.5 0.000000843
2 Year 0.527 0.0171 30.9 0.000000671
Let’s create predictions based on our model and plot the data with our predictions.
# Predict based on our model
fit_dataset['SpaceUsagePredictedLog2'] = model1.predict(fit_dataset[['Year']])
# Plot the data and our model
sb.scatterplot(x='Year', y='SpaceUsageLog2', data=fit_dataset, ci=None)
sb.lineplot(x='Year', y='SpaceUsagePredictedLog2', data=fit_dataset, color='tomato', label='Fit with slope: %f' % model1.coef_)
options(repr.plot.width=12, repr.plot.height=4)
# Predict based on our model
fit_dataset <- fit_dataset %>%
add_predictions(model1, var='SpaceUsagePredictedLog2')
slope <- coefs %>%
filter(term == 'Year') %>%
select(estimate) %>%
as.numeric()
slope_legend <- sprintf('Fit with slope: %.3f',slope)
# Plot the data and our model
fit_dataset %>%
ggplot(aes(x=Year, y=SpaceUsageLog2)) +
geom_point(aes(color='Data')) +
geom_line(aes(y=SpaceUsagePredictedLog2, color=slope_legend)) +
scale_color_discrete(name='')
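Before trying an alternative way of fitting, it is worth interpreting the slope: because the y-axis is the base-2 logarithm of the space usage, a slope of a means the space usage grows by a factor of 2^a per year and doubles roughly every 1/a years. A quick check of this interpretation (assuming model1 fitted in the Python example above):
# Interpret the fitted slope on the log2 scale
slope = model1.coef_[0]
print('Yearly growth factor: %.2f' % (2 ** slope))
print('Approximate doubling time: %.1f years' % (1 / slope))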
We could also have used a simple least-squares curve fit with a function we define ourselves.
from scipy.optimize import curve_fit
# Set up a linear model
def model2(x, a, b):
    return a*x + b
# Fit to the linear model
coefs,_ = curve_fit(model2, fit_dataset['Year'], fit_dataset['SpaceUsageLog2'])
# Print model coefs
print(coefs)
# Predict based on our model
fit_dataset['SpaceUsagePredictedLog2'] = model2(fit_dataset['Year'], *coefs)
# Plot the data and our model
sb.scatterplot(x='Year', y='SpaceUsageLog2', data=fit_dataset, ci=None)
sb.lineplot(x='Year', y='SpaceUsagePredictedLog2', data=fit_dataset, color='tomato', label='Fit with slope: %f' % coefs[0])
[ 5.27400831e-01 -1.01597461e+03]
# Set up a linear model
formula2 <- SpaceUsageLog2 ~ a * Year + b
# Fit to the linear model
model2 <- nls(formula2, data=fit_dataset)
# Predict based on our model
fit_dataset <- fit_dataset %>%
add_predictions(model2, var='SpaceUsagePredictedLog2')
# Get a tidy version of the coefficients
coefs <- model2 %>%
tidy() %>%
print()
# Format legend text
slope <- coefs %>%
filter(term == 'a') %>%
select(estimate) %>%
as.numeric()
slope_legend <- sprintf('Fit with slope: %.3f', slope)
# Plot the data and our model
fit_dataset %>%
ggplot(aes(x=Year, y=SpaceUsageLog2)) +
geom_point(aes(color='Data')) +
geom_line(aes(y=SpaceUsagePredictedLog2, color=slope_legend)) +
scale_color_discrete(name='')
Warning message in nls(formula2, data = fit_dataset):
“No starting values specified for some parameters.
Initializing ‘a’, ‘b’ to '1.'.
Consider specifying 'start' or using a selfStart model”
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 0.527 0.0171 30.9 0.000000671
2 b -1016. 34.5 -29.5 0.000000843
Both of these models produced identical coefficients. Let’s create a function that can do the fitting for us.
fit_dataset = yearly_cumsum.copy()
fit_dataset['Year'] = fit_dataset['Year'].astype(int)
def fit_exponential(dataset, x_column, y_column, fit_range=None):
    fit_dataset = dataset
    # If fit_range is defined, limit fit to that range
    if fit_range is not None:
        fit_dataset = fit_dataset.loc[fit_range, :]
    # Set up a linear model
    def linear_model(x, a, b):
        return a*x + b
    y_log = np.log2(fit_dataset[y_column])
    # Fit to the linear model
    coefs, _ = curve_fit(linear_model, fit_dataset[x_column], y_log)
    # Predict based on our model
    dataset['Predicted'] = 2 ** linear_model(dataset[x_column], *coefs)
    return dataset, coefs
fit_dataset_final, coefs = fit_exponential(fit_dataset, 'Year', 'SpaceUsage', fit_dataset['Year'] > 2013)
sb.scatterplot(x='Year', y='SpaceUsage', data=fit_dataset_final, ci=None)
sb.lineplot(x='Year', y='Predicted', data=fit_dataset_final, color='tomato', label='Fit with exponent: %f' % coefs[0])
fit_exponential <- function(dataset, x_column, y_column, fit_range=NULL) {
fit_dataset <- dataset
# If fit_range is defined, limit fit to that range
if (!is.null(fit_range)) {
fit_dataset <- fit_dataset[fit_range,]
}
# Set up a linear model
linear_formula <- as.formula(paste0(y_column, ' ~ a * ', x_column,' + b'))
fit_dataset[y_column] <- log2(fit_dataset[y_column])
# Fit to the linear model
linear_model <- nls(linear_formula, data=fit_dataset)
# Get a tidy version of the coefficients
coefs <- linear_model %>%
tidy()
# Predict based on our model
return_dataset <- dataset %>%
add_predictions(linear_model, var='Predicted') %>%
mutate(Predicted=2 ^ Predicted)
return(list(data=return_dataset, coefs=coefs))
}
fit_dataset <- yearly_cumsum %>%
mutate(Year=as.numeric(as.character(Year)))
# Fit exponential using our function
output <- fit_exponential(fit_dataset, 'Year', 'SpaceUsage', fit_range=fit_dataset['Year']>2013)
fit_dataset_final <- output$data
coefs <- output$coefs
# Format legend text
slope <- coefs %>%
filter(term == 'a') %>%
select(estimate) %>%
as.numeric()
slope_legend <- sprintf('Fit with exponent: %.3f',slope)
options(repr.plot.width=12, repr.plot.height=4)
# Plot the data and our model
fit_dataset_final %>%
ggplot(aes(x=Year, y=SpaceUsage)) +
geom_point(aes(color='Data')) +
geom_line(aes(y=Predicted, color=slope_legend)) +
scale_color_discrete(name='')
Using bootstrapping/resampling methods for the calculation of statistical moments¶
Quick overview of bootstrapping¶
Bootstrapping methods are commonly used to calculate statistical moments (mean, variance, etc.) from a sample distribution obtained from raw data.
The basic idea of bootstrapping methods is that if you have a sample and you want to estimate e.g. the distribution of the sample mean, you can take lots of resamples from the sample with replacement and calculate the mean of each resample. The distribution of these resampled means then approximates the sampling distribution of the sample mean.
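As a minimal sketch of the idea on a plain made-up numeric sample (before we add the file-count weighting used below):
import numpy as np

rng = np.random.default_rng(seed=0)
# A made-up "observed" sample
sample = rng.normal(loc=10, scale=2, size=200)
# Resample with replacement many times and store the mean of each resample
boot_means = np.array([
    np.mean(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(1000)
])
# The spread of these resampled means approximates the sampling distribution of the mean
print(np.mean(boot_means), np.std(boot_means))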
Let’s use these methods to calculate the mean file size. First, we need to get a grouping based on both Year and BytesLog2.
# Drop rows with NaNs (invalid years)
newfiles_relevant2 = filesizes.dropna(axis=0)
# Pick relevant columns
newfiles_relevant2 = newfiles_relevant2.loc[:,['Year','BytesLog2','Files']]
# Aggregate based on Year and BytesLog2
newfiles_yearly_sum2 = newfiles_relevant2.groupby(['Year','BytesLog2']).agg('sum')
newfiles_yearly_sum2.head()
Files
Year BytesLog2
2010.0 0 124
1 1632
2 5626
3 26287
4 65074
newfiles_relevant2 <- filesizes %>%
# Drop rows with NaNs (invalid years)
drop_na() %>%
# Pick relevant columns
select(Year, BytesLog2, Files) %>%
# Aggregate based on Year and BytesLog2
group_by(Year, BytesLog2) %>%
summarize(Files=sum(Files))
head(newfiles_relevant2)
Year BytesLog2 Files
2010 0 124
2010 1 1632
2010 2 5626
2010 3 26287
2010 4 65074
2010 5 202543
From here we can see that our data is grouped in two different layers: first in terms of Year and then in terms of BytesLog2. Summation is afterwards done for Files.
Now we can notice that, because our function aggregate_filesize_data took its arguments as lists, we can use it to do these aggregations as well. Let’s use it to get our aggregated data and plot the size distribution of new files for the year 2020:
yearly_bytes_sum = aggregate_filesize_data(filesizes, ['Year','BytesLog2'], ['Files', 'SpaceUsage'], 'sum')
bytes_2020 = yearly_bytes_sum[yearly_bytes_sum['Year'] == 2020]
plt.figure(figsize=(12,6))
sb.barplot(x='BytesLog2', y='Files', data=bytes_2020, ci=None)
plt.title(2020)
plt.tight_layout()
yearly_bytes_sum = aggregate_filesize_data(filesizes, c('Year','BytesLog2'), c('Files', 'SpaceUsage'), sum)
bytes_2020 <- yearly_bytes_sum %>%
filter(Year == 2020)
bytes_2020 %>%
ggplot(aes(x=BytesLog2, y=Files, fill=BytesLog2)) +
geom_col() +
theme(legend.position = "none")
Let’s use np.random.choice (Python) / sample (R) for the sampling, because these functions are much faster when we’re creating hundreds to thousands of resamples (the resampling functions of Pandas/Tidyverse are designed for taking a few random samples).
Now we’ll want to pick from all available byte sizes with replacement (each byte size can be picked more than once), and we’ll want to weight the picking probabilities with the distribution of our sample data (new files created in 2020).
# Pick target data column and convert it to integer
target_data = bytes_2020['BytesLog2'].copy().astype('int')
# Pick weight data column
weight_data = bytes_2020['Files'].copy()
# IMPORTANT:
# There might be categories in BytesLog2 that do not contain any data.
# We'll have to fill zeros to those rows of Files.
weight_data.fillna(0, inplace=True)
# Normalize weight_data into probabilities
weight_data = weight_data/weight_data.sum()
print(target_data.head())
print(weight_data.head())
430 0
431 1
432 2
433 3
434 4
Name: BytesLog2, dtype: int64
430 0.000327
431 0.001940
432 0.001471
433 0.007406
434 0.014570
Name: Files, dtype: float64
# Pick target data column and convert it to integer
# IMPORTANT:
# Do notice that we'll have to first convert our target
# into characters as we do not want convert factor ENCODING,
# but the actual decoded DATA
target_data <- as.numeric(as.character(bytes_2020[['BytesLog2']]))
# Pick weight data column
weight_data <- bytes_2020[['Files']]
# Normalize weight_data into probabilities
weight_data <- weight_data/sum(weight_data)
print(head(target_data))
print(head(weight_data))
[1] 0 1 2 3 4 5
[1] 0,0003271367 0,0019404559 0,0014705603 0,0074056601 0,0145700668
[6] 0,0156263905
Now we can create a vector of means and fill it with resampled means. Our estimate of the sample mean is then the mean of this vector. By looking at our plot we can see that this sample mean corresponds well with the peak of the distribution.
# Create means vector
means = np.zeros(10, dtype=np.float64)
for i in range(10):
    # Calculate resampled mean
    means[i] = np.mean(np.random.choice(target_data, 100, replace=True, p=weight_data))
print(means[:10])
print('Estimated sample mean:', np.mean(means))
[13.06 13.27 13.57 13.45 12.7 13.05 13.28 12.94 12.91 14.07]
Estimated sample mean: 13.229999999999999
# Create means vector
means <- numeric(10)
for (i in seq(10)) {
# Calculate resampled mean
means[[i]] <- mean(sample(target_data, 100, replace=TRUE, prob=weight_data))
}
print(means)
print(paste0('Estimated sample mean: ', mean(means)))
[1] 12,19 12,19 12,80 12,90 13,18 13,48 12,97 12,39 13,49 12,57
[1] "Estimated sample mean: 12,816"
Let’s now create a function for this bootstrapping feature:
def get_bootstrapped_means(dataset, target_col, weight_col, n_means=1000):
    # Pick relevant columns
    df = dataset[[target_col, weight_col]].copy()
    # Pick target data column
    target_data = df[target_col]
    # Pick weight data column
    weight_data = df[weight_col]
    # Fill zeros to those byte sizes that are not present in the Files-data
    weight_data.fillna(0, inplace=True)
    # Normalize weight_data into probabilities
    weight_data = weight_data/weight_data.sum()
    # Create means vector
    means = np.zeros(n_means, dtype=np.float64)
    for i in range(n_means):
        # Calculate resampled mean
        means[i] = np.mean(np.random.choice(target_data, 100, replace=True, p=weight_data))
    return means
bootstrapped_means = get_bootstrapped_means(bytes_2020, 'BytesLog2', 'Files', n_means=1000)
print(bootstrapped_means[:10])
print('Estimated sample mean:', np.mean(bootstrapped_means))
[12.59 13.9 13.09 13.34 12.65 13.1 13.16 12.98 13.28 12.74]
Estimated sample mean: 13.225010000000001
get_bootstrapped_means <- function(dataset, target_col, weight_col, n_means=1000) {
# Pick relevant columns
# Pick target data column and convert it to integer
target_data <- as.numeric(as.character(dataset[[target_col]]))
# Pick weight data column
weight_data <- dataset[[weight_col]]
weight_data <- weight_data/sum(weight_data)
# Create means vector
means <- numeric(n_means)
for (i in seq(n_means)) {
# Calculate resampled mean
means[[i]] <- mean(sample(target_data, 100, replace=TRUE, prob=weight_data))
}
return(means)
}
means <- get_bootstrapped_means(bytes_2020, 'BytesLog2', 'Files', n_means=1000)
print(head(means))
print(paste0('Estimated sample mean: ', mean(means)))
[1] 13,47 13,78 13,11 12,75 13,50 12,83
[1] "Estimated sample mean: 13,2317"
Using nested dataframes to help with bootstrapping¶
Models that need multiple columns (or even the full dataset), but that have to be run separately for each group along some column, are usually easier to handle using nested dataframes. When using nested dataframes we split our initial data based on a grouping and store each subgroup as a dataframe inside a single cell; we can then apply some function to each of these dataframes. The result of this function can itself be a dataframe.
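A minimal sketch of the idea with a made-up table (the real version using yearly_bytes_sum follows below): each group’s rows are stored as a whole dataframe inside a single cell.
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'x': [1, 2, 3],
    'y': [4, 5, 6]
})
# One row per group; each 'data' cell holds that group's sub-dataframe
nested = toy.groupby('group').apply(lambda g: pd.Series({'data': g}))
print(nested.loc['a', 'data'])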
This can be visualized in the following way:
Let’s create a nested view of our data (yearly_bytes_sum):
bootstrapped_means = yearly_bytes_sum.groupby('Year').apply(lambda x: pd.Series({'data': x}))
bootstrapped_means.head()
data
Year
2010.0 Year BytesLog2 Files SpaceUsage ...
2011.0 Year BytesLog2 Files SpaceUsage ...
2012.0 Year BytesLog2 Files SpaceUsage...
2013.0 Year BytesLog2 Files SpaceUsage...
2014.0 Year BytesLog2 Files SpaceUsage...
yearly_bytes_sum_nested <- yearly_bytes_sum %>%
group_by(Year) %>%
nest()
glimpse(yearly_bytes_sum_nested)
Observations: 11
Variables: 2
$ Year <fct> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020
$ data <list> [<tbl_df[37 x 3]>, <tbl_df[37 x 3]>, <tbl_df[38 x 3]>, <tbl_df[…
Now we can create bootstrapped means for all of the years with apply/map-statements.
bootstrapped_means['SampledMeans'] = bootstrapped_means['data'].apply(lambda x: get_bootstrapped_means(x, 'BytesLog2', 'Files', n_means=5))
bootstrapped_means.drop('data', axis=1, inplace=True)
bootstrapped_means.head()
SampledMeans
Year
2010.0 [13.07, 13.32, 13.34, 12.34, 13.03]
2011.0 [13.73, 14.56, 14.2, 13.89, 13.71]
2012.0 [9.9, 9.73, 10.42, 10.63, 10.59]
2013.0 [13.52, 13.35, 13.14, 13.22, 14.02]
2014.0 [14.01, 14.49, 13.68, 14.05, 13.72]
bootstrapped_means <- yearly_bytes_sum_nested %>%
mutate(SampledMeans=map(data, function(x) get_bootstrapped_means(x, 'BytesLog2', 'Files', n_means=5))) %>%
select(-data)
print(glimpse(bootstrapped_means))
head(bootstrapped_means,1)
Observations: 11
Variables: 2
$ Year <fct> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 20…
$ SampledMeans <list> [<12,94, 13,12, 12,74, 13,03, 13,21>, <14,33, 14,01, 14…
# A tibble: 11 x 2
Year SampledMeans
<fct> <list>
1 2010 <dbl [5]>
2 2011 <dbl [5]>
3 2012 <dbl [5]>
4 2013 <dbl [5]>
5 2014 <dbl [5]>
6 2015 <dbl [5]>
7 2016 <dbl [5]>
8 2017 <dbl [5]>
9 2018 <dbl [5]>
10 2019 <dbl [5]>
11 2020 <dbl [5]>
Year SampledMeans
2010 12,94, 13,12, 12,74, 13,03, 13,21
Now we can calculate means for each of these bootstrapped means:
bootstrapped_means['Mean'] = bootstrapped_means['SampledMeans'].apply(np.mean)
bootstrapped_means.head()
SampledMeans Mean
Year
2010.0 [13.07, 13.32, 13.34, 12.34, 13.03] 13.020
2011.0 [13.73, 14.56, 14.2, 13.89, 13.71] 14.018
2012.0 [9.9, 9.73, 10.42, 10.63, 10.59] 10.254
2013.0 [13.52, 13.35, 13.14, 13.22, 14.02] 13.450
2014.0 [14.01, 14.49, 13.68, 14.05, 13.72] 13.990
bootstrapped_means <- bootstrapped_means %>%
mutate(Mean=map(SampledMeans, mean))
head(bootstrapped_means, 1)
Year SampledMeans Mean
2010 12,94, 13,12, 12,74, 13,03, 13,21 13,008
Let’s create a function for this procedure so that we can run it for multiple different columns:
def bootstrap_byteslog2_mean(dataset, group_variable, target_variable, n_means=1000):
    bootstrapping_function = lambda x: get_bootstrapped_means(x, 'BytesLog2', target_variable, n_means=n_means)
    bootstrapped_means = dataset.groupby(group_variable).apply(lambda x: pd.Series({'data': x}))
    bootstrapped_means['SampledMeans'] = bootstrapped_means['data'].apply(bootstrapping_function)
    bootstrapped_means['Mean'] = bootstrapped_means['SampledMeans'].apply(np.mean)
    bootstrapped_means.drop('data', axis=1, inplace=True)
    return bootstrapped_means
bootstrapped_yearly_means = bootstrap_byteslog2_mean(yearly_bytes_sum, 'Year', 'Files', n_means=1000)
bootstrapped_yearly_means.head()
SampledMeans Mean
Year
2010.0 [12.55, 12.83, 14.01, 12.28, 12.86, 13.04, 13.... 12.97764
2011.0 [14.27, 14.27, 13.89, 13.97, 13.81, 13.76, 13.... 14.05083
2012.0 [10.69, 10.88, 11.12, 10.09, 11.15, 10.84, 10.... 10.65042
2013.0 [13.39, 13.01, 13.22, 13.61, 12.81, 12.82, 13.... 13.39958
2014.0 [14.04, 13.96, 14.5, 13.21, 13.64, 14.68, 14.3... 14.03843
bootstrap_byteslog2_mean <- function(dataset, group_variable, target_variable, n_means=1000) {
bootstrapping_function <- function(x) get_bootstrapped_means(x, 'BytesLog2', target_variable, n_means=n_means)
bootstrapped_means <- dataset %>%
group_by_at(vars(group_variable)) %>%
nest() %>%
mutate(
SampledMeans=map(data, bootstrapping_function),
Mean=map(SampledMeans, mean)) %>%
select(-data)
return(bootstrapped_means)
}
bootstrapped_yearly_means = bootstrap_byteslog2_mean(yearly_bytes_sum, 'Year', 'Files', n_means=1000)
glimpse(bootstrapped_yearly_means)
Observations: 11
Variables: 3
$ Year <fct> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 20…
$ SampledMeans <list> [<tbl_df[1000 x 1]>, <tbl_df[1000 x 1]>, <tbl_df[1000 x…
$ Mean <list> [12,97675, 14,0463, 10,65216, 13,38878, 14,04548, 11,73…
For plotting we can unstack the SampledMeans column so that each sampled mean is on its own row.
bootstrapped_yearly_means_distribution = bootstrapped_yearly_means.drop('Mean', axis=1).explode('SampledMeans').reset_index()
bootstrapped_yearly_means_distribution.head()
Year SampledMeans
0 2010.0 12.55
1 2010.0 12.83
2 2010.0 14.01
3 2010.0 12.28
4 2010.0 12.86
bootstrapped_yearly_means_distribution <- bootstrapped_yearly_means %>%
select(-Mean) %>%
unnest()
head(bootstrapped_yearly_means_distribution)
Year SampledMeans
2010 12,93
2010 12,80
2010 13,20
2010 12,53
2010 12,95
2010 13,49
Now we can plot distributions for the data and for the sample mean.
for (year, real_data), (year_sampled, bootstrapped_data) in zip(yearly_bytes_sum.groupby('Year'), bootstrapped_yearly_means_distribution.groupby('Year')):
    figure, (ax1, ax2) = plt.subplots(1,2,figsize=(16,3))
    figure.suptitle(int(year))
    sb.barplot(x='BytesLog2', y='Files', data=real_data, ci=None, ax=ax1)
    sb.histplot(x='SampledMeans', binwidth=0.5, data=bootstrapped_data, ax=ax2)
    plt.xlim(left=min(yearly_bytes_sum['BytesLog2']), right=max(yearly_bytes_sum['BytesLog2']))
    plt.tight_layout()
options(repr.plot.width=8, repr.plot.height=16)
x_limits <- range(as.numeric(levels(yearly_bytes_sum[['BytesLog2']])))
yearly_bytes_sum %>%
ggplot(aes(x=as.factor(BytesLog2), y=Files, fill=Year)) +
geom_bar(stat='identity') +
ylab('N') +
xlab('Bytes (log2)') +
ggtitle('Yearly files') +
facet_grid(rows=vars(Year))
bootstrapped_yearly_means_distribution %>%
ggplot(aes(x=SampledMeans, fill=Year)) +
geom_histogram(binwidth=0.1) +
ylab('Number of bootstrapped means') +
xlab('Mean of Bytes (log2)') +
xlim(x_limits) +
ggtitle('Distribution of means') +
facet_grid(rows=vars(Year))
Let’s use our new functions for monthly data as well:
monthly_bytes_sum = aggregate_filesize_data(filesizes, ['Month','BytesLog2'], ['Files', 'SpaceUsage'], 'sum')
bootstrapped_monthly_means = bootstrap_byteslog2_mean(monthly_bytes_sum, 'Month', 'Files', n_means=1000)
bootstrapped_monthly_means_distribution = bootstrapped_monthly_means.drop('Mean', axis=1).explode('SampledMeans').reset_index()
bootstrapped_monthly_means_distribution.head()
for (month, real_data), (month_sampled, bootstrapped_data) in zip(monthly_bytes_sum.groupby('Month'), bootstrapped_monthly_means_distribution.groupby('Month')):
    figure, (ax1, ax2) = plt.subplots(1,2,figsize=(16,3))
    figure.suptitle(month)
    sb.barplot(x='BytesLog2', y='Files', data=real_data, ci=None, ax=ax1)
    sb.histplot(x='SampledMeans', binwidth=0.5, data=bootstrapped_data, ax=ax2)
    plt.xlim(left=min(yearly_bytes_sum['BytesLog2']), right=max(yearly_bytes_sum['BytesLog2']))
    plt.tight_layout()
monthly_bytes_sum <- aggregate_filesize_data(filesizes, c('Month','BytesLog2'), c('Files', 'SpaceUsage'), sum)
bootstrapped_monthly_means = bootstrap_byteslog2_mean(monthly_bytes_sum, 'Month', 'Files', n_means=1000)
bootstrapped_monthly_means_distribution <- bootstrapped_monthly_means %>%
select(-Mean) %>%
unnest()
options(repr.plot.width=10, repr.plot.height=16)
x_limits <- range(as.numeric(levels(monthly_bytes_sum[['BytesLog2']])))
monthly_bytes_sum %>%
ggplot(aes(x=as.factor(BytesLog2), y=Files, fill=Month)) +
geom_bar(stat='identity') +
ylab('N') +
xlab('Bytes (log2)') +
ggtitle('Monthly files') +
facet_grid(rows=vars(Month))
bootstrapped_monthly_means_distribution %>%
ggplot(aes(x=SampledMeans, fill=Month)) +
geom_histogram(binwidth=0.1) +
ylab('Number of bootstrapped means') +
xlab('Mean of Bytes (log2)') +
xlim(x_limits) +
ggtitle('Distribution of means') +
facet_grid(rows=vars(Month))