Suppose you have a table of data mapping events and dates to values, and that this data contains gaps in values.
Suppose you want to forward fill these gaps such that, one-by-one, if a value is null, it is replaced by the non-null value preceding.
For example, you might be measuring the total value of an account over time. Saving the same value until that value changes is an inefficient use of space, so you might only measure the value during certain events, like a change in ownership or value.
In that case, to view the value of the account by day, you might want to interpolate dates and then ffill or bfill value to show the account value over time by date.
Date interpolation will be covered in a different guide, but if you already have the dates then you can fill in some values.
This was heavily inspired by Gil Forsyth’s writeup on ffill and bfill on the Ibis GitHub Wiki.
Setup
First, we want to make some mock data. To demonstrate this technique in a non-pandas backend, we will use the DuckDB backend.
Our data will have measurements by date, and these measurements will be grouped by an event id. We will then save this data to data.parquet so we can register that parquet file as a table in our DuckDB connector.
DatabaseTable: data
event_id int64
measured_on date
measurement float64
ffill Strategy
To better understand how we can forward-fill our gaps, let’s take a minute to explain the strategy and then look at the manual result.
We will partition our data by event groups and then sort those groups by date.
Our logic for forward fill is then: let j be an event group sorted by date and let i be a date within j. If i is the first date in j, then continue. If i is not the first date in j, then if measurement in i is null then replace it with measurement for i-1. Otherwise, do nothing.
Let’s take a look at what this means for the first few rows of our data:
event_id measured_on measurement
0 0 2021-06-01 NaN # Since this is the first row of the event group (group 0), do nothing
1 0 2021-06-02 5.0 # Since this is not the first row of the group and is not null: do nothing
4 1 2021-05-05 42.0 # This is the first row of the event group (group 1): do nothing
2 1 2021-06-03 NaN # This is not the first row and is null: replace it (NaN → 42.0)
3 1 2021-06-04 NaN # This is not the first row and is null: replace it (NaN → 42.0)
5 2 2021-05-06 42.0 # This is the first row of the event group (group 2): do nothing
6 2 2021-05-07 NaN # This is not the first row and is null: replace it (NaN → 42.0)
7 2 2021-05-08 11.0 # This is not the first row and is not null: do nothing
8 2 2021-05-09 NaN # This is not the first row and is null: replace it (NaN → 11.0)
9 2 2021-05-10 NaN # This is not the first row and is null: replace it (NaN → 11.0)
10 3 2021-07-11 NaN # This is the first row of the event group (group 3): do nothing
11 3 2021-07-12 NaN # This is not the first row and is null: replace it (NaN → NaN)
Our result should for forward fill should look like this:
event_id
measured_on
measurement
0
0
2021-06-01
NaN
1
0
2021-06-02
5.0
2
1
2021-06-03
5.0
3
1
2021-06-04
5.0
4
1
2021-05-05
42.0
5
2
2021-05-06
42.0
6
2
2021-05-07
42.0
7
2
2021-05-08
11.0
8
2
2021-05-09
11.0
9
2
2021-05-10
11.0
10
3
2021-07-11
11.0
11
3
2021-07-12
11.0
To accomplish this, we will create a window over our event_id to partition our data into groups. We will take these groups and order them by measured_on:
Once we have our window defined, we can flag the first non-null value in an event group using count, as it will count non-null values row-by-row within our group:
To see this a bit clearer: look at rows 0, 1, and 2. Row 0 is NaN and is the first row of the group (event_id = 0), so at row 0 we have 0 non-null values (grouper = 0). Row 1 is not null (5.0) and is the second row the group, so our count has increased by 1 (grouper = 1). Row 2 is the first row of its group (event_id = 1) and is not null, so our count is 1 (grouper = 1).
Skip down to rows 9, 10, and 11. Row 9 is the sixth row of group 2 and there are three non-null values in group 2 before row 9. Therefore the count at row 9 is 3.
Row 10 is the first row of group 3 and is null, therefore its count is 0. Finally: row 11 is the second row of group 3 and is null as well, therefore the count remains 0.
Under this design, we now have another partition.
Our first partition is by event_id. Within each set in that partition, we have a partition by grouper, where each set has up to one non-null value.
Since there less than or equal to one non-null value in each group of ['event_id', 'grouper'], we can fill values by overwriting all values within the group by the max value in the group.
So:
Group by event_id and grouper
Mutate the data along that grouping by populating a new column ffill with the max value of measurement.
result = ( grouped .group_by([grouped.event_id, grouped.grouper]) .mutate(ffill=grouped.measurement.max()) .execute()).sort_values(by=['event_id', 'measured_on']).reset_index(drop=True)result
event_id
measured_on
measurement
grouper
ffill
0
0
2021-06-01
NaN
0
NaN
1
0
2021-06-02
5.0
1
5.0
2
1
2021-05-05
42.0
1
42.0
3
1
2021-06-03
NaN
1
42.0
4
1
2021-06-04
NaN
1
42.0
5
2
2021-05-06
42.0
1
42.0
6
2
2021-05-07
NaN
1
42.0
7
2
2021-05-08
11.0
2
11.0
8
2
2021-05-09
NaN
2
11.0
9
2
2021-05-10
NaN
2
11.0
10
3
2021-07-11
NaN
0
NaN
11
3
2021-07-12
NaN
0
NaN
bfill Strategy
Instead of sorting the dates ascending, we will sort them descending. This is akin to starting at the last row in an event group and going backwards using the same logic outlined above.
Let’s take a look:
event_id measured_on measurement grouper
0 0 2021-06-01 NaN 1 # null, take the previous row value (NaN → 5.0)
1 0 2021-06-02 5.0 1 # last row, do nothing
2 1 2021-05-05 42.0 1 # not null, do nothing
3 1 2021-06-03 NaN 0 # null, take previous row value (NaN → NaN)
4 1 2021-06-04 NaN 0 # last row, do nothing
5 2 2021-05-06 42.0 2 # not null, do nothing
6 2 2021-05-07 NaN 1 # null, take previous row value (NaN → 11.0)
7 2 2021-05-08 11.0 1 # not null, do nothing
8 2 2021-05-09 NaN 0 # null, take previous row value (NaN → NaN)
9 2 2021-05-10 NaN 0 # not null, do nothing
10 3 2021-07-11 NaN 0 # null, take previous row value (NaN → NaN)
11 3 2021-07-12 NaN 0 # last row, do nothing
Codewise, bfill follows the same strategy as ffill, we need to specify order_by to use ibis.desc. This will flip our dates and our counts (therefore our groupers) will start backwards.
And, again, if we take max of our grouper value, we will get the only non-null value if it exists:
result = ( grouped .group_by([grouped.event_id, grouped.grouper]) .mutate(bfill=grouped.measurement.max()) .execute()).sort_values(by=['event_id', 'measured_on']).reset_index(drop=True)result
event_id
measured_on
measurement
grouper
bfill
0
0
2021-06-01
NaN
1
5.0
1
0
2021-06-02
5.0
1
5.0
2
1
2021-05-05
42.0
1
42.0
3
1
2021-06-03
NaN
0
NaN
4
1
2021-06-04
NaN
0
NaN
5
2
2021-05-06
42.0
2
42.0
6
2
2021-05-07
NaN
1
11.0
7
2
2021-05-08
11.0
1
11.0
8
2
2021-05-09
NaN
0
NaN
9
2
2021-05-10
NaN
0
NaN
10
3
2021-07-11
NaN
0
NaN
11
3
2021-07-12
NaN
0
NaN
bfill and ffill without Event Groups
You can bfill and ffill without event groups by ignoring that grouping. Remove all references of event_id and you can treat the entire dataset as one event.
Your window function will increment whenever a new non-null value is observed, creating that partition where each set has up to one non-null value.