Forward and backward fill data using window functions
If you have gaps in your data and need to fill them in using a simple forward fill (given an order, null values are replaced by the value preceding) or backward fill (given an order, null values are replaced by the value following), then you can do this in Ibis:
ffill
# Create a window that orders your series, default ascending
= ibis.window(order_by=data.measured_on, following=0)
win # Create a grouping that is a rolling count of non-null values
# This creates a partition where each set has no more than one non-null value
= data.mutate(grouper=data.measurement.count().over(win))
grouped # Group by your newly-created grouping and, in each set,
# set all values to the one non-null value in that set (if it exists)
= (
result
grouped
.group_by([grouped.grouper])=grouped.measurement.max())
.mutate(ffill
)# execute to get a pandas dataframe, sort values in case your backend shuffles
=["measured_on"]) result.execute().sort_values(by
bfill
# Create a window that orders your series (use ibis.desc to get descending order)
= ibis.window(order_by=ibis.desc(data.measured_on), following=0)
win # Create a grouping that is a rolling count of non-null values
# This creates a partition where each set has no more than one non-null value
= data.mutate(grouper=data.measurement.count().over(win))
grouped # Group by your newly-created grouping and, in each set,
# set all values to the one non-null value in that set (if it exists)
= (
result
grouped
.group_by([grouped.grouper])=grouped.measurement.max())
.mutate(ffill
)# execute to get a pandas dataframe, sort values in case your backend shuffles
=["measured_on"]) result.execute().sort_values(by
If you have an event partition, which means there’s another segment you need to consider for your ffill or bfill operations, you can do this as well:
ffill
with event partition
# Group your data by your event partition and then order your series (default ascending)
= ibis.window(group_by=data.event_id, order_by=data.measured_on, following=0)
win # Create a grouping that is a rolling count of non-null values within each event
# This creates a partition where each set has no more than one non-null value
= data.mutate(grouper=data.measurement.count().over(win))
grouped # Group by your newly-created grouping and, in each set,
# set all values to the one non-null value in that set (if it exists)
= (
result
grouped
.group_by([grouped.event_id, grouped.grouper])=grouped.measurement.max())
.mutate(ffill
)# execute to get a pandas dataframe, sort values in case your backend shuffles
=["event_id", "measured_on"]) result.execute().sort_values(by
bfill
with event partition
# Group your data by your event partition and then order your series (use ibis.desc for desc)
= ibis.window(group_by=data.event_id, order_by=ibis.desc(data.measured_on), following=0)
win # Create a grouping that is a rolling count of non-null values within each event
# This creates a partition where each set has no more than one non-null value
= data.mutate(grouper=data.measurement.count().over(win))
grouped # Group by your newly-created grouping and, in each set,
# set all values to the one non-null value in that set (if it exists)
= (
result
grouped
.group_by([grouped.event_id, grouped.grouper])=grouped.measurement.max())
.mutate(ffill
)# execute to get a pandas dataframe, sort values in case your backend shuffles
=["event_id", "measured_on"]) result.execute().sort_values(by
We wrote a deeper dive into how this works on the Ibis blog.