Analysis of World of Warcraft data

blog
data engineering
duckdb
Author

Tyler White

Published

February 29, 2024

Introduction

I grew up playing games, and with the recent re-release of World of Warcraft Classic, it seems like a perfect time to analyze some in-game data!

This dataset is the product of a Horde player’s diligent recording throughout 2008, capturing the transitional phase between the Burning Crusade and Wrath of the Lich King expansions. Notably, starting November 13, 2008, the data showcases numerous characters venturing into new territories and advancing beyond the former level cap of 70.

Analysis

We’ll determine who logged in the most, who leveled from 70 to 80 the fastest, and what activities these players engaged in, based on the zones they visited. Let’s get to work.

Getting started

Ibis ships with an examples module, which includes this specific data. We’ll use DuckDB, the default Ibis backend, so it’s easy to get started; the same analysis works on other backends, and we encourage you to experiment.

You can execute pip install 'ibis-framework[duckdb,examples]' to work with Ibis and the example data.

from ibis.interactive import *

wowah_data = ex.wowah_data_raw.fetch()

wowah_data
┏━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ char  ┃ level ┃ race   ┃ charclass ┃ zone                   ┃ guild ┃ timestamp           ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int32 │ int32 │ string │ string    │ string                 │ int32 │ timestamp(6)        │
├───────┼───────┼────────┼───────────┼────────────────────────┼───────┼─────────────────────┤
│ 59425 │     1 │ Orc    │ Rogue     │ Orgrimmar              │   165 │ 2008-01-01 00:02:04 │
│ 65494 │     9 │ Orc    │ Hunter    │ Durotar                │    -1 │ 2008-01-01 00:02:04 │
│ 65325 │    14 │ Orc    │ Warrior   │ Ghostlands             │    -1 │ 2008-01-01 00:02:04 │
│ 65490 │    18 │ Orc    │ Hunter    │ Ghostlands             │    -1 │ 2008-01-01 00:02:04 │
│  2288 │    60 │ Orc    │ Hunter    │ Hellfire Peninsula     │    -1 │ 2008-01-01 00:02:09 │
│  2289 │    60 │ Orc    │ Hunter    │ Hellfire Peninsula     │    -1 │ 2008-01-01 00:02:09 │
│ 61239 │    68 │ Orc    │ Hunter    │ Blade's Edge Mountains │   243 │ 2008-01-01 00:02:14 │
│ 59772 │    69 │ Orc    │ Warrior   │ Shadowmoon Valley      │    35 │ 2008-01-01 00:02:14 │
│ 22937 │    69 │ Orc    │ Rogue     │ Warsong Gulch          │   243 │ 2008-01-01 00:02:14 │
│ 23062 │    69 │ Orc    │ Shaman    │ Shattrath City         │   103 │ 2008-01-01 00:02:14 │
│     … │     … │ …      │ …         │ …                      │     … │ …                   │
└───────┴───────┴────────┴───────────┴────────────────────────┴───────┴─────────────────────┘

Getting table info

Let’s learn more about these fields. Are there any nulls we should consider? We can use the info method on our Ibis expression.

wowah_data.info()
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━┓
┃ name      ┃ type         ┃ nullable ┃ nulls ┃ non_nulls ┃ null_frac ┃ pos  ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━┩
│ string    │ string       │ boolean  │ int64 │ int64     │ float64   │ int8 │
├───────────┼──────────────┼──────────┼───────┼───────────┼───────────┼──────┤
│ char      │ int32        │ True     │     0 │  10826734 │       0.0 │    0 │
│ level     │ int32        │ True     │     0 │  10826734 │       0.0 │    1 │
│ race      │ string       │ True     │     0 │  10826734 │       0.0 │    2 │
│ charclass │ string       │ True     │     0 │  10826734 │       0.0 │    3 │
│ zone      │ string       │ True     │     0 │  10826734 │       0.0 │    4 │
│ guild     │ int32        │ True     │     0 │  10826734 │       0.0 │    5 │
│ timestamp │ timestamp(6) │ True     │     0 │  10826734 │       0.0 │    6 │
└───────────┴──────────────┴──────────┴───────┴───────────┴───────────┴──────┘

We can also use value_counts on specific columns if we want to learn more.

wowah_data.race.value_counts()
┏━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ race      ┃ race_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ string    │ int64      │
├───────────┼────────────┤
│ Undead    │    2530156 │
│ Orc       │     933056 │
│ Troll     │    1102409 │
│ Blood Elf │    3929995 │
│ Tauren    │    2331118 │
└───────────┴────────────┘

We don’t have any missing values, and the value_counts results match what I would expect.

How about duplicates? We can check the count of unique rows against the total count.

print(wowah_data.count())
print(wowah_data.nunique())
print(wowah_data.count() == wowah_data.nunique())

10826734
10823177
False
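The comparison above is the classic duplicate check: if the distinct row count is smaller than the total row count, some row appears more than once. A plain-Python sketch of the same idea over a small list of row tuples (toy values, not the actual dataset):

```python
# Toy illustration of the count() vs. nunique() duplicate check.
rows = [
    (59425, 1, "Orc", "Rogue"),
    (2288, 60, "Orc", "Hunter"),
    (2288, 60, "Orc", "Hunter"),  # an exact duplicate
]

total = len(rows)
distinct = len(set(rows))  # tuples are hashable, so set() deduplicates

print(total, distinct, total == distinct)  # → 3 2 False
```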


So we have some duplicates. What could the duplicate rows be?

We can find them like this.

wowah_duplicates = wowah_data.mutate(
    row_num=ibis.row_number().over(
        ibis.window(group_by=wowah_data.columns, order_by=_.timestamp)
    )
).filter(_.row_num > 0)

wowah_duplicates
┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ char  ┃ level ┃ race      ┃ charclass ┃ zone               ┃ guild ┃ timestamp           ┃ row_num ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ int32 │ int32 │ string    │ string    │ string             │ int32 │ timestamp(6)        │ int64   │
├───────┼───────┼───────────┼───────────┼────────────────────┼───────┼─────────────────────┼─────────┤
│   341 │    70 │ Undead    │ Rogue     │ Terokkar Forest    │   204 │ 2008-06-05 01:57:37 │       1 │
│   980 │    70 │ Orc       │ Hunter    │ Isle of Quel'Danas │    79 │ 2008-10-25 05:05:03 │       1 │
│  1321 │    70 │ Tauren    │ Druid     │ Isle of Quel'Danas │     4 │ 2008-04-17 15:38:46 │       1 │
│  2866 │    70 │ Undead    │ Priest    │ Nagrand            │   103 │ 2008-10-23 21:01:32 │       1 │
│  4318 │    70 │ Undead    │ Warrior   │ Karazhan           │    19 │ 2008-07-12 18:05:06 │       1 │
│ 11316 │    70 │ Undead    │ Mage      │ Alterac Valley     │   104 │ 2008-03-22 00:30:48 │       1 │
│ 17774 │    70 │ Blood Elf │ Hunter    │ Shattrath City     │   271 │ 2008-07-12 18:06:12 │       1 │
│ 19598 │    70 │ Tauren    │ Shaman    │ The Mechanar       │   101 │ 2008-06-05 11:10:57 │       1 │
│ 21828 │    70 │ Tauren    │ Shaman    │ Durotar            │    -1 │ 2008-10-23 19:10:31 │       1 │
│ 22484 │    70 │ Tauren    │ Hunter    │ Warsong Gulch      │   315 │ 2008-07-12 18:03:23 │       1 │
│     … │     … │ …         │ …         │ …                  │     … │ …                   │       … │
└───────┴───────┴───────────┴───────────┴────────────────────┴───────┴─────────────────────┴─────────┘

I suspect this data was captured by a single player spamming “/who” in the game, most likely using an AddOn, about every ten minutes. Some players could have been captured twice, depending on how the command was being filtered.

We can go ahead and remove these duplicates.

wowah_data = wowah_data.distinct()
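For reference, distinct() keeps one copy of each unique row. In plain Python the same effect, preserving first-seen order, can be sketched with dict.fromkeys (toy rows, not the real data):

```python
# Sketch of what distinct() does: dict.fromkeys keeps the first
# occurrence of each row tuple while preserving order.
rows = [
    (341, 70, "Undead", "Rogue"),
    (980, 70, "Orc", "Hunter"),
    (341, 70, "Undead", "Rogue"),  # duplicate of the first row
]

deduped = list(dict.fromkeys(rows))
print(deduped)
# → [(341, 70, 'Undead', 'Rogue'), (980, 70, 'Orc', 'Hunter')]
```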

Which player logged in the most?

We mentioned that there was a single player likely capturing these results. Let’s find out who that is.

(
    wowah_data
    .group_by([_.char, _.race, _.charclass])
    .agg(sessions=_.count())
    .order_by(_.sessions.desc())
)
┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ char  ┃ race      ┃ charclass ┃ sessions ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ int32 │ string    │ string    │ int64    │
├───────┼───────────┼───────────┼──────────┤
│   182 │ Troll     │ Hunter    │    42770 │
│ 57741 │ Undead    │ Warlock   │    16237 │
│  1384 │ Undead    │ Warlock   │    15878 │
│ 59489 │ Blood Elf │ Priest    │    13977 │
│ 62239 │ Undead    │ Mage      │    13776 │
│ 62446 │ Blood Elf │ Mage      │    13011 │
│ 31184 │ Undead    │ Rogue     │    12019 │
│ 24126 │ Blood Elf │ Warlock   │    11791 │
│ 61105 │ Blood Elf │ Priest    │    11731 │
│ 35072 │ Blood Elf │ Paladin   │    11399 │
│     … │ …         │ …         │        … │
└───────┴───────────┴───────────┴──────────┘

That Troll Hunter who never made it past level 1 is likely our data collector, with 42,770 sessions.

Who leveled the fastest from 70–80?

At the end of the year, there were 884 level 80s. Who leveled the fastest?

Finding this answer will involve filtering, grouping, and aggregating to compute each character’s time taken to level from 70 to 80.

Let’s start by creating an expression that filters down to the level 80 characters, then joining it back to keep only the rows where those characters were level 70 or 80. We’re only concerned with three columns, so we’ll select just those.

max_level_chars = wowah_data.filter(_.level == 80).select(_.char).distinct()
wowah_data_filtered = (
    wowah_data
    .join(max_level_chars, "char", how="inner")
    .filter(_.level.isin([70, 80]))
    .select(_.char, _.level, _.timestamp)
)
wowah_data_filtered
┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ char  ┃ level ┃ timestamp           ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int32 │ int32 │ timestamp(6)        │
├───────┼───────┼─────────────────────┤
│ 62226 │    70 │ 2008-01-09 22:00:20 │
│ 21828 │    70 │ 2008-01-09 22:00:30 │
│ 62763 │    70 │ 2008-01-09 22:01:06 │
│ 27547 │    70 │ 2008-01-09 22:01:51 │
│  5730 │    70 │ 2008-01-09 22:02:07 │
│ 34216 │    70 │ 2008-01-09 22:09:27 │
│ 40951 │    70 │ 2008-01-09 22:09:32 │
│ 45552 │    70 │ 2008-01-09 22:10:08 │
│ 19481 │    70 │ 2008-01-09 22:10:23 │
│ 19085 │    70 │ 2008-01-09 22:10:23 │
│     … │     … │ …                   │
└───────┴───────┴─────────────────────┘
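The inner join against max_level_chars acts like a semi-join here: it only filters rows, it adds no columns. A minimal pure-Python sketch of the same logic, with hypothetical rows of (char, level, timestamp):

```python
# Toy rows of (char, level, timestamp).
rows = [
    (1, 70, "2008-11-18 02:46:37"),
    (1, 80, "2008-11-21 11:45:35"),
    (2, 70, "2008-01-09 22:00:20"),  # never reached 80, dropped
    (1, 75, "2008-11-19 12:00:00"),  # in-between level, dropped
]

# Characters that hit the new level cap.
max_level_chars = {char for char, level, _ in rows if level == 80}

# Keep only those characters' rows at exactly level 70 or 80.
filtered = [
    (char, level, ts)
    for char, level, ts in rows
    if char in max_level_chars and level in (70, 80)
]
print(filtered)
```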

Let’s use the where option to help with the aggregation.

level_calc = (
    wowah_data_filtered.group_by(["char"])
    .mutate(
        ts_70=_.timestamp.max(where=_.level == 70),
        ts_80=_.timestamp.min(where=_.level == 80),
    )
    .drop(["level", "timestamp"])
    .distinct()
    .mutate(days_from_70_to_80=(_.ts_80.delta(_.ts_70, "day")))
    .order_by(_.days_from_70_to_80)
)

The data is filtered and grouped by character, and two new columns are created to represent timestamps for levels 70 and 80. Then we drop what we no longer need, get the distinct values, and calculate the time taken to level from 70 to 80. Then we sort it!
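The where= arguments give a conditional max and min per group, and delta counts unit boundaries between the two timestamps. A stdlib sketch for a single character, with illustrative timestamps rather than real ones:

```python
from datetime import datetime

# One character's sightings as (level, timestamp) pairs.
sightings = [
    (70, datetime(2008, 11, 27, 1, 58, 13)),
    (70, datetime(2008, 11, 28, 9, 30, 0)),
    (80, datetime(2008, 11, 30, 16, 19, 4)),
]

# Last sighting at level 70 and first sighting at level 80 -- the
# stdlib analogue of max(where=...) and min(where=...).
ts_70 = max(ts for level, ts in sightings if level == 70)
ts_80 = min(ts for level, ts in sightings if level == 80)

# Count calendar-day boundaries crossed between the two timestamps,
# consistent with the days_from_70_to_80 values shown above.
days = (ts_80.date() - ts_70.date()).days
print(days)  # → 2
```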

level_calc
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ char  ┃ ts_70               ┃ ts_80               ┃ days_from_70_to_80 ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ int32 │ timestamp(6)        │ timestamp(6)        │ int64              │
├───────┼─────────────────────┼─────────────────────┼────────────────────┤
│ 68544 │ 2008-11-27 01:58:13 │ 2008-11-30 16:19:04 │                  3 │
│  1450 │ 2008-11-18 02:46:37 │ 2008-11-21 11:45:35 │                  3 │
│   399 │ 2008-11-18 00:02:40 │ 2008-11-22 20:23:57 │                  4 │
│ 86264 │ 2008-11-20 23:09:17 │ 2008-11-24 16:35:46 │                  4 │
│ 51738 │ 2008-12-03 20:10:00 │ 2008-12-07 11:26:08 │                  4 │
│  1003 │ 2008-11-18 00:29:25 │ 2008-11-22 23:47:06 │                  4 │
│ 40483 │ 2008-11-17 22:35:42 │ 2008-11-21 15:31:59 │                  4 │
│ 88331 │ 2008-12-12 01:47:57 │ 2008-12-16 10:19:23 │                  4 │
│ 86396 │ 2008-12-03 05:04:42 │ 2008-12-07 17:04:43 │                  4 │
│ 86265 │ 2008-11-20 23:28:40 │ 2008-11-24 07:47:08 │                  4 │
│     … │ …                   │ …                   │                  … │
└───────┴─────────────────────┴─────────────────────┴────────────────────┘

This isn’t perfect. I found a player who seems to have quit in March and then returned for the new expansion; it looks like they hit 71 before a later sighting at level 70 was recorded, which skews their calculated time. If you’re curious, take a look at char=21951 for yourself.

How did they level?

Let’s grab all the details from the previous result and join it back to get the timestamp and zone data.

leveler_zones = (
    level_calc.join(wowah_data, "char", how="inner")
    .filter(_.timestamp.between(_.ts_70, _.ts_80))
    .group_by([_.char, _.zone])
    .agg(zone_count=_.zone.count())
)
leveler_zones
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ char  ┃ zone              ┃ zone_count ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int32 │ string            │ int64      │
├───────┼───────────────────┼────────────┤
│ 40309 │ Borean Tundra     │         70 │
│ 75845 │ Dragonblight      │         57 │
│  4175 │ Halls of Stone    │          4 │
│ 40309 │ Undercity         │         28 │
│ 78122 │ Dalaran           │          6 │
│ 75845 │ Howling Fjord     │         33 │
│ 35625 │ Borean Tundra     │         63 │
│ 67626 │ Shadowmoon Valley │         12 │
│ 54676 │ Icecrown          │         22 │
│  2877 │ Orgrimmar         │          9 │
│     … │ …                 │          … │
└───────┴───────────────────┴────────────┘

This code summarizes how often each of those characters appeared in different zones while leveling from 70 to 80. It joins the two expressions on the char column, keeps only the records within each character’s leveling window, groups by character and zone, and counts the number of times each character was seen in each zone.
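The group-by-and-count at the heart of that step is exactly what collections.Counter does. A toy sketch keyed on (char, zone) pairs, with hypothetical counts:

```python
from collections import Counter

# Sightings within each character's leveling window: (char, zone) pairs.
sightings = [
    (40309, "Borean Tundra"),
    (40309, "Borean Tundra"),
    (40309, "Undercity"),
    (75845, "Dragonblight"),
]

# Group by (char, zone) and count occurrences, like
# .group_by([_.char, _.zone]).agg(zone_count=_.zone.count()).
zone_counts = Counter(sightings)
print(zone_counts[(40309, "Borean Tundra")])  # → 2
```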

There is another example table we can join against to classify each zone. I’m only interested in two columns, so I’ll select just those and rename them.

zones = ex.wowah_zones_raw.fetch()
zones = zones.select(zone=_.Zone, zone_type=_.Type)
zones
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ zone                 ┃ zone_type ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string               │ string    │
├──────────────────────┼───────────┤
│ Durotar              │ Zone      │
│ The Barrens          │ Zone      │
│ Silverpine Forest    │ Zone      │
│ Stonetalon Mountains │ Zone      │
│ Thunder Bluff        │ City      │
│ Dustwallow Marsh     │ Zone      │
│ Durotar              │ City      │
│ Tirisfal Glades      │ City      │
│ Ashenvale            │ Zone      │
│ Stranglethorn Vale   │ Zone      │
│ …                    │ …         │
└──────────────────────┴───────────┘

Making use of pivot_wider and joining back to our leveler_zones expression will make this a breeze!

zones_pivot = (
    leveler_zones.join(zones, "zone")
    .group_by([_.char, _.zone_type])
    .agg(zone_type_count=_.zone.count())
    .pivot_wider(names_from="zone_type", values_from="zone_type_count")
)
zones_pivot
┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┓
┃ char  ┃ Zone  ┃ City  ┃ Sea   ┃ Battleground ┃ Dungeon ┃ Arena ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━┩
│ int32 │ int64 │ int64 │ int64 │ int64        │ int64   │ int64 │
├───────┼───────┼───────┼───────┼──────────────┼─────────┼───────┤
│ 30491 │    14 │     3 │  NULL │            1 │      17 │     3 │
│ 54357 │    12 │     2 │  NULL │            1 │      12 │     2 │
│ 24239 │    12 │     3 │  NULL │            1 │      14 │     3 │
│ 59778 │    13 │  NULL │  NULL │            2 │      11 │     1 │
│ 71918 │    17 │     3 │  NULL │            1 │      14 │     2 │
│ 73557 │    18 │     1 │     1 │            1 │      18 │     3 │
│ 87205 │    14 │     1 │  NULL │            1 │      14 │     3 │
│ 31900 │    30 │     2 │  NULL │            1 │      36 │     4 │
│   925 │    11 │  NULL │  NULL │            1 │      11 │     2 │
│ 86261 │    14 │  NULL │     1 │            1 │      14 │     2 │
│     … │     … │     … │     … │            … │       … │     … │
└───────┴───────┴───────┴───────┴──────────────┴─────────┴───────┘

If they have a high value in the “Zone” column, they were likely questing. Other players opted to venture into dungeons.
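pivot_wider reshapes the long (char, zone_type, count) rows into one row per character with a column per zone type. A minimal dict-of-dicts sketch of that reshape, using made-up values:

```python
# Long-format rows: (char, zone_type, zone_type_count). Toy values.
long_rows = [
    (1, "Zone", 14),
    (1, "Dungeon", 17),
    (2, "Zone", 12),
]

# Pivot to one mapping per char, keyed by zone_type. Missing
# combinations simply stay absent (NULL in the Ibis result).
wide = {}
for char, zone_type, count in long_rows:
    wide.setdefault(char, {})[zone_type] = count

print(wide)
# → {1: {'Zone': 14, 'Dungeon': 17}, 2: {'Zone': 12}}
```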

Next steps

It’s pretty easy to do complex analysis with Ibis. We churned through over 10 million rows in no time.

Get in touch with us on GitHub or Zulip; we’d love to see more analyses of this data set.
