Analysis of World of Warcraft data

Introduction

I grew up playing games, and with the recent re-release of World of Warcraft Classic, it seems like a perfect time to analyze some in-game data!

This dataset is the product of a Horde player’s diligent recording throughout 2008, capturing the transitional phase between the Burning Crusade and Wrath of the Lich King expansions. Notably, starting November 13, 2008, the data showcases numerous characters venturing into new territories and advancing beyond the former level cap of 70.

Analysis

We’ll determine who logged in the most, who leveled from 70 to 80 the fastest, and what activities these players engaged in, based on the zones they visited. Let’s get to work.

Getting started

Ibis ships with an examples module, which includes this specific data. We’ll use DuckDB here, but this is possible with other backends, and we encourage you to experiment. DuckDB is the default Ibis backend, so it’ll be easy to use with this example.

You can execute pip install ibis-framework[duckdb,examples] to work with Ibis and the example data.

from ibis.interactive import *

wowah_data = ex.wowah_data_raw.fetch()

wowah_data

┏━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ char  ┃ level ┃ race   ┃ charclass ┃ zone                   ┃ guild ┃ timestamp           ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int32 │ int32 │ string │ string    │ string                 │ int32 │ timestamp(6)        │
├───────┼───────┼────────┼───────────┼────────────────────────┼───────┼─────────────────────┤
│ 59425 │     1 │ Orc    │ Rogue     │ Orgrimmar              │   165 │ 2008-01-01 00:02:04 │
│ 65494 │     9 │ Orc    │ Hunter    │ Durotar                │    -1 │ 2008-01-01 00:02:04 │
│ 65325 │    14 │ Orc    │ Warrior   │ Ghostlands             │    -1 │ 2008-01-01 00:02:04 │
│ 65490 │    18 │ Orc    │ Hunter    │ Ghostlands             │    -1 │ 2008-01-01 00:02:04 │
│  2288 │    60 │ Orc    │ Hunter    │ Hellfire Peninsula     │    -1 │ 2008-01-01 00:02:09 │
│  2289 │    60 │ Orc    │ Hunter    │ Hellfire Peninsula     │    -1 │ 2008-01-01 00:02:09 │
│ 61239 │    68 │ Orc    │ Hunter    │ Blade's Edge Mountains │   243 │ 2008-01-01 00:02:14 │
│ 59772 │    69 │ Orc    │ Warrior   │ Shadowmoon Valley      │    35 │ 2008-01-01 00:02:14 │
│ 22937 │    69 │ Orc    │ Rogue     │ Warsong Gulch          │   243 │ 2008-01-01 00:02:14 │
│ 23062 │    69 │ Orc    │ Shaman    │ Shattrath City         │   103 │ 2008-01-01 00:02:14 │
│     … │     … │ …      │ …         │ …                      │     … │ …                   │
└───────┴───────┴────────┴───────────┴────────────────────────┴───────┴─────────────────────┘

Getting table info

Let’s learn more about these fields. Are there any nulls we should consider? We can use the info method on our Ibis expression.

wowah_data.info()

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━┓
┃ name      ┃ type         ┃ nullable ┃ nulls ┃ non_nulls ┃ null_frac ┃ pos  ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━┩
│ string    │ string       │ boolean  │ int64 │ int64     │ float64   │ int8 │
├───────────┼──────────────┼──────────┼───────┼───────────┼───────────┼──────┤
│ char      │ int32        │ True     │     0 │  10826734 │       0.0 │    0 │
│ level     │ int32        │ True     │     0 │  10826734 │       0.0 │    1 │
│ race      │ string       │ True     │     0 │  10826734 │       0.0 │    2 │
│ charclass │ string       │ True     │     0 │  10826734 │       0.0 │    3 │
│ zone      │ string       │ True     │     0 │  10826734 │       0.0 │    4 │
│ guild     │ int32        │ True     │     0 │  10826734 │       0.0 │    5 │
│ timestamp │ timestamp(6) │ True     │     0 │  10826734 │       0.0 │    6 │
└───────────┴──────────────┴──────────┴───────┴───────────┴───────────┴──────┘

We can also use value_counts on specific columns if we want to learn more.

wowah_data.race.value_counts()

┏━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ race      ┃ race_count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ string    │ int64      │
├───────────┼────────────┤
│ Undead    │    2530156 │
│ Orc       │     933056 │
│ Troll     │    1102409 │
│ Blood Elf │    3929995 │
│ Tauren    │    2331118 │
└───────────┴────────────┘

We don’t have any missing values, and the value_counts results match what I would expect.

How about duplicates? We can check the count of unique rows against the total count.

print(wowah_data.count())
print(wowah_data.nunique())
print(wowah_data.count() == wowah_data.nunique())

10826734
10823177
False

So we have some duplicates: 3,557 rows (10,826,734 - 10,823,177). What could the duplicate rows be?

We can find them like this.

wowah_duplicates = wowah_data.mutate(
    row_num=ibis.row_number().over(
        ibis.window(group_by=wowah_data.columns, order_by=_.timestamp)
    )
).filter(_.row_num > 0)

wowah_duplicates

┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ char  ┃ level ┃ race      ┃ charclass ┃ zone               ┃ guild ┃ timestamp           ┃ row_num ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ int32 │ int32 │ string    │ string    │ string             │ int32 │ timestamp(6)        │ int64   │
├───────┼───────┼───────────┼───────────┼────────────────────┼───────┼─────────────────────┼─────────┤
│   341 │    70 │ Undead    │ Rogue     │ Terokkar Forest    │   204 │ 2008-06-05 01:57:37 │       1 │
│   980 │    70 │ Orc       │ Hunter    │ Isle of Quel'Danas │    79 │ 2008-10-25 05:05:03 │       1 │
│  1321 │    70 │ Tauren    │ Druid     │ Isle of Quel'Danas │     4 │ 2008-04-17 15:38:46 │       1 │
│  2866 │    70 │ Undead    │ Priest    │ Nagrand            │   103 │ 2008-10-23 21:01:32 │       1 │
│  4318 │    70 │ Undead    │ Warrior   │ Karazhan           │    19 │ 2008-07-12 18:05:06 │       1 │
│ 11316 │    70 │ Undead    │ Mage      │ Alterac Valley     │   104 │ 2008-03-22 00:30:48 │       1 │
│ 17774 │    70 │ Blood Elf │ Hunter    │ Shattrath City     │   271 │ 2008-07-12 18:06:12 │       1 │
│ 19598 │    70 │ Tauren    │ Shaman    │ The Mechanar       │   101 │ 2008-06-05 11:10:57 │       1 │
│ 21828 │    70 │ Tauren    │ Shaman    │ Durotar            │    -1 │ 2008-10-23 19:10:31 │       1 │
│ 22484 │    70 │ Tauren    │ Hunter    │ Warsong Gulch      │   315 │ 2008-07-12 18:03:23 │       1 │
│     … │     … │ …         │ …         │ …                  │     … │ …                   │       … │
└───────┴───────┴───────────┴───────────┴────────────────────┴───────┴─────────────────────┴─────────┘
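To make the window logic concrete, here is a minimal plain-Python sketch of the same idea, with no Ibis required. It numbers each row within its group of identical rows, starting at 0 (consistent with the row_num > 0 filter and the row_num values of 1 shown above), and treats anything past the first copy as a duplicate. The rows are two entries copied from the output above, one of them repeated; the helper name number_within_groups is hypothetical.

```python
from collections import defaultdict

# Rows in the dataset's column order:
# (char, level, race, charclass, zone, guild, timestamp)
rows = [
    (341, 70, "Undead", "Rogue", "Terokkar Forest", 204, "2008-06-05 01:57:37"),
    (341, 70, "Undead", "Rogue", "Terokkar Forest", 204, "2008-06-05 01:57:37"),
    (980, 70, "Orc", "Hunter", "Isle of Quel'Danas", 79, "2008-10-25 05:05:03"),
]


def number_within_groups(rows):
    """Pair each row with its 0-based position among identical rows."""
    seen = defaultdict(int)
    numbered = []
    for row in rows:
        numbered.append((row, seen[row]))
        seen[row] += 1
    return numbered


numbered = number_within_groups(rows)
duplicates = [row for row, row_num in numbered if row_num > 0]
deduplicated = [row for row, row_num in numbered if row_num == 0]
print(len(duplicates), len(deduplicated))
```

Keeping only the rows numbered 0 gives the same result as calling distinct on the table.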

I suspect this data was captured by a single player spamming “/who” in the game, most likely using an AddOn, about every ten minutes. Some players could have been captured twice, depending on how the command was being filtered.

We can go ahead and remove these duplicates.

wowah_data = wowah_data.distinct()

Which player logged in the most?

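We mentioned that a single player was likely capturing these results. Let’s find out who that is. Each row is one snapshot of a character that was online, so counting rows per character tells us who was seen online most often.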

(
    wowah_data
    .group_by([_.char, _.race, _.charclass])
    .agg(sessions=_.count())
    .order_by(_.sessions.desc())
)

┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ char  ┃ race      ┃ charclass ┃ sessions ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ int32 │ string    │ string    │ int64    │
├───────┼───────────┼───────────┼──────────┤
│   182 │ Troll     │ Hunter    │    42770 │
│ 57741 │ Undead    │ Warlock   │    16237 │
│  1384 │ Undead    │ Warlock   │    15878 │
│ 59489 │ Blood Elf │ Priest    │    13977 │
│ 62239 │ Undead    │ Mage      │    13776 │
│ 62446 │ Blood Elf │ Mage      │    13011 │
│ 31184 │ Undead    │ Rogue     │    12019 │
│ 24126 │ Blood Elf │ Warlock   │    11791 │
│ 61105 │ Blood Elf │ Priest    │    11731 │
│ 35072 │ Blood Elf │ Paladin   │    11399 │
│     … │ …         │ …         │        … │
└───────┴───────────┴───────────┴──────────┘

With 42,770 sessions, that Troll Hunter, who never exceeded level 1, is likely our person.
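A quick back-of-the-envelope calculation shows why that count stands out. This sketch assumes one snapshot every ten minutes, which is only the interval guessed at earlier, not a documented value.

```python
# Rough time online implied by the snapshot count.
snapshots = 42_770
minutes_between_snapshots = 10  # assumption: the guessed capture interval

days_online = snapshots * minutes_between_snapshots / 60 / 24
print(f"{days_online:.0f} days")
```

Under that assumption the character was online for roughly 297 days of 2008, far more than the next most frequent character, whose 16,237 snapshots work out to about 113 days.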

Who leveled the fastest from 70–80?

At the end of the year, there were 884 level 80s. Who leveled the fastest?

Finding this answer will involve filtering, grouping, and aggregating to compute each character’s time taken to level from 70 to 80.

Let’s start by creating an expression that finds the characters who reached level 80, then join it back to the data and keep only the rows where those characters were level 70 or 80. We only need three columns, so we’ll select just those.

max_level_chars = wowah_data.filter(_.level == 80).select(_.char).distinct()
wowah_data_filtered = (
    wowah_data
    .join(max_level_chars, "char", how="inner")
    .filter(_.level.isin([70, 80]))
    .select(_.char, _.level, _.timestamp)
)
wowah_data_filtered

┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ char  ┃ level ┃ timestamp           ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ int32 │ int32 │ timestamp(6)        │
├───────┼───────┼─────────────────────┤
│ 62226 │    70 │ 2008-01-09 22:00:20 │
│ 21828 │    70 │ 2008-01-09 22:00:30 │
│ 62763 │    70 │ 2008-01-09 22:01:06 │
│ 27547 │    70 │ 2008-01-09 22:01:51 │
│  5730 │    70 │ 2008-01-09 22:02:07 │
│ 34216 │    70 │ 2008-01-09 22:09:27 │
│ 40951 │    70 │ 2008-01-09 22:09:32 │
│ 45552 │    70 │ 2008-01-09 22:10:08 │
│ 19481 │    70 │ 2008-01-09 22:10:23 │
│ 19085 │    70 │ 2008-01-09 22:10:23 │
│     … │     … │ …                   │
└───────┴───────┴─────────────────────┘
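The inner join against the distinct list of level-80 characters acts as a membership filter. A minimal plain-Python sketch of the two steps, using made-up rows in which only the hypothetical character 1 reaches level 80:

```python
# Made-up (char, level, timestamp) rows.
rows = [
    (1, 69, "2008-11-01 10:00:00"),
    (1, 70, "2008-11-13 10:00:00"),
    (1, 80, "2008-11-20 10:00:00"),
    (2, 70, "2008-11-13 10:00:00"),
    (2, 75, "2008-11-20 10:00:00"),
]

# Step 1: the distinct set of characters that have a level-80 row.
max_level_chars = {char for char, level, _ in rows if level == 80}

# Step 2: keep rows for those characters only, and only at level 70 or 80.
filtered = [
    (char, level, ts)
    for char, level, ts in rows
    if char in max_level_chars and level in (70, 80)
]
print(filtered)
```

Character 2 drops out entirely because it never appears at level 80, and character 1 keeps only its level 70 and level 80 rows.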

Let’s use the where option to restrict each aggregation to the rows at a single level.

level_calc = (
    wowah_data_filtered.group_by(["char"])
    .mutate(
        ts_70=_.timestamp.max(where=_.level == 70),
        ts_80=_.timestamp.min(where=_.level == 80),
    )
    .drop(["level", "timestamp"])
    .distinct()
    .mutate(days_from_70_to_80=(_.ts_80.delta(_.ts_70, "day")))
    .order_by(_.days_from_70_to_80)
)

The filtered data is grouped by character, and two new columns are created: ts_70, the last time the character was seen at level 70, and ts_80, the first time they were seen at level 80. Then we drop the columns we no longer need, keep the distinct rows, and calculate the number of days taken to level from 70 to 80. Then we sort it!
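The two conditional aggregates can be sketched in plain Python to show what each one picks out. The snapshots below are made up for two hypothetical characters.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (char, level, timestamp) snapshots.
rows = [
    (1, 70, datetime(2008, 11, 13, 9, 0)),
    (1, 70, datetime(2008, 11, 13, 21, 0)),  # last time seen at 70
    (1, 80, datetime(2008, 11, 17, 8, 0)),   # first time seen at 80
    (1, 80, datetime(2008, 11, 18, 8, 0)),
    (2, 70, datetime(2008, 11, 14, 12, 0)),
    (2, 80, datetime(2008, 11, 20, 12, 0)),
]

# Collect timestamps per character, split by level.
by_char = defaultdict(lambda: {70: [], 80: []})
for char, level, ts in rows:
    by_char[char][level].append(ts)

# One result per character: latest level-70 snapshot, earliest level-80 snapshot.
level_calc = {
    char: {"ts_70": max(seen[70]), "ts_80": min(seen[80])}
    for char, seen in by_char.items()
}
print(level_calc[1])
```

Taking the maximum at 70 and the minimum at 80 gives the tightest window the snapshots allow around the leveling period.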

level_calc

┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ char  ┃ ts_70               ┃ ts_80               ┃ days_from_70_to_80 ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ int32 │ timestamp(6)        │ timestamp(6)        │ int64              │
├───────┼─────────────────────┼─────────────────────┼────────────────────┤
│ 68544 │ 2008-11-27 01:58:13 │ 2008-11-30 16:19:04 │                  3 │
│  1450 │ 2008-11-18 02:46:37 │ 2008-11-21 11:45:35 │                  3 │
│   399 │ 2008-11-18 00:02:40 │ 2008-11-22 20:23:57 │                  4 │
│ 86264 │ 2008-11-20 23:09:17 │ 2008-11-24 16:35:46 │                  4 │
│ 51738 │ 2008-12-03 20:10:00 │ 2008-12-07 11:26:08 │                  4 │
│  1003 │ 2008-11-18 00:29:25 │ 2008-11-22 23:47:06 │                  4 │
│ 40483 │ 2008-11-17 22:35:42 │ 2008-11-21 15:31:59 │                  4 │
│ 88331 │ 2008-12-12 01:47:57 │ 2008-12-16 10:19:23 │                  4 │
│ 86396 │ 2008-12-03 05:04:42 │ 2008-12-07 17:04:43 │                  4 │
│ 86265 │ 2008-11-20 23:28:40 │ 2008-11-24 07:47:08 │                  4 │
│     … │ …                   │ …                   │                  … │
└───────┴─────────────────────┴─────────────────────┴────────────────────┘
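One thing to keep in mind when reading the ranking: the day values shown above look consistent with counting midnights crossed rather than full 24-hour periods. The plain-Python sketch below reproduces three rows of the output under that assumption. It is an observation about the displayed numbers, not a statement about how every backend implements delta.

```python
from datetime import datetime

# (char, ts_70, ts_80) copied from three rows of the output above.
rows = [
    (68544, datetime(2008, 11, 27, 1, 58, 13), datetime(2008, 11, 30, 16, 19, 4)),
    (86265, datetime(2008, 11, 20, 23, 28, 40), datetime(2008, 11, 24, 7, 47, 8)),
    (86396, datetime(2008, 12, 3, 5, 4, 42), datetime(2008, 12, 7, 17, 4, 43)),
]

results = []
for char, ts_70, ts_80 in rows:
    # Elapsed time expressed in fractional days.
    elapsed_days = (ts_80 - ts_70).total_seconds() / 86_400
    # Number of calendar-day boundaries between the two timestamps.
    midnights_crossed = (ts_80.date() - ts_70.date()).days
    results.append((char, round(elapsed_days, 2), midnights_crossed))

for result in results:
    print(result)
```

The midnight counts match the table for all three characters. Note that char 86265 is listed at 4 days even though less time elapsed for it than for char 68544, which is listed at 3, so sorting on the elapsed time instead would give a finer-grained ranking.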

This isn’t perfect. I found a player who seems to have quit in March and then returned for the new expansion. It looks like they hit 71 before a later login at level 70 was captured, so their timestamps don’t line up the way the calculation assumes. If you’re curious, take a look at char=21951 for yourself.

How did they level?

Let’s grab all the details from the previous result and join it back to get the timestamp and zone data.

leveler_zones = (
    level_calc.join(wowah_data, "char", how="inner")
    .filter(_.timestamp.between(_.ts_70, _.ts_80))
    .group_by([_.char, _.zone])
    .agg(zone_count=_.zone.count())
)
leveler_zones

┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ char  ┃ zone              ┃ zone_count ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ int32 │ string            │ int64      │
├───────┼───────────────────┼────────────┤
│ 40309 │ Borean Tundra     │         70 │
│ 75845 │ Dragonblight      │         57 │
│  4175 │ Halls of Stone    │          4 │
│ 40309 │ Undercity         │         28 │
│ 78122 │ Dalaran           │          6 │
│ 75845 │ Howling Fjord     │         33 │
│ 35625 │ Borean Tundra     │         63 │
│ 67626 │ Shadowmoon Valley │         12 │
│ 54676 │ Icecrown          │         22 │
│  2877 │ Orgrimmar         │          9 │
│     … │ …                 │          … │
└───────┴───────────────────┴────────────┘

This code summarizes how often those characters appear in different zones while leveling from 70 to 80. It joins the two tables on char, keeps the records that fall within each character’s leveling window, groups by character and zone, and counts the number of times each character was found in each zone.
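The same join, window filter, and count can be sketched in plain Python. Everything below is made up for illustration: one hypothetical character with a leveling window, plus a few snapshots inside and outside it.

```python
from collections import Counter
from datetime import datetime

# Made-up leveling window per character: char -> (ts_70, ts_80).
level_calc = {1: (datetime(2008, 11, 13), datetime(2008, 11, 17))}

# Made-up (char, zone, timestamp) snapshots.
snapshots = [
    (1, "Orgrimmar", datetime(2008, 11, 12, 20, 0)),      # before the window
    (1, "Borean Tundra", datetime(2008, 11, 13, 10, 0)),
    (1, "Borean Tundra", datetime(2008, 11, 13, 10, 10)),
    (1, "Dragonblight", datetime(2008, 11, 15, 18, 0)),
    (1, "Dalaran", datetime(2008, 11, 18, 9, 0)),         # after the window
    (2, "Durotar", datetime(2008, 11, 14, 9, 0)),         # no leveling window
]

# Keep snapshots inside the character's window (inclusive), then count per (char, zone).
zone_counts = Counter(
    (char, zone)
    for char, zone, ts in snapshots
    if char in level_calc and level_calc[char][0] <= ts <= level_calc[char][1]
)
print(zone_counts)
```

Only the snapshots taken between the two timestamps survive, so the counts reflect where the character was seen while leveling rather than before or after.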

There is another example table we can join to look up the type of each zone. I’m only interested in two columns, so I’ll select just those and rename them.

zones = ex.wowah_zones_raw.fetch()
zones = zones.select(zone=_.Zone, zone_type=_.Type)
zones

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ zone                 ┃ zone_type ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string               │ string    │
├──────────────────────┼───────────┤
│ Durotar              │ Zone      │
│ The Barrens          │ Zone      │
│ Silverpine Forest    │ Zone      │
│ Stonetalon Mountains │ Zone      │
│ Thunder Bluff        │ City      │
│ Dustwallow Marsh     │ Zone      │
│ Durotar              │ City      │
│ Tirisfal Glades      │ City      │
│ Ashenvale            │ Zone      │
│ Stranglethorn Vale   │ Zone      │
│ …                    │ …         │
└──────────────────────┴───────────┘

Making use of pivot_wider and joining back to our leveler_zones expression will make this a breeze!

zones_pivot = (
    leveler_zones.join(zones, "zone")
    .group_by([_.char, _.zone_type])
    .agg(zone_type_count=_.zone.count())
    .pivot_wider(names_from="zone_type", values_from="zone_type_count")
)
zones_pivot

┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┓
┃ char  ┃ Zone  ┃ City  ┃ Sea   ┃ Battleground ┃ Dungeon ┃ Arena ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━┩
│ int32 │ int64 │ int64 │ int64 │ int64        │ int64   │ int64 │
├───────┼───────┼───────┼───────┼──────────────┼─────────┼───────┤
│ 30491 │    14 │     3 │  NULL │            1 │      17 │     3 │
│ 54357 │    12 │     2 │  NULL │            1 │      12 │     2 │
│ 24239 │    12 │     3 │  NULL │            1 │      14 │     3 │
│ 59778 │    13 │  NULL │  NULL │            2 │      11 │     1 │
│ 71918 │    17 │     3 │  NULL │            1 │      14 │     2 │
│ 73557 │    18 │     1 │     1 │            1 │      18 │     3 │
│ 87205 │    14 │     1 │  NULL │            1 │      14 │     3 │
│ 31900 │    30 │     2 │  NULL │            1 │      36 │     4 │
│   925 │    11 │  NULL │  NULL │            1 │      11 │     2 │
│ 86261 │    14 │  NULL │     1 │            1 │      14 │     2 │
│     … │     … │     … │     … │            … │       … │     … │
└───────┴───────┴───────┴───────┴──────────────┴─────────┴───────┘

Each value counts how many different zones of that type a character was seen in while leveling. A high value in the “Zone” column suggests they were questing. Other players opted to venture into dungeons.
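For readers new to pivoting, here is a minimal plain-Python sketch of the long-to-wide reshape. The rows are made up, and None stands in for the NULL shown where a character has no zones of a given type.

```python
# Made-up long-format rows: (char, zone_type, zone_type_count).
long_rows = [
    (1, "Zone", 14),
    (1, "Dungeon", 17),
    (2, "Zone", 12),
    (2, "City", 2),
]

# Column order follows the first appearance of each zone type.
zone_types = list(dict.fromkeys(zone_type for _, zone_type, _ in long_rows))

# One row per character, one key per zone type, None where nothing was recorded.
wide = {}
for char, zone_type, count in long_rows:
    row = wide.setdefault(char, dict.fromkeys(zone_types))
    row[zone_type] = count
print(wide)
```

Each distinct zone type becomes its own column, which is why some cells in the output above are NULL.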

Next steps

It’s pretty easy to do complex analysis with Ibis. We churned through over 10 million rows in no time.

Get in touch with us on GitHub or Zulip. We’d love to see more analyses of this dataset.