Game servers: UDP vs TCP

When writing networked games, the question of UDP vs TCP will eventually come up.

Typically you will hear people say things like: “Unless you’re doing action games, you can use TCP” or “You can use TCP for your MMO, because look at WoW – it uses TCP!”

Unfortunately, these opinions don’t properly reflect the complexity of the TCP/UDP question.

Background

First off, let me state that my background is mainly TCP programming. I worked for years on a leading poker network’s game servers and we’d typically run 4,000 – 10,000 connections on each server instance during peak (with multiple instances running on a single machine) without any problems. From my point of view, TCP is the safe and well-known alternative.

Despite that, our current project is using UDP, and there is no way we could have made it work well with TCP. In fact, it started out with TCP, but when it became obvious that we couldn’t get the connection quality we wanted, we switched to UDP.

What TCP means in practice

In theory, the advantages of TCP are things like:

  • Straightforward persistent connections
  • Reliable messaging
  • Arbitrarily sized messages

Anyone with hands-on experience with TCP knows that a solid implementation needs to handle many not-so-obvious corner cases, such as disconnect detection, packet congestion due to slow client response, various DoS attack vectors relating to establishing connections, blocking vs non-blocking IO etc.

Despite the up-front ease of use, a good TCP solution isn’t easy to code.

However, the most damning property of TCP is its congestion control. TCP interprets packet loss as a sign of limited bandwidth and responds by throttling its sending rate.

On 3G or WiFi, where packets are often lost for reasons that have nothing to do with bandwidth, you want the replacement packet to be sent as soon as possible, but TCP’s congestion control does the reverse!

There is no way to get around this; it is simply how TCP works on a very fundamental level. This is what can push a ping up to the 1000+ ms range on 3G or WiFi after the loss of a single packet.
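To put rough numbers on that delay, here is a small sketch of the standard TCP retransmission timer calculation (RFC 6298). This is only one of the mechanisms involved, and it matters most when traffic is too sparse for fast retransmit to kick in. The function names and the sample values are my own; the smoothing constants and the recommended 1 second floor come from the RFC, and real stacks may use a lower floor (Linux commonly uses 200 ms).

```python
def rto_from_samples(rtt_samples_s, min_rto_s=1.0, clock_granularity_s=0.001):
    """Retransmission timeout (seconds) after a series of RTT measurements."""
    srtt = None    # smoothed round-trip time
    rttvar = None  # round-trip time variation
    rto = min_rto_s
    for r in rtt_samples_s:
        if srtt is None:
            # First measurement initialises both estimators.
            srtt = r
            rttvar = r / 2
        else:
            # beta = 1/4 for the variation, alpha = 1/8 for the smoothed RTT.
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
            srtt = 0.875 * srtt + 0.125 * r
        # The timeout is never allowed below the configured floor.
        rto = max(min_rto_s, srtt + max(clock_granularity_s, 4 * rttvar))
    return rto


def wait_before_retransmit(rto_s, n):
    """Time from the original send until the n-th retransmission goes out.

    The timer doubles after every timeout (exponential backoff).
    """
    return sum(rto_s * 2 ** i for i in range(n))


# A steady 100 ms connection still waits a full second with the RFC floor.
print(rto_from_samples([0.1] * 20))                 # 1.0
print(rto_from_samples([0.1] * 20, min_rto_s=0.2))  # 0.2
print(wait_before_retransmit(1.0, 3))               # 7.0
```

So even on a connection with a steady 100 ms round trip, a lost packet that has to wait for the timer costs far more than one extra round trip, and repeated losses get expensive quickly.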

Why UDP is “hard”

UDP is both easier and more difficult than TCP.

For example, UDP is packet based, which is something you’ll have to build yourself on top of TCP’s byte stream. You also use a single socket for all communication, unlike TCP, which requires a socket for each connected client. Both of these are mostly good things.
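To show what rolling your own packets on TCP means, here is a minimal sketch of length-prefixed framing. The 4-byte big-endian prefix and the names `frame` and `FrameDecoder` are assumptions for illustration, not a standard; any scheme that lets the receiver find message boundaries would do.

```python
import struct

# Hypothetical wire format: 4-byte big-endian length, then the payload.
LENGTH_PREFIX = struct.Struct("!I")


def frame(payload: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find its end."""
    return LENGTH_PREFIX.pack(len(payload)) + payload


class FrameDecoder:
    """Reassembles whole messages from arbitrary chunks of a TCP byte stream."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, data: bytes) -> list:
        """Add received bytes and return every message that is now complete."""
        self._buffer += data
        messages = []
        while len(self._buffer) >= LENGTH_PREFIX.size:
            (length,) = LENGTH_PREFIX.unpack_from(self._buffer)
            end = LENGTH_PREFIX.size + length
            if len(self._buffer) < end:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[LENGTH_PREFIX.size:end]))
            del self._buffer[:end]
        return messages


# TCP may deliver the bytes in any chunking; here 3 bytes at a time.
stream = frame(b"move north") + frame(b"attack 42")
decoder = FrameDecoder()
received = []
for i in range(0, len(stream), 3):
    received.extend(decoder.feed(stream[i:i + 3]))
print(received)  # [b'move north', b'attack 42']
```

With UDP, each datagram you receive is already one whole message, so none of this buffering is needed.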

However, in most situations you actually need some concept of a connection, some rudimentary ordering and often also reliability. None of these is offered by UDP “out of the box”, while you get them all for free with TCP.

This is why people often recommend TCP. With TCP you can get started without worrying too much about those things, at least until you start having 500+ simultaneous connections.
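As a rough idea of how little is needed for the first two, here is a sketch of a datagram header carrying a connection id and a sequence number, with a receiver that drops duplicates and stale packets. The header layout and all names are made up for this example, and sequence number wrap-around is ignored to keep it short.

```python
import struct

# Hypothetical header: connection id (uint32) and sequence number (uint32).
HEADER = struct.Struct("!II")


def encode(conn_id: int, seq: int, payload: bytes) -> bytes:
    return HEADER.pack(conn_id, seq) + payload


class DatagramServer:
    """Tracks per-connection state for datagrams arriving on one socket."""

    def __init__(self):
        self.last_seq = {}  # conn_id -> highest sequence number accepted

    def on_datagram(self, datagram: bytes):
        """Return (conn_id, payload), or None if the datagram is dropped."""
        if len(datagram) < HEADER.size:
            return None  # too short to contain a header
        conn_id, seq = HEADER.unpack_from(datagram)
        if seq <= self.last_seq.get(conn_id, -1):
            return None  # duplicate or older than what we already handled
        self.last_seq[conn_id] = seq
        return conn_id, datagram[HEADER.size:]


server = DatagramServer()
print(server.on_datagram(encode(7, 0, b"pos 1,1")))  # (7, b'pos 1,1')
print(server.on_datagram(encode(7, 2, b"pos 3,1")))  # (7, b'pos 3,1')
print(server.on_datagram(encode(7, 1, b"pos 2,1")))  # None (arrived late)
```

Dropping late packets like this suits state that is constantly refreshed, such as positions. Anything that must arrive needs a reliability layer on top.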

So yes, UDP doesn’t offer the whole kit, but as we’ll see, that’s exactly why it’s so great. In a way, TCP is to UDP what something like Hibernate is to writing your queries by hand in SQL.

The flawed case for TCP

People often give the advice to go with TCP on the idea that “TCP is just as fast as UDP” or “successful game X is using it, so it works”, not really understanding why it works in that particular game, and why the case for UDP isn’t about regular packet delivery speed.

So why does World of Warcraft work with TCP? First of all we need to rephrase that question. The question should be “why does World of Warcraft work despite the occasional 1000+ ms delay?”. Because that is the reality of TCP: on dropped packets you’ll get huge lags, as TCP first needs to detect the missing packet and then resend it, all while cutting down throughput.

Reliable UDP will also have a delay, but since it’s a property of whatever protocol you write on top of UDP, it’s possible to reduce delays in many ways – unlike TCP, where it’s rolled into the TCP protocol itself and can’t be changed.

[At this point, some people will start talking about Nagle’s algorithm, which is pretty much the first thing you disable in any TCP implementation where latency is important.]
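For reference, disabling Nagle’s algorithm is a single socket option. A minimal sketch using Python’s standard `socket` module (the socket is never connected here, the option is only set and read back):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# TCP_NODELAY disables Nagle's algorithm, so small writes are sent
# immediately instead of being held back to be coalesced.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)

sock.close()
```

This removes the delay added to small writes, but it does nothing about the lag caused by a lost packet.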

So why does World of Warcraft (and other games) work with these delays?

It’s simply because they’re able to hide the latency.

In the case of World of Warcraft, there are no player-to-player collisions. Such collisions can’t be reliably predicted on the client, but player-to-environment collisions can, so the latter work fine with TCP.

Looking at combat in WoW, it’s easy to realize that commands sent to the servers are really something along the lines of attack_entity(entity_id) or cast_spell(entity_id, spell_id). In other words, targeting is position independent. Furthermore, things like the attack motion or spell effect can start without first getting confirmation from the server; if the server response differs from the client prediction, the client simply shows a “fizzle” effect.

Starting an action before confirmation is a typical latency/lag hiding technique.

A few years back I wrote the client for a card game called Five Card Jazz. It was HTTP-based, which latency-wise is a lot worse than a plain persistent TCP connection.

We used the simple card draw and flip up animation to hide latency so that delays were only apparent in the case of very poor connections. The method was typical: send the request and start the animation drawing cards from the deck, but wait with the final flip up to reveal the cards until the server response arrived. WoW’s battle effects work in a similar manner.
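The pattern can be sketched in a few lines of asyncio. Everything here is a stand-in: `fake_server_draw` simulates the network round trip, the card values and timings are invented, and a real client would drive actual animations rather than sleep.

```python
import asyncio


async def fake_server_draw(latency_s):
    """Stand-in for the server request; returns the drawn cards."""
    await asyncio.sleep(latency_s)
    return ["AS", "KD"]


async def play_draw_animation(duration_s, log):
    """Stand-in for the animation of cards sliding off the deck."""
    log.append("draw animation started")
    await asyncio.sleep(duration_s)
    log.append("draw animation finished")


async def draw_cards(latency_s, animation_s):
    log = []
    # Send the request first, then start animating without waiting for it.
    request = asyncio.create_task(fake_server_draw(latency_s))
    log.append("request sent")
    await play_draw_animation(animation_s, log)
    # Only the final reveal needs the server's answer.
    cards = await request
    log.append("flip: " + " ".join(cards))
    return log


log = asyncio.run(draw_cards(latency_s=0.02, animation_s=0.05))
print(log)
```

The player only notices a delay when the latency exceeds the animation time, because the reveal waits for whichever of the two finishes last.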

This means that the choice of TCP vs UDP should basically be: “Can we hide latency or not?”

When TCP doesn’t work

A game running TCP either needs to be able to work well with occasional lags (poker clients typically do: an occasional one second lag isn’t something people will get annoyed about), or have good latency mitigation techniques.

But what if you’re running a game where you can’t really apply any latency mitigation? Player vs player action games often fall into this category, but it’s not confined to action games.

An example:

I’m currently working on a multiplayer game (War Arcana).

During typical play, you quickly move your character over a world map initially covered with a fog of war, but which is progressively revealed as you explore.

Due to certain game rules and to prevent cheating, the server can only reveal information about the character’s immediate surroundings. This means that unlike WoW, it’s not possible to fully complete the movement until the server response arrives. What makes this a hard problem, compared to the card reveal of Five Card Jazz, is that we can tolerate at most 500 ms of latency before movement feels sluggish.

When prototyping this, everything worked fine as long as everything was on the same LAN, but as soon as we went to WiFi, the movement would randomly stutter and lag. Writing a few test programs showed the WiFi occasionally dropping packets, and every time that happened, server response time shot up from 100-150 ms to 1000-2000 ms.

No amount of tweaking of TCP settings could get around this issue.

We replaced the TCP code with a custom reliable UDP implementation which cut the penalty of a lost packet down to an additional 50 ms(!), less than the time of a complete roundtrip. And that was only possible due to having complete control of the reliability layer on top of UDP.

Myth: Reliable UDP is TCP implemented poorly

Have you heard this said: “Reliable UDP is just like TCP, so use TCP instead”?

The problem here is that this statement is false. Reliable UDP is unlikely to implement TCP’s particular brand of congestion control. In fact, this is exactly the biggest reason why you use reliable UDP instead of TCP – to get rid of its congestion control.

Another important point is how the “reliable” part of “Reliable UDP” works. There are many possible variants. I really like many of the ideas of the Quake 3 networking code which inspired the War Arcana UDP protocol.
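One idea commonly attributed to the Quake 3 model is to skip retransmission timers altogether: every outgoing packet carries all reliable commands that have not yet been acknowledged, so a lost packet is repaired by whatever packet is sent next. The sketch below illustrates that idea only. It is not the War Arcana protocol, the class names are hypothetical, and packets are plain Python lists rather than bytes on a socket.

```python
class ReliableSender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = {}  # seq -> command, kept in send order

    def queue(self, command):
        self.unacked[self.next_seq] = command
        self.next_seq += 1

    def build_packet(self):
        # Every packet repeats all commands the peer has not acknowledged.
        return list(self.unacked.items())

    def on_ack(self, acked_seq):
        # The ack is cumulative: everything up to acked_seq has arrived.
        for seq in [s for s in self.unacked if s <= acked_seq]:
            del self.unacked[seq]


class ReliableReceiver:
    def __init__(self):
        self.last_executed = -1

    def on_packet(self, packet):
        """Return (new commands in order, sequence number to acknowledge)."""
        fresh = []
        for seq, command in packet:
            if seq == self.last_executed + 1:  # skip already seen commands
                fresh.append(command)
                self.last_executed = seq
        return fresh, self.last_executed


sender, receiver = ReliableSender(), ReliableReceiver()

sender.queue("move north")
lost_packet = sender.build_packet()  # pretend this one never arrives

sender.queue("open door")
commands, ack = receiver.on_packet(sender.build_packet())
print(commands)  # ['move north', 'open door']

sender.on_ack(ack)
print(sender.build_packet())  # []
```

If the game sends a packet every 50 ms anyway, the cost of a lost packet is roughly one send interval rather than a retransmission timeout. The price is some extra bandwidth for the repeated commands.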

You can also use one of the many UDP libraries that support reliable UDP, although the reliability layer might be more general and as such a bit less optimized than a hand-rolled implementation could be.

The bottom line

So UDP or TCP?

  • Use HTTP/HTTPS over TCP if you are making occasional, client-initiated stateless queries and an occasional delay is ok.
  • Use persistent plain TCP sockets if both client and server independently send packets but an occasional delay is ok (e.g. online poker, many MMOs).
  • Use UDP if both client and server may independently send packets and occasional lag is not ok (e.g. most multiplayer action games, some MMOs).

These are mixable too: Your MMO client might first use HTTP to get the latest updates, then connect to the game servers using UDP.

Never be afraid of using the best tool for a task.