Primary Key/Unique ID Column Data Type Considerations

Primary Key/Unique ID Column Data Type Considerations

Background

Sequential integer for primary keys/unique IDs has been the default choice in most database systems around the world. It has proven to be effective given the int type selected is large enough to not cause any overflow issues.
UUIDs (Universally Unique Identifiers) have recently become popular, particularly in distributed and federated systems, as they can be generated without a central authority.
We shall also explore a third type - TSIDs (Time-Sorted Unique Identifiers), used by applications like Twitter and Discord (they use Snowflake IDs which has inspired various OSS implementations of TSIDs).

This text provides a walkthrough of the advantages and disadvantages of sequential Integers and UUIDs, while highlighting how TSIDs combine the best of both sequential integers as well as UUIDs.

Note: While I have used certain terms specific to PostgreSQL (e.g., WAL), the concepts discussed remain applicable to other DBMS's as well. Also, although this article may feel text-heavy, it should serve with useful insights and considerations for making informed decisions.

Comparing Sequential Integers and UUIDs as Primary Keys/Unique IDs

Sequential Integers

Advantages

  1. Native support across virtually all SQL databases.

  2. Modest space requirements (int - 4 bytes, bigint - 8 bytes) .

  3. Sequential integers index well due to data locality in the index disk pages; this means that similar records reside together on a disk page.

    • This allows the DBMS to optimize writes as simple "appends" to a disk block of the index.

    • The sequential property also enablesefficient data retrievalsin use caseslike the range or other batch queries.

    • The WAL writes are also likely to be more efficient as Postgres does Full Page Writes (FPWs) after every checkpoint - it is likely that a limited set of disk pages would be written to the WAL.

  4. Another advantage (sometimes underrated) is that they are easy for humans to read - they're also easier to remember and communicate (for e.g., order IDs - 67545, 89833, etc.)

  5. Sequential integers are chronologically sorted (new values always have a larger value than the older ones) which could be useful for debgging or auditing (although, having a creation_time column could be a better option for such purposes?).

Disadvantages

  1. One major problem is in the case of distributed systems, where there is a significant risk of generating colliding sequential integers such as order IDs, employee IDs, across different nodes. In other words, it is practically impossible to generate sequential IDs without a central system to ensure uniqueness throughout the distributed system. Thus, using sequential integers stops being a viable optionwhen a client, different nodes, or multiple leaders of a database cluster are supposed to generate unique IDs without any central authority.

  2. Sequential integers add an enumeration attack vector to the system. If a malicious user gets one ID, they can make accurate predictions about the adjacent and therefore, other IDs as well. However, it should be evaluated if the IDs are worth concealing for a given use case.

  3. Sequential integers can inadvertently leak business metrics, such as total number of orders, invoices, user signups per day. For e.g., by creating a new user at time t to obtain a user ID, and another user at (t+24h) to get another user ID, someone can determine the number of users who signed up on a website during the last 24 hours due to the sequential nature of the IDs.

UUIDs

Advantages

  1. They can be uniquely generated in the client browsers, server-side applications, databases (even in sharded or multi-leader setups), and so on, without requiring a central authority - this makes them a favourable option for distributed and federated systems.

  2. UUIDs are globally unique - collisions are unlikely to occur.

  3. Malicious users cannot predict other UUIDs even if they obtain an existing UUID.

Disadvantages

  1. UUID is a 16 byte binary, which occupies twice the same space of a bigint value, and four times the space of an int.

    • When using UUID primary keys or unique IDs, it results in each index page containing fewer entries compared to a bigint column index. This may lead to less efficient searches with UUID keys as we may end up reading more index pages into the main memory.

    • When we have a narrow table (table with few columns), the index could consume a significant part of the total used storage when we use UUID as the primary key or a unique ID.

    • Although Postgres has a 16 byte UUID data type, support forUUIDs varies across databases. UUIDs stored as strings, such as a varchar, can worsen the issues as they are stored as 36-byte strings in that case (compared to 16 bytes in the uuid type).

  2. UUID v4 - the most commonly used version, generates UUIDs randomly.

    • Random UUIDs can cause inserts into random leaf index pages of a B-tree index, which could result in a higher number of index pages being read into the main memory, making inserts slower.

    • The random generation can lead to frequent index page splits in a B-tree index, as more inserts occur into index pages that are full. This can also cause frequent merges during deletes. Moreover, the frequent splits caused by random inserts also result in a low fill-factor of the index pages.

    • This also eliminates the opportunity for time and physical-location-based optimizations, such as efficiently caching the "hot" pages of the table that can be useful when a range of data is accessed more often than the "colder" parts of the table. This can be done efficiently with sequential integers as we may have to read only a few disk blocks that contain the required range of data. Secondly, the auto-incrementing nature of sequential integer IDs can also enhance CPU prefetching where the CPU anticipates and loads data into the faster memory ahead of time.

    • A higher number of disk pages need to be written to the WAL after a checkpoint due to the random inserts into the B-tree index pages (due to FPW).

    • Random UUIDs are not sortable in a meaningful way.

    • To address the issue of randomness, other UUID versions such as v1, v2, and v7 incorporate a time component, introducing an auto-incrementing characteristic to UUIDs similar to sequential integers.

  3. ULIDs are also sometimes considered as an option for unique IDs, however they are similar to time-sorted UUIDs. In particular, UUID v7 is the more recent addition.

  4. Unlike sequential integer IDs, UUIDs are not human-readable - sequential or not. Using UUIDs also makes API URLs longer and less readable.

TSIDs

What are TSIDs?

  • TSIDs are generated in a time-sortable order, similar to other time-sortable identifiers like ULID and UUIDv7.

  • TSIDs are stored as 64-bit(8 bytes) integers that are unique within a specific set of nodes, rather than being globally unique like the UUIDs.

  • Like UUIDs, TSIDs are generated using their own generation function, whereas database sequential integers rely on a per-table sequence stored in the database itself.

  • They can be stored asbigint in the database, and mapped as Long in the JPA entities.

  • TSID consists of two parts:

    • a 42-bit time component

    • a 22-bit random component

  • The TSID generation algorithm can optionally incorporate node IDs in the random part to guarantee uniqueness across multiple nodes, such as multiple databases or application servers.

  • TSIDs can be encoded as a 13-character string using Crockford base32 encoding, so they have better readability than UUIDs.

Advantages

  1. Since TSIDs are time-sortable, they possess an inherent natural ordering. This implies that the index space and performance benefits of sequential integers apply to TSIDs as well.

  2. The random component makes it difficult to predict other TSIDs based on one, which is unlike sequential integers, and similar to UUIDs. It also becomes practically impossible to determine the number of TSIDs generated during any time period, preventing any leakage of business metrics.

  3. Stored as bigint in databases, they can directly replace a sequential integer column, unlike UUIDs, which are entirely incompatible with integers.

Disadvantages

  1. Allocating the entire 42 bits for the timestamp part allows timestamps to span 140 years from a given starting epoch. Allocating 41 bits provides epochs covering 70 years. While this might not be a concern for today's development, it's still noteworthy.

  2. Considering the structure of the TSID, TSIDs generated within the same millisecond may be out of order depending on the random part of the generated TSIDs. For e.g., if a node receives a request to generated two TSIDs at the same millisecond, then the random part generated for the first request may be lower than that of the second request. In the cases of multi-node generation, clock drift between machines can also cause out-of-order TSID generation. However, this issue applies to time-sortable UUIDs as well.

  3. A mechanism is required to maintain node or machine ID when generating TSIDs across multiple nodes or machines. Moreover, using node ID reduces the number of bits available in the random part of a TSID, increasing the likelihood of a collision.

    • While the probability of collision remains extremely small, there is a potential risk in high throughput scenarios where a single node must generate at least a million TSIDs per second. This could also occur when large blocks of TSIDs are generated before they are actually needed.

    • Despite the low risk of collision, it should still be considered so that it can be managed when required.

Conclusion

When choosing between sequential integers and UUIDs, both of which are quite popular, or TSIDs for primary keys/unique IDs, it's important to consider the specific requirements and constraints of your application. Evaluation of the trade-offs should be done for specific use cases to avoid using a solution that may be excessive or unsuitable for your needs.

References

  1. https://vladmihalcea.com/uuid-database-primary-key

  2. https://www.foxhound.systems/blog/time-sorted-unique-identifiers

  3. https://www.linkedin.com/pulse/should-i-use-uuid-ulid-unique-ids-systems-pablo-puma-ihdoc

  4. https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

  5. https://www.callicoder.com/distributed-unique-id-sequence-number-generator

Did you find this article valuable?

Support Learning Backend by becoming a sponsor. Any amount is appreciated!