Lets talk about Pagination

6 August 2019 — 9 min

Just recently, I have had to tackle some quite challenging problems related to pagination. And during some discussions I noticed that some concepts are not yet clear to some engineers. So I will try to explain all of these.

I will focus on both page-based, as well as cursor-based pagination, and explain how both methods work. I will also focus on some common questions an API consumer might have, and how easy it is to implement, with a special look at SQL. And lastly, I will look at how these pagination methods behave when the data itself changes in certain ways.

# Prerequisites

To make any kind of pagination useful at all, we need to have a sorted list of entities with some kind of stable sorting order. We do not want the entities to randomly re-order, unless we explicitly change entities inside of that list.

Lets try to visualize such list, so behold my awesome ascii box drawing skills:

E = Entity, K: Sort Key, W: Pagination Window, C: Cursor
   ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
E: │ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
   ├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
K: │ 2│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
   └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
          └ W ┘  ↑
                C:9

So here we have 11 Entities, sorted by their Sort Key. And we a view into that list, which I call the Pagination Window. I will also talk about a left-to-right traversal order, so anything left in the drawing can also be called before or front, and likewise after or end for the things on the right. Also, to make things easier, I will assume a Pagination Window of 2 elements. And we assume that we are walking this list from front to back.

We also have some meta information that we want to know about, as stated above.

Are there any items before the pagination window, and how many?
Are there any items after the pagination window, and how many?
Can I jump to an absolute or relative position in the list? Absolute here means to the item at position N, and relative means to the item with key K.

I will also consider a few mutations to the list, such as:

Adding a new entity in front of the pagination window
Removing an entity before the pagination window
Moving an entity

As we are walking the list front to back, adding/removing new entities to the back of the list are neither observable, nor do we really care about them. Unless its a move.

# Page-Based Pagination

Implementing this kind of pagination is very simple, you just split the list into equally-sized slices called pages. Obviously, the pagination window is the page size. Depending on the implementation, you might also chose to allow an arbitrary offset, but for simplicity we will not.

┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
 └ 1 ┘ └ 2 ┘ └ 3 ┘ └ 4 ┘ └ 5 ┘ └ 6 ┘

Answering how many items are before the pagination window is trivial: It is exactly (p - 1) * w where p is the page number and w is the page size.
Answering how many items are after the pagination window is trickier, as we need the total number of items, N: N - (p * w).
There is one neat shortcut if you only care about if there are any items after the pagination window. Just select w + 1 items, and if the resulting slice has more than w items, you know there are items following the pagination window. Just remove that superfluous item and you are done.
Jumping to an absolute position in the list is trivial, it is on page floor(p / w) + 1, assuming zero-indexed position p.
Jumping to a relative position is non-trivial, it basically requires you to do a binary search, usually in O(log n) time.

# Changing Data

One problem with page-based pagination is how it reacts to changes in the data.

Lets look at insertion and deletion before the pagination window.

┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
             └ W ┘
  ↓ inserted here
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ X│ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 1│ 2│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
                   └ W ┘

As you can see, we moved one page right, but the whole list was shifted by one position, which means we see F twice. Not good, not terrible.

┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
  ↑          └ W ┘
  └ removed this
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
                   └ W ┘

Again, the whole list was shifted, but in this case we have a more severe problem, since element G was skipped completely.

# Implementation

Implementing page-based pagination is really trivial, both in memory, as well as in SQL, as you have the dedicated syntax LIMIT / OFFSET exactly for this use case. However, jumping to a relative position is complex.

# Cursor-Based Pagination

With cursor based pagination, you have a Cursor that points to a position on the sort axis. With cursors, you can usually select items before and after the cursor, since we focus on left to right traversal, we will only consider items after the cursor.

Navigating using cursors is possible when in addition to the item itself, we return the cursor of the item as well. In the next step, we set our new cursor to the cursor of the last item, and query again.

One popular example of cursor-based pagination is the relay connections specification which is popular when dealing with graphql.

# Unique Cursors

One important restriction cursor based pagination has is that the cursor itself needs to be unique! Why? Lets demonstrate with an example.

┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
           ↑ └ W ┘
          C:7
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 3│ 5│ 7│ 9│10│10│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
                 ↑  ↑ └ W ┘
                 C:10

Here, first we were at cursor 7, and selected 2 items after it. No problem so far. The last item F has the cursor 10. Lets move one step further, selecting 2 items after F, using its cursor 10. Now we have the problem that we skipped G, because its cursor is duplicated to the one of F.

So depending on your use case, if the property you want to sort by is not unique, such as a typical modification time, you have to combine it with another property that is, such as uuid.

# Getting at metadata

Answering some meta questions becomes a lot more complicated with cursor based pagination.

Knowing how many items are before or after the pagination window requires you to actually count them.
Depending on how the cursor itself is implemented, it might be easier to count all the items and use math: before = all - after - w.
The same shortcut to know if there are any items after the pagination window also applies to cursor-based pagination. Just select w + 1 elements, and if the result set has more than w, there are items after the pagination window.
I also found a shortcut to know if there are any items before the cursor. Here we rely on the fact that the cursor itself needs to be unique. So instead of selecting only items following the cursor, I use an inclusive selection which might also return the item that exactly matches the cursor. If it does, I know there are items before the cursor. Also make sure to remove this superfluous item. If there is no item matching the cursor exactly, you have to fall back to counting however.
Jumping to a relative position in the list is essentially the same as normal cursor based navigation. Your cursor just happens to be the relative position you want to jump to.
Jumping to an absolute position however is more complex. I think it can be done with a binary search, but each iteration would need to answer the question what is my absolute position, which itself is not a trivial question as shown above.

# Changing Data

As the cursor denotes a relative position inside the list, it is immune to changes in the data. Even removing the element under the cursor is not a problem.

┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ A│ B│ C│ D│ E│ F│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 3│ 5│ 7│ 9│10│12│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
             └ W ┘
                 ↑ removed this
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ A│ B│ C│ D│ E│ G│ H│ I│ J│ K│
├──┼──┼──┼──┼──┼──┼──┼──┼──┼──┤
│ 2│ 3│ 5│ 7│ 9│12│15│20│28│99│
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
               ↑└ W ┘
              C:10

Since the cursor of F is 10, we can still query everything after 10.

# Implementation

Since cursors themselves give a relative position inside the list, implementing this in-memory requires doing a binary search.

Implementation in SQL is also quite simple, at least for a simple non-compound cursor. You just apply a simple WHERE condition. I have found this stack overflow post to be a really good in explaining how to deal with compound cursors.

In general, after implementing cursor based pagination, together with the optimizations explained above, I can say it is a lot of work, with a lot of details you have to pay attention to.

# Conclusion and Recommendations

Lets repeat some important things.

First of all, both approaches require you to have a stable sort order. Furthermore, cursor-based pagination additionally needs to have a unique sort key for each entity, that you also have to expose.

So for this reason, good performance requires that you either have the data pre-sorted, or have some kind of index, especially in the case of SQL.

This also means that sorting by some computed property or one that is not indexed can result in bad performance.

Cursor-based pagination makes relative jumps easier, while page-based pagination makes absolute jumps easier.

Both methods make it easy to implement the typical previous / next navigation. But cursor-based navigation makes it more resilient to changing data.

Like always, it is a matter of tradeoffs and use-cases.

Do you want to support absolute jumps, or page links, go with page-based.
Do you want to support relative jumps, go cursor-based.
Do you want to have more consistent data when having typical previous / next, buttons, use cursor-based pagination.
Do you want something that is both easy to use and to implement? And also possibly more performant? Go page-based.

And also summarize the problems around changing data:

Considering left-to-right traversal, both are fine with appending data.
Removing data can lead to skipped items when using page-based pagination.
Prepending data can lead to double processing of items when using page-based pagination.

Considering these limitations, when you are dealing with prepend only data, starting at the front again once you finished traversing should make sure you get all the data.

After implementing both page-based and cursor-based pagination, and now that I have written down all my thoughts about it, I do feel a bit more comfortable with the choice we made for cursor-based pagination. But for one reason only: the data we return depends on the current time, so it can have unpredictable data removal, which can lead to missed items when using page-based pagination as shown above.

Otherwise I would probably go with page-based pagination, for reasons of performance and ease of implementation. Which also reminds me to rant about the fact that cursor-based pagination is really complicated to implement using SQL.

One thing we still struggle with is performance. Not because of the pagination method we chose but rather that our sort order depends on a non-indexed compound value. Meh. So we come full circle to the matter of caching and cache invalidation :-)