Learning DynamoDB the hard way

Gennadii Aleksandrov · Published in AWS Tip · 12 min read · Oct 10, 2023

DynamoDB is famous. It is one of the most popular services in AWS, with many impressive use cases and customers. While serving Amazon Prime Day 2023, it withstood 126 million requests per second! Yet I hadn't had a chance to use it, until now.

Below is the story of how I applied DynamoDB to my use case. I could have avoided some of the wrong decisions early on if I had read all the docs upfront. But that's no fun. And learning while practicing actually lets you remember things better. It was kind of a rollercoaster, but I ended up with a solid design proven by a PoC implementation. I think this post will be most helpful to those who are just starting their DynamoDB journey (like I was).

The goal of this post is not merely to share the result, but to walk the reader through my thought process, gathering lessons learnt along the way. For the most impatient, there is a TL;DR summary at the bottom.

DynamoDB is known for its single-table design

Problem statement and design goals

The task I had at hand is the following. We have a web portal and a mobile app. When the user logs in, they see a dashboard with their data. The dashboard includes a list of notifications, or alerts.

Now imagine an API that serves such a list for a given user:

Request:
GET /accounts/{accountId}/notifications

Response:
{
  "notifications": [
    {
      "id": "string",
      "timestamp": "string",
      "type": "string", // enum
      // other fields
    },
    // other notification objects
  ]
}

A notification has an id, a timestamp, some other attributes, and a type. The type is an enum; basically, this field carries the business meaning of the notification.

The API only serves so-called active notifications (no history). There can't be more than one active notification of a given type at the same time. A notification can be dismissed by the user, or it can be removed from the list automatically when a certain event happens. For example, if you have a "LowBalance" type of notification and later top up your account, the notification gets removed from the list.

The notifications come to the API via multiple SNS topics. Business rules are attached to each notification type: whether we should add such a notification to the list, whether we should remove any other notifications upon receipt (as in the example above), and whether we should send the notification to the user via their preferred communication channel. The latter part is not shown, as our focus here is the database:

Part of the software design was choosing the database technology, which is essentially how this post came to be.

Although the API doesn't serve historical data, we still want to store it to run audits and to help support agents figure out what was sent to the user.

The expected number of users is in the millions, and the API will be called by the frontend systems (the mobile app and the web portal), so one of the design goals was to minimise latency. Another goal was to support setting a retention period for historical data.

RDBMS option

I wasn’t the only decision maker, so I had to produce multiple design options and present them to the architecture board. Since one of the tech standards in the organisation was Amazon Aurora Postgres, I produced the design option using this RDBMS.

Relational database design for this use case is quite straightforward. Basically I came up with this table:

The idea is simple: we store all notifications in one table and manage an "active" boolean field to either include a notification in the response or not. Active notifications have NULLs in the "dismissedBy" and "dismissedAt" columns. If a user dismisses the notification, we put "user" and the corresponding timestamp there. If a notification is superseded, we put the id and timestamp of the superseding event into the "dismissedBy" and "dismissedAt" columns respectively.

It would use a partial index on the "active" column, in addition to the obvious indexes on "accountId" and "id".

This approach, being the most straightforward one, has some cons:

  1. We have to host and maintain a database instance.
  2. We have to write clean-up logic in the application, as we don't want to store notifications forever and Postgres (and every other RDBMS I know) doesn't have built-in expiration support.

In addition to this, we don’t really have any relations in the data — what we have is basically a list. All in all, RDBMS doesn’t look like the best fit here.

NoSQL and access patterns

So I immediately started looking into DynamoDB. Relational databases are designed around data normalisation and flexible querying. DynamoDB is a NoSQL database. One of the trade-offs it makes is access flexibility against performance: your queries have predictably low latency, but you have to design your table around a set of known access patterns.

I wrote down the access patterns I had:

  1. Add a notification to the list.
  2. Get all active notifications for the given user.
  3. Dismiss a notification by its id (used when the user dismisses a specific notification).
  4. Dismiss a notification by its type (used when a notification is dismissed automatically by another notification).
  5. Retrieve history of notifications for the given user.

It’s worth noting that we can’t have two active notifications of the same type for the given user.

Basic concepts

I hate to repeat what you can read in the official documentation, but it is worth it here as it helps make sense of the rest of the post.

Any DynamoDB table has a unique primary key. It can be simple (one attribute called the partition key, or PK) or composite (two attributes: partition key + sort key, a.k.a. hash key + range key, or PK and SK). As a convention, most of the time you name your attributes exactly "PK" and "SK" and don't specify what they mean, because different items can have different types of values in these attributes (this refers to an approach called single-table design, where you deliberately mix concepts in one table — it is not the focus of this article though).

You can have additional indexes, and there are two types of them: Global Secondary Index (GSI) and Local Secondary Index (LSI). With a GSI you basically choose different attributes as your PK and SK, while an LSI lets you index your table using the same PK but a different SK.
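To make the PK/SK and LSI concepts concrete, here is a minimal sketch of creating such a table with Python and boto3 (my actual application is in C#; the table name, the "ById" index and the "id" attribute are made up for illustration):

import boto3

client = boto3.client("dynamodb")

# Hypothetical table: composite primary key (PK + SK) plus an LSI
# that reuses the PK but sorts by a different attribute ("id").
# Note that LSIs can only be defined at table creation time.
client.create_table(
    TableName="notifications",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},   # partition key
        {"AttributeName": "SK", "KeyType": "RANGE"},  # sort key
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "ById",
            "KeySchema": [
                {"AttributeName": "PK", "KeyType": "HASH"},
                {"AttributeName": "id", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)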

By the way, there is a mistake in the DynamoDB documentation! At least there was when I read it. I reported it to AWS but haven't heard back. Here is what they say:

But it doesn't work like that. If you specify PK and SK exactly, you can't get more than one item, unless you query an index — but no indexes are used in that example. There are similar mistakes on the same page. Maybe the documentation was created using generative AI? Let me know in the comments if you think that's a reasonable guess.

It's also important to introduce some basic DynamoDB terms here: item and item collection. An item is, well, a record. All items are unique. An item collection is a set of items that have the same PK value but different SK values. The image below is taken from the official documentation and shows 8 items in 3 item collections.

Lesson learnt 1

My first approach to designing the table was very straightforward. Basically, it followed the RDBMS design above. We have an item collection per user with a PK like "ACCOUNT#12345", the SK holding the id, and a boolean "display" attribute on items. To get all active notifications for the user (access pattern 2) I can query the collection (i.e. all items with the same PK) and filter on "display".

But the thing is, this filtering is applied to the whole collection after it is fetched from the storage layer. Performance-wise, it's not the best option, and things get worse if the size of your item collection exceeds 1 MB, because you will get a paginated response (i.e. your client app will have to make multiple calls to DynamoDB to flip through all the response pages). Remember, one of my design goals was to minimise latency (and consequently the number of requests to the database). So, lesson learnt 1: filter expressions don't make use of indexes. Again, it's different from the SQL world, where your WHERE clause can use various indexes to get you what you need efficiently.
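To make the problem concrete, here is roughly what that first attempt looks like, sketched with Python and boto3 (the table and attribute names are illustrative). The filter expression is applied after the items are read, so you pay for reading the whole collection and still have to page through the results:

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("notifications")

def get_active_notifications(account_id: str) -> list[dict]:
    items, start_key = [], None
    while True:
        kwargs = {
            "KeyConditionExpression": Key("PK").eq(f"ACCOUNT#{account_id}"),
            # Filtering happens AFTER the read: the whole item collection
            # is fetched from storage and only then narrowed down.
            "FilterExpression": Attr("display").eq(True),
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")  # paginated responses
        if not start_key:
            return items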

It was to be expected that naïvely repeating the RDBMS design would lead me nowhere.

Lesson learnt 2

I realised I needed to move the sign of "activeness" from the "display" attribute into the PK or SK to leverage indexing. I thought I could use the SK for that. Its value would be the id (for uniqueness), and I would prefix active notifications with "on_" so I could Query all active notifications using a key condition expression of begins_with(SK, "on_"). When a notification needed to be dismissed, I would just strip "on_" out of the SK value.
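Sketched the same way (Python/boto3, illustrative names), the query for active notifications would have looked like this:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("notifications")

# Active notifications have SK values like "on_<id>"; dismissed ones lose the prefix.
resp = table.query(
    KeyConditionExpression=Key("PK").eq("ACCOUNT#12345")
    & Key("SK").begins_with("on_")
)
active = resp["Items"]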

And here came lesson 2: you can't modify the PK or SK, or effectively any attribute that is used as a key in the table or in one of its GSIs or LSIs.

And that makes total sense if you think about it, because PK and SK are not just data values. They are used to physically assign an item to a partition, and items are stored physically in SK order, so changing either of them would be very costly: basically the same as deleting one item and creating another, potentially on another partition. DynamoDB doesn't want to do that for us, to keep its low-latency promise.

Lesson learnt 3

Right, so I can't change the PK or SK. Then I had a bright idea: maintain active and historical notifications in two different item collections. The PKs would be "ACCOUNT#12345#ON" and "ACCOUNT#12345#HISTORY". This gives me the ability to get all active notifications for the user by simply retrieving the whole ON collection (I knew the number of active notifications can't be huge, so no pagination). Good.

Having two item collections also lets me use different kinds of values for the SK in each. I want to keep history in descending order, so the timestamp is a good SK value there. In contrast, the main access pattern for the ON collection is to "remove a notification of a certain type" (access pattern 4 from the list above). It makes sense to use the type as the SK for the ON collection, since I know that all active notifications have different types.

I also knew I wanted to expire items (using DynamoDB's TTL feature). So, if an item from the ON collection expires, I won't have a trace of it in HISTORY, which is an issue. The answer to that is to write to both item collections at once. Lesson learnt: always think denormalisation.
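To make the resulting shape of the data concrete, here is roughly what a pair of denormalised items could look like (attribute names and the retention period are illustrative; the ttl attribute assumes TTL has been enabled on the table):

import time

notification = {
    "id": "a1b2c3",
    "type": "LowBalance",
    "timestamp": "2023-10-01T12:00:00Z",
}

# ON collection: SK is the type, because there is at most one
# active notification per type for a given account.
on_item = {
    "PK": "ACCOUNT#12345#ON",
    "SK": notification["type"],
    **notification,
    "ttl": int(time.time()) + 90 * 24 * 3600,  # illustrative retention period
}

# HISTORY collection: SK is the timestamp, so history is sorted by time.
history_item = {
    "PK": "ACCOUNT#12345#HISTORY",
    "SK": notification["timestamp"],
    **notification,
}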

Lesson learnt 4

So far I have a table design that seems to support my main access pattern, number 4. Let's see how access pattern 3 is laid out with this design. Initially I thought I could have an LSI over id, so dismissing by id would be just a DeleteItem operation that makes use of this index. Right? Wrong! A GSI/LSI can only fetch items using a different indexing pattern; to modify or delete an item, you still have to know its PK and SK in the table itself. This was a bit of a surprise, but that's how it works. So, to dismiss a notification by its id, you need to Query the LSI first to get the item's PK and SK values in the table, and follow it with a DeleteItem that specifies the primary key. Lesson learnt: you can't modify/delete an item using a GSI/LSI primary key.

Lesson learnt 5

All right, it seems I'm getting there. Remember, I wanted to minimise the number of network calls, so I looked at batch operations and transactions in DynamoDB. It turns out DynamoDB transactions are very different from RDBMS transactions. It's clear from the docs that you can't mix read and write operations in one transaction, i.e. you can't use a read operation's output as the input for a write operation. Taking the example above, I can't Query the LSI to get the primary key of an item and delete it in the same transaction.

You can run whatever SQL statements you like inside an RDBMS transaction, and SQL is quite sophisticated. DynamoDB basically provides CRUD operations on items whose primary key you know (it is a key-value database! have to remember that), with some query layer on top.

I will show how I used transactions in the next section.

Conclusion

I arrived at this table design:

Below you can see how all my access patterns are catered for by this table design.

My goal was to optimise for latency, and the design addresses it by:

  1. Using DynamoDB, which provides low-latency operations (the docs promise "single-digit millisecond" latency).
  2. Fitting every access pattern into at most two API operations.
  3. Optimising the number of network calls by using transactions.
  4. Using TCP keep-alive in the AWS SDK for .NET (on by default, and yes, the application is written in C#) to reuse TCP connections.

Access pattern 1: Add a notification to the list

I insert two items in a transaction: one into the ON collection and another into the HISTORY collection.

TransactWriteItems(
  PutItem() // ON collection
  PutItem() // HISTORY collection
)
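As a rough boto3 sketch (illustrative names and values; the low-level client uses typed attribute values), the transaction could look like this:

import boto3

client = boto3.client("dynamodb")

client.transact_write_items(
    TransactItems=[
        # The same notification is written twice: once per item collection.
        {"Put": {
            "TableName": "notifications",
            "Item": {
                "PK": {"S": "ACCOUNT#12345#ON"},
                "SK": {"S": "LowBalance"},            # type
                "id": {"S": "a1b2c3"},
                "timestamp": {"S": "2023-10-01T12:00:00Z"},
            },
        }},
        {"Put": {
            "TableName": "notifications",
            "Item": {
                "PK": {"S": "ACCOUNT#12345#HISTORY"},
                "SK": {"S": "2023-10-01T12:00:00Z"},  # timestamp
                "id": {"S": "a1b2c3"},
                "type": {"S": "LowBalance"},
            },
        }},
    ]
)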

Access pattern 2: Get all active notifications for the given user

Just Query the ON collection for the account.
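For example (boto3 sketch, illustrative names):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("notifications")

resp = table.query(KeyConditionExpression=Key("PK").eq("ACCOUNT#12345#ON"))
active = resp["Items"]  # the collection is small, so a single page is expected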

Access pattern 3: Dismiss a notification by its ID (used when the user dismisses a specific notification)

As I mentioned above, I can't delete an item using the LSI; I can only fetch it through the LSI. So, to delete an item by id, I have to Query the index first (indexes don't support the GetItem operation) to get the primary key, and then use DeleteItem / UpdateItem operations.

// Query LSI using id, get partition key and timestamp
// to construct primary keys for both collections.
Query()
TransactWriteItems(
  DeleteItem() // from ON collection
  UpdateItem() // set dismissedBy = "user" in HISTORY item
)
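A rough boto3 sketch of the same flow, assuming the hypothetical "ById" LSI from earlier (all names and values are illustrative):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("notifications")
client = boto3.client("dynamodb")

# 1. Look up the item by id through the "ById" LSI
#    to recover the table's own key attributes.
resp = table.query(
    IndexName="ById",
    KeyConditionExpression=Key("PK").eq("ACCOUNT#12345#ON") & Key("id").eq("a1b2c3"),
)
item = resp["Items"][0]

# 2. Apply the changes to both collections atomically.
client.transact_write_items(
    TransactItems=[
        {"Delete": {
            "TableName": "notifications",
            "Key": {"PK": {"S": item["PK"]}, "SK": {"S": item["SK"]}},
        }},
        {"Update": {
            "TableName": "notifications",
            "Key": {"PK": {"S": "ACCOUNT#12345#HISTORY"},
                    "SK": {"S": item["timestamp"]}},
            "UpdateExpression": "SET dismissedBy = :by, dismissedAt = :at",
            "ExpressionAttributeValues": {":by": {"S": "user"},
                                          ":at": {"S": "2023-10-02T09:00:00Z"}},
        }},
    ]
)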

Access pattern 4: Dismiss a notification by its type (used when a notification is dismissed automatically by another notification)

The type will be known from the business rules, so I will know the primary key for the ON collection. But I still have to fetch the item to get the timestamp that I need to work with the HISTORY collection.

// eventType is provided, so we know the primary key for the ON collection.
GetItem() // retrieve the whole item to note eventTimestamp (HISTORY SK)
TransactWriteItems(
  DeleteItem() // from ON
  PutItem()    // new ON item
  PutItem()    // new HISTORY item
  UpdateItem() // set dismissedBy = "superseding eventId" in HISTORY
)

I could save one operation by issuing DeleteItem() instead of GetItem() in the first place and making note of the deleted item's attributes (DynamoDB supports returning them). But in that case I would risk ending up in an inconsistent state: the deletion succeeds but the transaction fails. It is better to wrap all write ops in a transaction. Also, it would still be two network calls.
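For completeness, a rough boto3 sketch of this flow (illustrative names and values; the two Put actions are elided):

import boto3

table = boto3.resource("dynamodb").Table("notifications")
client = boto3.client("dynamodb")

# The type is known from the business rules, so the ON primary key is known directly.
old = table.get_item(Key={"PK": "ACCOUNT#12345#ON", "SK": "LowBalance"}).get("Item")

if old:
    client.transact_write_items(
        TransactItems=[
            {"Delete": {"TableName": "notifications",
                        "Key": {"PK": {"S": "ACCOUNT#12345#ON"},
                                "SK": {"S": "LowBalance"}}}},
            # ... two Put actions for the new ON and HISTORY items go here ...
            {"Update": {"TableName": "notifications",
                        "Key": {"PK": {"S": "ACCOUNT#12345#HISTORY"},
                                "SK": {"S": old["timestamp"]}},
                        "UpdateExpression": "SET dismissedBy = :evt",
                        "ExpressionAttributeValues": {":evt": {"S": "evt-999"}}}},
        ]
    )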

Access pattern 5: Retrieve history of notifications for the given user

This is as easy as fetching the whole item collection. You can also compare timestamps using key condition expressions, for example to get all notifications sent between certain dates.

Query() // HISTORY collection with optional key condition expression on SK
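For example (boto3 sketch, illustrative names and dates):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("notifications")

resp = table.query(
    KeyConditionExpression=Key("PK").eq("ACCOUNT#12345#HISTORY")
    & Key("SK").between("2023-09-01T00:00:00Z", "2023-10-01T00:00:00Z"),
    ScanIndexForward=False,  # newest first
)
history = resp["Items"]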

So, this is it folks! I hope it was interesting for you to read. If you think the design can be improved — please leave a comment!

TL;DR / All Lessons Learnt

Here is the list of lessons I learnt along the way.

  1. A filter expression doesn't make use of indexes; it is applied after the result set is fetched from storage.
  2. You can’t modify an attribute that is used as PK or SK in the table or any of the additional indexes (both GSI and LSI).
  3. Always think denormalisation: sometimes writing the data twice in multiple shapes (and thus storing it twice) is better than transforming it on the go when it is requested.
  4. You can’t modify/delete an item using GSI/LSI primary key. You have to Query the index to fetch the item and construct the table’s primary key for modify/delete operation. You can’t GetItem from an index.
  5. You can’t mix read and write operations in one transaction, i.e. you can’t use read operation’s output as the input for a write operation.

You can see the table design I ended up with in the previous section.

Bonus design option

Another option I thought through but quickly discarded: what if I store all active notifications in one item? DynamoDB supports document types (maps and lists), and I knew I wouldn't have many active notifications at the same time, maybe 3–5 per account. That would work for active notifications, and I could store history as individual items. Maybe not a bad idea, but then you can't expire active notifications individually… and you have to parse and rewrite the document in the client to keep notifications in order… which is a bit ugly. But it still might be relevant for some other use cases.
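For the curious, appending to such a list could look roughly like this in boto3 (the key shape and attribute names are made up):

import boto3

table = boto3.resource("dynamodb").Table("notifications")

# Append a new notification to a list attribute on a single "active" item.
table.update_item(
    Key={"PK": "ACCOUNT#12345", "SK": "ACTIVE"},
    UpdateExpression=(
        "SET notifications = list_append(if_not_exists(notifications, :empty), :new)"
    ),
    ExpressionAttributeValues={
        ":empty": [],
        ":new": [{"id": "a1b2c3", "type": "LowBalance"}],
    },
)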
