What I’ve Learned From Using AWS DynamoDB in Production for More Than 3 Years

In this article, I will give you a brief guide on DynamoDB and hopefully help you determine if you need it in your projects

Borislav Stoilov

I worked on a project with a moderate load, and we used Dynamo as our data storage. Everything else we had was also in AWS and we used their managed services, so we didn’t deploy things like Kubernetes or Kafka. Our setup was purely AWS based.

This article is intended for people who already have knowledge of databases and some understanding of NoSQL. I won’t be covering the differences between SQL and DynamoDB.

DynamoDB main selling points

This is not in any way free promotion for Amazon, just a list of the things that, in my opinion, make this database stand out.

Fully managed

Fully managed means that the cloud provider is entirely responsible for the well-being of the service. You have zero maintenance cost and they promise you 99.99% uptime. If they break that agreement you are owed compensation. Beware that they don’t track their own availability for you: if there is an outage, you have to catch it with your own monitoring tools and then prove to them that they were down, and only then might you get something.

Having said that, working with fully managed services is very refreshing since you don’t have to deal with boring DevOps work and can focus only on the business needs. Also, I have had very bad experiences with internal (company-hired) DevOps teams, which are usually understaffed, so everything they do is very slow.

Easy to use (sort of)

It has a steep learning curve, but this is not specific to Dynamo. As someone with only SQL experience, I had to learn a lot of new things and solve problems differently. This will be true for anyone who has never used NoSQL and is just starting.

Dynamo is a key-value NoSQL database. The AWS team decided to make it very restrictive in the way you can define your data. This way it’s hard to get it wrong, but sometimes it can lead to other problems. For the most part, as someone who wasn’t sure what they were doing, this was a good thing.

Fast and VERY scalable

And I mean very. With the proper data setup, you won’t even scratch the surface of its maximum throughput. If the data is partitioned properly, you will get very good performance at a low cost. This isn’t trivial, of course, as we will discuss further in this article.

Some numbers from AWS

  • 20ms response time on average
  • 1,000 write capacity units and 3,000 read capacity units per partition per second (this is a lot; we will see how these are calculated shortly)
  • No limit on concurrent connections (this is both true and false: there is a limit, but you can’t reach it before first hitting the capacity unit limits)

Optimized for reading

Reading from it is so cheap that we could barely see the price on our charts. The expensive part is writes, and this really shows when you configure the table for multiple regions. Those are advanced use cases, though; most applications will run in a single region and use relatively few resources.

Very good integration with other AWS services

Do you want to dump your data into S3? That is 30 minutes of work and you can even make it a regular cron job.

Do you want data streaming? Dynamo has built-in Kinesis integration that will stream all data to any destination you choose.

Do you want to build a serverless application with Dynamo as a database? Again super easy. AWS lambdas can directly access dynamo tables with the right permissions.

Do you need a DB cache? DAX is a service that you just enable, and it caches hot items from the table. You have to pay extra for it, though.

Do you need your data replicated across regions? Just configure a global table.

It is a NoSQL DB

In the end, it is NoSQL and it brings all the pros and cons of those databases.

Internal Structure

The structure of any DynamoDB table is very simple compared to its SQL counterpart. It consists of several elements.

Attribute

Attributes are similar to SQL columns. They have a different name only because they have different properties, and we don’t want to get the two mixed up. You can put anything into an attribute at any time; no structure is imposed because the table is schemaless.

Dynamo supports the following data types for its attributes

  • Numeric (N)
  • String (S)
  • Boolean (BOOL), true or false
  • Binary (B)
  • Collections (Sets) — we have Binary (BS), String (SS), and Numeric (NS)

The attribute is kept as simple as possible on purpose and everything else is built on top of it. A single row of the table is composed of attributes and the row is referred to as an Item.
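
To make this concrete, here is a quick Python (boto3) sketch of writing one item with differently typed attributes, using the low-level client so the type tags are visible. The table name, the key attribute "Id", and all values are made up for illustration.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table whose partition key is the "Id" attribute
dynamodb.put_item(
    TableName="Books",
    Item={
        "Id": {"S": "book-123"},                 # String
        "YearPublished": {"N": "1935"},          # Numbers are sent as strings
        "InStock": {"BOOL": True},               # Boolean
        "CoverThumbnail": {"B": b"\x89PNG"},     # Binary
        "Tags": {"SS": ["classic", "africa"]},   # String Set
    },
)
```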

Partition Key (PK) aka Hash Key

When you create a new table, you have to specify which attribute will be the partition key; this is mandatory. The data type of this attribute is not enforced, but you should use String since it is the easiest to work with. After that, every item you put in the table has to have a value for the partition key.

The partition key is the most important choice you make when designing the table because it is used to partition the data. As with any hash key, when we look up an item we have to provide its partition key, and the engine does an efficient hash lookup. This is crucial for the performance of the database. But don’t worry, we will discuss how to choose partition keys properly later on.

Sort Key (SK) aka Range Key

Also when creating a table, you can specify which attribute will be the Sort Key. When you do, the partition/sort key pair becomes the primary key. This is similar to SQL’s composite primary key in that the pair has to be unique. This key should also be a String because it works best with the search patterns.

It exists so we can perform meaningful queries. When fetching data, the PK has to be provided as an exact match; the SK, however, can be partially provided, which lets us use queries like ‘startsWith’ that are very common.
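
Here is a minimal sketch of creating a table with a partition key and a sort key. I’m using Python (boto3) for brevity; the table name and the generic "PK"/"SK" attribute names are just illustrative.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Books",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},   # partition key
        {"AttributeName": "SK", "KeyType": "RANGE"},  # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity
)
```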

Capacity units

How much you pay is determined by usage. In Dynamo you pay for what you use (idle tables cost you nothing). The pricing term they chose is the cost per capacity unit. Capacity units are either read or write. Writing an item of up to 1KB costs one write capacity unit (WCU). A strongly consistent read of up to 4KB costs one read capacity unit (RCU); an eventually consistent read of the same size costs half of that.

You have to be really watchful of these. Some operations, like scans, can consume huge amounts of units. A few such operations might make Dynamo more expensive than SQL.
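
A quick back-of-the-envelope sketch of how the unit math works out, assuming the standard 1KB write / 4KB read sizes mentioned above:

```python
import math

def wcu_for_item(size_kb: float) -> int:
    # One WCU covers a write of up to 1 KB; larger items round up.
    return math.ceil(size_kb)

def rcu_for_item(size_kb: float, strongly_consistent: bool = False) -> float:
    # One RCU covers a strongly consistent read of up to 4 KB;
    # an eventually consistent read of the same size costs half.
    units = math.ceil(size_kb / 4)
    return units if strongly_consistent else units / 2

print(wcu_for_item(2.5))       # 3 WCUs to write a 2.5 KB item
print(rcu_for_item(10, True))  # 3 RCUs to strongly read a 10 KB item
```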

Capacity units can either be provisioned or on demand. This is specified upon resource creation and can be changed later on.

Provisioned is used when you have a steady load throughout the whole day: no spikes and no big differences from day to day. This can be cheaper than on-demand throughput, but you have to nail the numbers. If the throughput exceeds the provisioned amount, clients get throttled, so keep that in mind too.

On-demand is used when you don’t know what load to expect, or when your load is not evenly distributed. Typically you start with this one, and if you later discover that your load is predictable and steady, you switch to provisioned.
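
Switching between the two modes is just a table update. A hedged boto3 sketch (table name and numbers are illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Move to provisioned capacity once the load is predictable
dynamodb.update_table(
    TableName="Books",
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 25},
)

# ...or back to on-demand if the traffic turns out to be spiky
dynamodb.update_table(TableName="Books", BillingMode="PAY_PER_REQUEST")
```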

AWS Dynamo pricing

Queries

Dynamo supports the following query operations

  • equals
  • begins with (only for String)
  • between
  • less than
  • greater than
  • greater than or equal to
  • less than or equal to

All of these can be used only on sort keys (the PK always requires an exact match). You might be wondering how exactly the greater-than and less-than operations work for strings. The answer lies in the way partitions are organized.

First, data is split into large partitions (buckets) based on the PK. Then every partition is kept sorted by the SK. For strings this is a lexicographic (byte-order) sort, which means the comparison operators do lexicographical comparisons.

Queries do eventually consistent reads by default. This means that if there is an ongoing write operation while the query is running, the query might miss it. You can force strongly consistent reads, but they consume double the capacity units, so use them with caution.
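
A small query sketch in Python (boto3): an exact match on the partition key, ‘begins with’ on the sort key, and a strongly consistent read. All names and values are illustrative.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Books")

response = table.query(
    KeyConditionExpression=Key("PK").eq("SCHOOL_ID#12") & Key("SK").begins_with("AUTHOR#"),
    ConsistentRead=True,  # strongly consistent read; costs double the RCUs
)
items = response["Items"]
```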

Indexes

Dynamo supports two types of indexes, and they work very differently compared to their SQL counterparts.

Local Secondary Index (LSI)

These are specified during table creation; once created, they can’t be changed. An LSI shares the PK of the table and only lets you define a new sort key, which essentially gives you an extra search pattern. LSIs live in the same partitions of the same table, and they also consume capacity from the main table. This is especially important if your table is configured with provisioned throughput: too many LSIs might exhaust it. Data in these indexes is strongly consistent.

Global Secondary index (GSI)

These can be created or deleted at any time. Essentially they copy data into a separate sub-table with its own limits. GSIs have their own throughput, separate from the main table; it is even possible to have provisioned capacity for one and on-demand for the other. You can also specify both a new PK and a new SK, which makes them very flexible. However, the data in them is eventually consistent.

When GSIs copy data into their tables, you can tell them which attributes to copy, or you can tell them to copy everything. This is called the GSI projection. Every time you write to the table, all GSIs have to be updated, but Dynamo will update only those affected: if a GSI’s projection is not touched by the write, it is skipped. This way you can save write capacity units. Just keep in mind that when querying a GSI, you can only get the attributes that are part of its projection.
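
Here is a sketch of adding a GSI to an existing table with an INCLUDE projection, so only the listed attributes get copied into the index. Index, attribute, and table names are illustrative, and I assume the table uses on-demand capacity (otherwise the index also needs its own ProvisionedThroughput).

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="Books",
    AttributeDefinitions=[
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "GSI1",
                "KeySchema": [
                    {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                    {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
                ],
                # Copy only the keys plus the attributes listed here
                "Projection": {
                    "ProjectionType": "INCLUDE",
                    "NonKeyAttributes": ["Title"],
                },
            }
        }
    ],
)
```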

Comparison

  • LSI: defined at table creation and can’t be changed afterwards, reuses the table’s PK and adds a new SK, lives in the main table’s partitions and shares its throughput, strongly consistent
  • GSI: can be created or deleted at any time, can define both a new PK and a new SK, has its own throughput and projection, eventually consistent

Scan

The Scan is the equivalent of a full table scan in SQL. A Scan reads the entire table and can filter on any attribute, completely disregarding the restrictions mentioned above: it doesn’t matter whether the attribute is part of an index or whether it is the PK/SK. You still pay for everything the scan reads, so it is to be avoided at all costs.

You can run native parallel scans on a Dynamo table, which makes them faster, but they will still consume the same amount of units. Like queries, they can be eventually consistent (the default) or strongly consistent.

The only time Scan makes sense is for small tables that you want to load fully from time to time. If you have to load everything anyway, it won’t matter whether you use scans or queries, so you might as well not bother with the complications that come with queries.
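
A sketch of the parallel scan mentioned above: each worker scans one segment of the table. In real code the segments would run in separate threads or processes; table name and segment count are illustrative.

```python
import boto3

table = boto3.resource("dynamodb").Table("Books")

def scan_segment(segment: int, total_segments: int) -> list:
    items, start_key = [], None
    while True:
        kwargs = {"Segment": segment, "TotalSegments": total_segments}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items

# Sequential here for simplicity; run these concurrently in practice
all_items = [item for seg in range(4) for item in scan_segment(seg, 4)]
```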

Single Table Design

In SQL we typically create a new table for every entity in our domain. That works because SQL has data integrity constraints and foreign keys; in the NoSQL world this isn’t the case. You might be tempted to do the same thing in Dynamo and create a new table for every entity, but you will discover that configuring this is a nightmare, and you also have to manage the throughput of every table, so it might even end up more expensive. The funny part is that even if you do it perfectly, you will still be missing out on the features that single table design provides.

How do you put everything into a single table? The simple answer: key and index overloading. If we name our key attributes something generic like ‘PK’, we can reuse the same attribute for multiple entity types and handle the differences in our code. This sounds very counterintuitive at first, but this is the way Dynamo is optimized to work. Also, renaming attributes, and data migrations in general, is a complete nightmare, so keeping your design as generic as possible will save you a lot of headaches later on. Trust me on this one.
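
A tiny illustration of key overloading: two different entity types sharing the same generic PK/SK attributes in one table. The values are made up and anticipate the book example later in the article.

```python
# A "school" item and a "book" item living side by side in one table
school_item = {
    "PK": "SCHOOL_ID#12",
    "SK": "METADATA",
    "Name": "Springfield High",
}
book_item = {
    "PK": "SCHOOL_ID#12",
    "SK": "AUTHOR#hemingway#green-hills-of-africa",
    "BookData": {"title": "Green Hills of Africa", "year": 1935},
}
# A single Query(PK = "SCHOOL_ID#12") now returns the school and all of its books at once
```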

Transactions and batches

One of the more recent additions to Dynamo is transaction support. Transactions act very similarly to SQL transactions, giving us ACID guarantees; however, they come with a few major limitations

  • Max 25 items per transaction
  • Consume 2x capacity units
  • Can have only one operation per element in the transaction

Those are some harsh limits if you think about it, especially the first one. They force you to split your operations into smaller chunks, figure out which ones are idempotent, and remove those from the transaction. Also, overusing transactions will make your bill go up, so my advice is to avoid them if you can. Don’t build your logic around transactions; use them only if there is no other choice.

Batches, on the other hand, are a very simple concept. Like transactions, they can contain no more than 25 items, but they simply return a list of operations that failed to complete. They are a more efficient way of writing items in bulk and can save some capacity units, but having to deal with their limitations might not always be worth it.
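
A small sketch of both: one all-or-nothing transaction and one best-effort batch. Table name, keys, and values are illustrative; the resource-level batch_writer conveniently handles the 25-item chunking and retries unprocessed items for you.

```python
import boto3

client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table("Books")

# Transaction: both operations succeed or neither does (max 25 items, 2x capacity)
client.transact_write_items(
    TransactItems=[
        {"Put": {"TableName": "Books", "Item": {"PK": {"S": "A"}, "SK": {"S": "1"}}}},
        {"Delete": {"TableName": "Books", "Key": {"PK": {"S": "B"}, "SK": {"S": "2"}}}},
    ]
)

# Batch: bulk writes, no atomicity guarantees
with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(Item={"PK": f"SCHOOL_ID#{i}", "SK": "METADATA"})
```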

Conditional writes and atomic operations

One of the best ways to avoid transactions in Dynamo is to use the database’s built-in optimistic locking mechanism. A conditional write/update performs the operation only if a certain condition is met. For example, you can introduce a version attribute and increment it every time someone updates the item. Then, when performing any operation, you do it conditionally based on the last version you know; if someone changes the version while your procedure is running, the DB call will fail and you can retry the whole thing. This will solve 99% of your concurrent write problems and thus eliminate the need for transactions.

There are a few atomic operations in Dynamo; one of them, for example, is the atomic counter. Essentially you have the ‘+=’ operator built into the engine: you can increment a numeric attribute without caring about its previous value. I haven’t used these since they are not supported in the Java SDK; just know that you have this tool in the toolkit when making decisions.
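
A hedged Python sketch of both ideas: an optimistic-locking update guarded by a version attribute, and an atomic counter via the ADD update expression. Table, key, and attribute names are illustrative.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Books")

def update_title(pk: str, sk: str, new_title: str, known_version: int) -> bool:
    try:
        table.update_item(
            Key={"PK": pk, "SK": sk},
            UpdateExpression="SET Title = :t, Version = :new",
            ConditionExpression="Version = :known",  # fail if someone else updated first
            ExpressionAttributeValues={
                ":t": new_title,
                ":new": known_version + 1,
                ":known": known_version,
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # re-read the item and retry the whole procedure
        raise

# Atomic counter: ADD increments in place without reading the old value first
table.update_item(
    Key={"PK": "SCHOOL_ID#12", "SK": "METADATA"},
    UpdateExpression="ADD BookCount :one",
    ExpressionAttributeValues={":one": 1},
)
```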

Limits

Dynamo provides crazy response times, and to keep them it has to impose some limits. Some of these limits I’ve already mentioned above, but it’s worth having them in a consolidated list.

Item size limit

A single item (or row/record) in the table must not exceed 400KB. This is very small; for comparison, the limit in MongoDB is 16MB. The small size limit is imposed because Dynamo is optimized for many small operations.

In case your item exceeds the size limit you can

  • compress the data into a large binary attribute (a small sketch follows below)
  • store the data in S3 and have a link to it in the item (this is for huge items)
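
A minimal sketch of the first option: gzip a large payload into a single binary attribute to stay under the 400KB limit. Names are illustrative; for really big payloads you would instead store the blob in S3 and keep only its key in the item.

```python
import gzip
import json
import boto3

table = boto3.resource("dynamodb").Table("Books")

payload = {"description": "..." * 100_000}
compressed = gzip.compress(json.dumps(payload).encode("utf-8"))

table.put_item(
    Item={
        "PK": "SCHOOL_ID#12",
        "SK": "AUTHOR#hemingway#green-hills-of-africa",
        "BookDataGz": compressed,  # stored as a Binary (B) attribute
    }
)
```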

Automatic Paging

When the response of any operation exceeds 1MB, the data is paged. This was the first time I had seen responses paged by size rather than by item count. Both Scan and Query can have their data paged. They return a ‘bookmark’ (the LastEvaluatedKey), which you pass back as a parameter (ExclusiveStartKey) to fetch the next page.
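
A short sketch of following that bookmark with boto3 (table and key values are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Books")

items, start_key = [], None
while True:
    kwargs = {"KeyConditionExpression": Key("PK").eq("SCHOOL_ID#12")}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key  # resume from the previous page
    page = table.query(**kwargs)
    items.extend(page["Items"])
    start_key = page.get("LastEvaluatedKey")
    if not start_key:
        break  # no bookmark means this was the last page
```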

Hard upper limits on partitions

You can’t have more than 10GB of data in a single partition. As mentioned, partitions are defined by the partition key, which further increases the importance of the PK choice.

Furthermore, there is a maximum throughput a single partition can handle. If your load is unevenly distributed across partitions, you might find that some particular clients are throttled while others are not. These problems are hard to find and debug.

Summary of the API limits

  • Max 5 LSI
  • Max 20 GSI
  • Max 25 items per Batch/Transaction

Read about limits from the source

Building an imaginary use case

Dynamo is best explained with examples, so we will create an example business use case and try to design our table around it.

We are asked to build software for bookkeeping. It will be used by all schools in the country, so we expect a huge load. Everyone should be able to submit a book and provide some meta information that includes the author, year published, genre, and title. Schools are interested only in their own books for the moment.

A few of the requirements the system should cover

  • Search by author
  • Search by author and title
  • Search by genre, results should be ordered by year published

What is our thought process? Let’s start by thinking in terms of partitions. Our system will be centralized, meaning ALL schools will store data in the same database. Furthermore, we will use the single table design described before. To ensure a somewhat even distribution of the data, we can use the school as the PK: every school has its own ID, and all books coming from that school will be stored in its partition.

So far so good. Now how are we going to search by author, without using scans of course? For that we will utilize the sort key: if our SK starts with the author, we will be able to do a startsWith query.

OK, but what about the author + title search? You might have already guessed that we can include the title in the SK and do two kinds of ‘beginsWith’ searches: author only, and author + title.

For the third one we need some sort of comparison query, and since we are working with strings this is not trivial. Also, we can’t include the genre in our already occupied SK. Our only option here is to use an index. Since I don’t need strongly consistent reads here, I will use a GSI. Our new GSI will have the genre as its PK and the year published as its SK. Now it will be easy to fetch all books for a given genre, but what about the ordering, you might ask? If you were paying attention at the beginning, I mentioned that data within a partition is kept sorted by the sort key. This means our data is already sorted by the SK, and when we fetch it we don’t need to do anything else.

The table

Notice the names I’ve used. They have absolutely no meaning related to the data they point to.

  • PK and SK attributes that will be the partition and sort key
  • GSI1PK and GSI1SK attributes will be the PK and SK for the GSI. I number them so I can easily add more in the future
  • Book Data -> We will store all of the book data as JSON

This is what the data will look like

Notice how I prefix everything. It is never a good idea to use a raw value without a prefix: prefixes like ‘SCHOOL_ID’ and ‘AUTHOR’ avoid collisions with other entities in the table. Also, you have to use some sort of delimiter when listing multiple values in a single cell. I am using ‘#’ because it is one of the first symbols in the ASCII table, right after ‘!’ and ‘“’, which are special symbols in Dynamo. This way I am sure it will not affect the sort order even if it is the last symbol in the string.

Search by author

Query(Main Table) begins with (PK: SCHOOL_ID#12, SK: AUTHOR#hemingway)

Search by author and title

Query(Main Table) begins with (PK: SCHOOL_ID#13, SK: AUTHOR#hemingway#green-hills-of-africa)

Search by genre

Query(GSI1) Equals (PK: horror)
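
For completeness, here is roughly how those three searches look with boto3. Table and index names follow the example design above (PK/SK, GSI1); it’s a sketch, not production code.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Books")

# Search by author
by_author = table.query(
    KeyConditionExpression=Key("PK").eq("SCHOOL_ID#12")
    & Key("SK").begins_with("AUTHOR#hemingway")
)

# Search by author and title
by_author_and_title = table.query(
    KeyConditionExpression=Key("PK").eq("SCHOOL_ID#13")
    & Key("SK").begins_with("AUTHOR#hemingway#green-hills-of-africa")
)

# Search by genre, already ordered by year published (the GSI sort key)
by_genre = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("horror"),
)
```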

This is just an illustrative example but there are a few things to note.

  • we have to reuse GSIs as much as possible (GSI overloading)
  • putting multiple types of entities in the same table allows us to select them all at once (similar to join)
  • try fitting multiple search patterns into a single key

Auto clean-up of old elements (TTL)

Dynamo has built-in TTL (time to live) support. You can specify which attribute will be used as the TTL marker, and an internal job will then clean up expired items automatically, for free. The TTL value is a Unix epoch timestamp in seconds.

Utilize this as much as possible. If something doesn’t need to be deleted right away just set its TTL to 1 second in the future and forget about it. This can save you huge amounts of capacity units depending on what you are building.

Use case from our system. We have to store an event log that tracks the IDs of the events that are already processed. This is needed so something doesn’t get processed twice. We know that 3 days is enough to keep these processed IDs so we set TTL to 3 days in the future.
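
A minimal sketch of that setup: enable TTL on an attribute and write an item that expires in 3 days. Table and attribute names are made up.

```python
import time
import boto3

client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table("ProcessedEvents")

# One-time setup: tell Dynamo which attribute holds the expiry timestamp
client.update_time_to_live(
    TableName="ProcessedEvents",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ExpiresAt"},
)

table.put_item(
    Item={
        "PK": "EVENT_ID#abc-123",
        "ExpiresAt": int(time.time()) + 3 * 24 * 3600,  # epoch seconds, 3 days out
    }
)
```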

Data streams

A very powerful and useful feature of DynamoDB is that it can emit events for everything that happens inside it.

The typical setup is to create a lambda and subscribe it to the Dynamo stream. This lambda can then handle and process the events and push them into queues to be processed by other services.
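
A rough sketch of such a handler: it inspects each stream record and forwards new items to a queue. The queue URL is made up, and a real setup also needs the event source mapping between the table’s stream and the lambda.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/book-events"  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        # eventName is INSERT, MODIFY or REMOVE; item images live under record["dynamodb"]
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"].get("NewImage", {})
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(new_image))
```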

You can do very powerful things with this setup

Transactional behavior

Again. I know... If you need to make sure that an operation composed of two tasks succeeds as a whole, for example saving an item in the DB and then sending some HTTP call, there is no easy way to do it. The DB save can pass but the HTTP call might fail; if your DB save is not idempotent, you will have to write ugly ifs to check whether the same operation hasn’t already been completed before retrying.

To avoid this, we can save two items in the DB (in a transaction): one is the item from above, and the other is just a trigger that will emit an event via the Dynamo stream. Then we can subscribe to that event and perform the HTTP call separately. This has two benefits: we achieve the transactional behavior, and we decouple our logic.

Send updates to other services

We can make the dynamo lambda push events to as many queues as we like. This is an easy way to implement an event-driven system.

Data migrations

NoSQL data migrations are tough because we don’t have proper transactions and schemas. One of the most commonly recommended approaches is to use a temporary lambda

  • Create a New Table V2
  • Create a Temp Lambda
  • Subscribe the lambda to the original Table V1 stream
  • Make a small change to every item in Table V1 (for example, set migrate = true)
  • Now the stream will send the full table to the temp lambda
  • Temp lambda can save everything in Table V2 with some data transformation logic
  • Switch service to Table V2
  • Delete temp lambda
  • Delete Table V1

This sounds very complicated (and it is), but it is one of the few solutions we found that has zero chance of placing your application in an invalid state or corrupting production data with no recovery options.
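
A very rough sketch of the temporary migration lambda described above: it receives Table V1 stream records and writes transformed items into Table V2. Table names and the transformation are entirely illustrative.

```python
import boto3
from boto3.dynamodb.types import TypeDeserializer

dynamodb = boto3.resource("dynamodb")
v2_table = dynamodb.Table("Books-V2")   # hypothetical new table
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue
        # NewImage arrives in the low-level {"S": ...} format; deserialize it first
        raw = record["dynamodb"]["NewImage"]
        item = {k: deserializer.deserialize(v) for k, v in raw.items()}
        v2_table.put_item(Item=transform_to_v2(item))

def transform_to_v2(old_item: dict) -> dict:
    # hypothetical transformation logic for the new schema
    return {"PK": old_item["PK"], "SK": old_item["SK"]}
```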

Old but nice discussion about data migrations in Dynamo

Monitoring

CloudWatch. Not much to say here. I deleted everything I had in this section and will only link you to the AWS docs, since they did a good job of explaining what you should monitor and how.

Final thoughts

Dynamo is a great DB. At first I missed good old SQL with its easy-to-use transactions and migrations. But then I had to do big data with SQL, and I hated every second of it. Did you know that SQL indexes can stop being effective past a certain data threshold? I didn’t either; I found out about it in production, and it was one of the worst days of my life :D.

NoSQL is THE technology of choice when dealing with large amounts of data. If you are already using AWS, there is little point in using any other NoSQL solution. I know people who host MongoDB in AWS, and I still don’t know why.

DynamoDB is not cheaper than SQL, it is more scalable. Don’t go with it to save money, go with it so you can scale indefinitely. Your problems should require it.

If you are just starting with Dynamo, there is a great book by Alex DeBrie. It is a must-read before you begin. It also describes all of the things I listed in this article in more detail, and better than me :)

I’ve put a lot of information into this post; I hope it wasn’t too much. Anyway, as usual, I hope it was helpful.
