You can pay now or pay later, but either way you pay; nothing is free. Schemas make you pay upfront, and at times that hurts. Schemaless is more like paying on credit: it starts out effortless until one day you wake up to a pile of debt. There is no absolute right or wrong here, but understanding the costs and tradeoffs is critical to making good decisions.
What is a schema?
We mostly think about schemas in relation to data models, much like database schemas. Here a schema is a blueprint for tables and fields, their expected data formats, and how real-world data maps onto them.
In contrast, schemaless refers to the absence of this blueprint: there are no specified fields, data formats and so on.
Optimizing raw data storage
The difference from a user perspective is pretty obvious, but it’s nice to understand what the difference actually means behind the scenes. Let’s look at a quick example.
25 is a number, which may be the integer 25 or a floating point number such as 25.000, but it could also be “25” as a two-character string. From a storage perspective the difference is interesting:
- 25 – integer can be represented using 8, 16, 32, or 64 bits. If this field only needs to hold a number between 0 and 127 (or 0 and 255 when unsigned), an 8 bit integer is enough, which uses 1 byte. So a million of these fit in about 1MB without any compression.
- 25 – float can be represented using a 32 or 64 bit floating point number. A 32 bit float uses 4 bytes, so a million of these would require ~4MB without any compression. Floats can represent a large range of numbers but trade off some accuracy (they are approximations).
- 25 – string is where it gets interesting. Technically it can be represented using 2 bytes, assuming the chars “2” and “5” use 1 byte each. However, in languages like Java a char is 2 bytes and each string carries some object overhead, so this string actually takes much more than 4 bytes. The exact figure varies by language, but it will almost always be larger than a numeric format.
So we have shown 3 different ways to store this information (there are actually many more), each with different memory and disk space overheads. There is more nuance still: each representation also has its own encoding and decoding overhead, and its own cost when evaluated in a query.
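To make the storage comparison concrete, here is a minimal Go sketch that measures only the raw payload of the value 25 under each representation. The function name `payloadSizes` is made up for this illustration, and real systems add headers, padding or compression on top of these figures.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// payloadSizes returns the raw bytes needed to hold the value 25 as an
// 8 bit integer, a 32 bit float, and a two-character string.
// Headers, padding and compression are deliberately ignored.
func payloadSizes() (asInt8, asFloat32, asString int) {
	asInt8 = binary.Size(int8(25))       // fixed-width integer
	asFloat32 = binary.Size(float32(25)) // fixed-width float
	asString = len("25")                 // bytes of text only
	return asInt8, asFloat32, asString
}

func main() {
	const count = 1000000
	i, f, s := payloadSizes()
	fmt.Printf("int8:    %d byte(s) each, %d bytes per million\n", i, i*count)
	fmt.Printf("float32: %d byte(s) each, %d bytes per million\n", f, f*count)
	fmt.Printf("string:  %d byte(s) each, %d bytes per million, before per-string overhead\n", s, s*count)
}
```

The string figure is the optimistic case: most runtimes also store a length or pointer alongside each string, which is where the extra overhead comes from.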
So let’s say we want to know if our example value (25) is less than 30 (shown as 25 < 30 below). How does this work for each case? Note: we will ignore encoding speed for these tests and just look at the evaluation.
- Integer: 25 < 30 is about as cheap as an operation gets. Fixed-width integers are compared directly on their binary representation, typically in a single CPU instruction. Quick benchmark tests show these operations take about 0.5 nsec (Go benchmark).
- Float: 25 < 30 comparisons are slightly more complex (see some reading on floating point comparison to understand why). They are still fast on modern CPUs, which are designed specifically for these calculations, and also take ~0.5 nsec. So here the consideration is more around accuracy, size and encoding/decoding.
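- String: “25” < “30” is where things get interesting. If you look at the ASCII encoding for these chars you get:
“25” – 00110010 00110101
“30” – 00110011 00110000
Compared byte by byte, “25” does sort before “30”, but that is string ordering, not numeric ordering: by the same rule “100” also sorts before “30”.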
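As a rough illustration of why the raw bytes are not enough, the Go sketch below compares the same values once as text and once as parsed numbers. The helper names `lessAsText` and `lessAsNumbers` are invented for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// lessAsText compares the raw bytes of two strings, left to right.
func lessAsText(a, b string) bool {
	return a < b
}

// lessAsNumbers parses both strings as base-10 integers, then compares.
func lessAsNumbers(a, b string) (bool, error) {
	x, err := strconv.Atoi(a)
	if err != nil {
		return false, err
	}
	y, err := strconv.Atoi(b)
	if err != nil {
		return false, err
	}
	return x < y, nil
}

func main() {
	for _, pair := range [][2]string{{"25", "30"}, {"100", "30"}} {
		numeric, err := lessAsNumbers(pair[0], pair[1])
		if err != nil {
			fmt.Println("not a number:", err)
			continue
		}
		fmt.Printf("%q < %q  as text: %v, as numbers: %v\n",
			pair[0], pair[1], lessAsText(pair[0], pair[1]), numeric)
	}
}
```

The two approaches agree for “25” and “30” but disagree for “100” and “30”, because the text comparison stops at the first differing byte.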
How about with a type conversion? Converting the string to a number before each comparison is more expensive: about 14 nanoseconds per operation in this case, so roughly 30x slower than when a schema already guarantees a numeric type. Each conversion can also create garbage for the collector to clean up, which makes it even worse.
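The timings quoted here can be approximated with a small Go program along the following lines. This is a sketch, not the benchmark originally used: the function names and input values are assumptions, and the absolute numbers will differ by machine and Go version.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// sink keeps results alive so the compiler cannot discard the comparisons.
var sink bool

// lessInt is the typed case: both sides are already integers.
func lessInt(a, b int64) bool { return a < b }

// lessParsed is the untyped case: both sides must be converted first.
// Unparseable input is treated as "not less than".
func lessParsed(a, b string) bool {
	x, errA := strconv.ParseInt(a, 10, 64)
	y, errB := strconv.ParseInt(b, 10, 64)
	return errA == nil && errB == nil && x < y
}

// nsPerOp reports fractional nanoseconds per operation for a result.
func nsPerOp(r testing.BenchmarkResult) float64 {
	return float64(r.T.Nanoseconds()) / float64(r.N)
}

func main() {
	values := []string{"25", "7", "100", "30"}

	typed := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = lessInt(int64(i&31), 30)
		}
	})
	converted := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = lessParsed(values[i&3], "30")
		}
	})

	fmt.Printf("typed compare:     %.2f ns/op\n", nsPerOp(typed))
	fmt.Printf("convert + compare: %.2f ns/op\n", nsPerOp(converted))
}
```

The useful output is the ratio between the two lines rather than either number on its own.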
Performance-wise, using a schema to select how data is managed is helpful. By selecting the right schema type, you can improve data storage and operational speed. But this ignores the cost of making incorrect decisions about these constraints. What if you initially used an integer for your ID field, then later realized some of your IDs have letters? This would be a painful change to make: integers use less space and a totally different encoding, so you now need to convert every single stored ID.
If you’re querying this data at the same time, depending on the software being used, you may need to stop all queries to make this change. Yikes! If only you didn’t have that pesky schema constraint! This is where schemaless is interesting and has taken a totally different approach.
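The ID example can be sketched in a few lines of Go. The `migrateIDs` function and the sample IDs are hypothetical, but they show the shape of the problem: every existing value has to be read, converted and written again before the first non-numeric ID can be stored.

```go
package main

import (
	"fmt"
	"strconv"
)

// migrateIDs rewrites a column of fixed-width integer IDs as strings so
// that IDs containing letters can be stored alongside them afterwards.
// The cost grows with the number of stored IDs, not with the number of
// new ones.
func migrateIDs(old []uint32) []string {
	migrated := make([]string, len(old))
	for i, id := range old {
		migrated[i] = strconv.FormatUint(uint64(id), 10)
	}
	return migrated
}

func main() {
	ids := []uint32{1001, 1002, 1003}
	asText := migrateIDs(ids)
	asText = append(asText, "A-1004") // only representable after the change
	fmt.Println(asText)
}
```

In a live system the same rewrite also has to be coordinated with any queries reading the old column, which is where the downtime risk comes from.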
The rise of schemaless
Schemaless has become popular in modern systems because it is extremely easy to get started. “Easy” appeals to a very wide audience. No schema means no planning, no data model to think about, etc. All those headaches are gone — it’s very appealing. Anyone who has scaled a database like MySQL has probably hit issues when making schema changes. There are endless stories of outages from such attempts.
The pros of schemaless are great:
- No data modeling and upfront planning
- No complex schema changes
- Very fast iteration — quick to experiment, build and mock up new solutions
The cons are more nuanced:
- With no rules around data, things can get messy fast, especially with many people involved
- Type mismatches — think numbers stored as strings, etc
- There is a big performance cost for type conversion: garbage creation, extra CPU and RAM usage in general
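The type mismatch problem tends to show up at read time. Below is a minimal Go sketch, assuming a hypothetical JSON document with an "age" field that different writers have stored as either a number or a string; the `ageOf` helper is made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// ageOf extracts the hypothetical "age" field from a schemaless document.
// Without a schema, the reader has to handle every shape the value may take.
func ageOf(raw string) (int, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return 0, err
	}
	switch v := doc["age"].(type) {
	case float64: // encoding/json decodes JSON numbers into float64 here
		return int(v), nil
	case string: // a number that was stored as text
		return strconv.Atoi(v)
	default:
		return 0, fmt.Errorf("age has unsupported type %T", v)
	}
}

func main() {
	for _, raw := range []string{`{"age": 25}`, `{"age": "25"}`, `{"age": true}`} {
		age, err := ageOf(raw)
		fmt.Println(raw, "->", age, err)
	}
}
```

Every reader of the data needs the same defensive switch, and the string branch pays the conversion cost on each access.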
Do you need a schema?
As annoying as they can be, schemas are powerful. The rules are defined upfront and everyone has to play by them. Data can also be strongly typed, which has huge efficiency advantages and generally means fewer obscure, hard-to-trace errors.
It’s not just individual technologies that see the upside of schemas. Company-wide schema management via data governance strategies is increasingly common. Some companies even have data review committees to debate and decide on any proposed changes across all the products they use. Their engineers can’t put a new field into production without committee approval!
A schema is indeed a burden when setting up a product and this is even more so when thinking company wide. So why do people still do this? There are some really good reasons:
- One truth. The schema defines the rules for everyone.
- Less ambiguity when connecting and moving information between systems
- A schema gets the hard questions out of the way upfront (note: this is also a negative as it’s not always possible to know upfront)
- Fewer bugs and easier to understand code. Generally things either work or they don’t. There is no grey area
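The "either work or they don’t" point can be illustrated with a typed Go struct standing in for a schema. The `Person` type and `decodeStrict` helper are assumptions made for this sketch; the decoder calls are from Go's standard `encoding/json` package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Person plays the role of a schema: field names and types are fixed.
type Person struct {
	Name string `json:"name"`
	Age  uint8  `json:"age"`
}

// decodeStrict accepts a document only if it matches the Person schema.
// Wrong types, out-of-range numbers and unknown fields are all rejected.
func decodeStrict(raw string) (Person, error) {
	var p Person
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&p)
	return p, err
}

func main() {
	docs := []string{
		`{"name": "Ada", "age": 25}`,   // matches the schema
		`{"name": "Ada", "age": "25"}`, // number stored as a string
		`{"name": "Ada", "agee": 25}`,  // misspelled field
	}
	for _, raw := range docs {
		p, err := decodeStrict(raw)
		fmt.Println(raw, "->", p, err)
	}
}
```

Bad data is rejected at write time with a specific error, instead of surfacing later as a confusing query result.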
The downsides mirror the schemaless pros:
- Upfront planning and data modeling are required
- Schema changes can be annoying
- It’s slower to get up and running initially and more difficult to experiment
Schemas in search engines
Search engine queries are typically very long-tail in nature and can also traverse a lot of information, so efficient caching can sometimes take more resources than is worthwhile. They can also be very complex and/or do many simple operations, so there are significant performance advantages in supporting schemas.
Many developers (perhaps the majority) have grown up using dynamic languages and have never needed to understand the tradeoffs of static type systems at all. This makes schema decisions a barrier to actually getting something done. Search has not been immune to this, but making things perform well has invariably led back to using a schema.
For search, where millions of data points can be analyzed during a query that returns in just a few milliseconds, it matters. Amazon has reported that just 100ms of added latency can cost 1% in sales. That is a huge drop for any business at scale.
Simply put, a schema aligns much better to high performance search.
At Sajari we infer or require a defined data schema. Storage and query execution for predefined data types are highly optimized and very transparent from an integration perspective, which allows us to pass on the best cost to performance ratio we possibly can.
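To give a feel for what inferring a schema can mean, here is a deliberately simple Go sketch that picks the narrowest type able to hold every sample value of a field. It is an illustration only and not a description of how Sajari does it; the `inferType` function and its type labels are invented for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferType guesses the narrowest type that can hold every sample value:
// "integer" if all parse as integers, "float" if all parse as numbers,
// otherwise "string". The labels are illustrative, not a real API.
func inferType(samples []string) string {
	if len(samples) == 0 {
		return "string" // nothing to learn from, so stay permissive
	}
	allInts, allFloats := true, true
	for _, s := range samples {
		if _, err := strconv.ParseInt(s, 10, 64); err != nil {
			allInts = false
		}
		if _, err := strconv.ParseFloat(s, 64); err != nil {
			allFloats = false
		}
	}
	switch {
	case allInts:
		return "integer"
	case allFloats:
		return "float"
	default:
		return "string"
	}
}

func main() {
	fmt.Println(inferType([]string{"25", "30", "7"}))
	fmt.Println(inferType([]string{"25", "25.5"}))
	fmt.Println(inferType([]string{"1001", "A-1004"}))
}
```

Inference like this keeps the easy start of schemaless while still ending up with typed fields, though a wrong guess from a small sample has the same migration cost as a wrong manual choice.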
Whether you prefer schema or schemaless, we think Sajari can work for you. Try a free 14-day trial (no cc required!) and our new onboarding sequence will do the rest.