Demystifying table rotation, extension, and table cleaner
Overview
This guide is written by the ServiceNow Technical Support Performance team (All Articles). We are a global group of experts who help our customers with performance issues. If you have questions about the content of this article we will try to answer them here. However, if you have urgent questions or specific issues, please see the list of resources on our profile page: ServiceNowPerformanceGTS
One of the areas of the ServiceNow platform which doesn't appear to be particularly well understood (or documented) is table rotation. This article will attempt to demystify this concept and discuss its relationship to table cleaner. In some cases this might help improve the performance/scalability of new ServiceNow implementations and, even if this doesn't apply directly to the reader, it's always good to understand how things work under the covers.
First, let's explore the problem that table rotation attempts to address:
- Certain parts of the ServiceNow platform (think of logging as an example) tend to create huge data sets
- All log messages are written to a single table (syslog)
- It's reasonable to assume that users might actually want to query this data to troubleshoot issues
This, however, creates a potential issue - relational databases don't tend to perform well with huge data sets:
- A SQL query executed against a table containing millions (or even billions) of rows is likely to be slow
- A good database indexing strategy can help with this but only to a point - even with really good indexes data sets eventually grow so large that query performance still becomes frustratingly slow
It's reasonable to think that a simple solution to this is just to delete old data to keep tables to a manageable size - here too, however, there are pitfalls:
- Certain data (think of email as an example) simply cannot be deleted - many customers have legal requirements meaning that this data has to be stored for multiple years
- Relational databases simply aren't that good at deleting rows of data - depending on how this is accomplished it can cause:
  - Database tables to be locked for periods of time (an instant outage in the case of certain tables)
  - Rows simply to be marked as deleted (and not physically removed from underlying storage). Over time this causes fragmentation of tables, impacting query performance and a slow but inexorable increase in database size
How does table rotation help to deal with these issues:
Table rotation is configured via records in the sys_table_rotation table (accessible via 'System Definition -> Table Rotations'). In this table there will be a row for every table in the platform against which table rotation has been configured. One important field is 'Type' - this can be set to 'Rotation' or 'Extension' - for now let's consider what 'Rotation' means and come back to 'Extension' a little later.
An example of an out of box table against which rotation is configured is syslog - if you open this record you will see its configuration. Let's expand on what this means:
- The platform has a logical table named syslog which, via the user interface, you can interact with as you would expect
- This logical table, however, has been configured as type 'rotation' with 8 'rotations' and a duration of 7 days
- This means that instead of there being a single syslog table in the underlying database there are in fact 9 tables (the base table plus 8 'rotations') - note that within ServiceNow we often refer to these rotations as 'shards' or 'subtables'
Details of all the physical tables used in a rotation are held in the 'sys_table_rotation_schedule' table - let's look at this more closely, again using the example of syslog.
The first thing to note is that the base table (syslog) generally isn't used as part of the rotation so normally wouldn't be expected to hold any data.
It is used, however, as a 'template' for how each of the shards is structured. To explain further - let's say that a change needs to be made to the syslog table (e.g. increasing the length of a field). An administrator visits the user interface, makes the required change, and takes a well-earned break expecting that their work is complete. Note, however, that it isn't:
- The structure of the base table (syslog) will have been modified
- The structure of all of the shards (syslog000[x]) will still be as they were
- Any data written to the shards will therefore not be able to take advantage of the changes to the base table which might have unexpected consequences!
To enforce the change to the base table across all of the shards the administrator has to also visit the corresponding 'Table Rotation' record and click the 'Synchronize shards' UI action which then propagates out the changes.
The next thing to note is that each shard has 'valid_from' and 'valid_to' dates and times. These dictate when the platform will physically write data to each of the tables. For example:
- If the current date is Thursday 5th November 2020 then, right now, the platform will be recording log messages in shard syslog0003 (as the current date/time is covered by its valid_from/valid_to fields)
- Let's say we want to look at log messages from Monday 12th October 2020 - again using valid_from/valid_to it's clear that these will exist in shard syslog0007
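The shard-selection logic above can be sketched in a few lines. The schedule below is illustrative (shard names and date windows are hypothetical, mirroring the syslog example); on a real instance this information lives in sys_table_rotation_schedule:

```python
from datetime import datetime

# Illustrative shard schedule: (shard name, valid_from, valid_to).
# Names and windows are hypothetical, chosen to match the dates above.
schedule = [
    ("syslog0006", datetime(2020, 9, 30), datetime(2020, 10, 7)),
    ("syslog0007", datetime(2020, 10, 7), datetime(2020, 10, 14)),
    ("syslog0000", datetime(2020, 10, 14), datetime(2020, 10, 21)),
    ("syslog0001", datetime(2020, 10, 21), datetime(2020, 10, 28)),
    ("syslog0002", datetime(2020, 10, 28), datetime(2020, 11, 4)),
    ("syslog0003", datetime(2020, 11, 4), datetime(2020, 11, 11)),
]

def shard_for(ts, schedule):
    """Return the shard whose valid_from/valid_to window covers ts."""
    for name, valid_from, valid_to in schedule:
        if valid_from <= ts < valid_to:
            return name
    return None
```

So a row written on 5th November 2020 lands in syslog0003, while a query for 12th October 2020 needs to read syslog0007.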
Finally each shard has a field called 'offline' which is set to true or false - this dictates whether data from that table can be queried via the user interface. In the above example records from shard syslog0005 will not be able to be viewed via the user interface as that shard is considered offline (note that if it's absolutely necessary to retrieve data from an offline shard, support may be able to do this via the back end database).
What all of the above means is that when table rotation is configured:
- A number of shards will be created (depending on the 'rotations' defined in the sys_table_rotation record)
- Each shard will be given a valid_from and valid_to date/time
- The platform inserts data into a specific shard depending on the current date/time and the valid_from/valid_to of each of the shards
- As a result data is logically distributed across shards depending on the date/time each individual row was written by the platform
The final piece of the puzzle is what happens when the rotation runs out of shards - i.e. using the example above, the last valid_to date/time is 18th November - what happens on the 19th? In summary, when the platform moves between shards two things happen to the oldest (and offline) shard:
- All of its data is removed such that it is empty
- Its valid_from/valid_to dates are advanced to the next period in the future
By doing this table rotation always stays one shard ahead of where it's currently writing, meaning that switching to the next shard (when that date/time arises) is seamless.
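The recycling step can be sketched as follows (a simplified model using plain dictionaries; on a real instance the platform truncates the physical shard and updates sys_table_rotation_schedule):

```python
from datetime import datetime, timedelta

def rotate(schedule, duration_days=7):
    """Advance a rotation when it runs out of shards: empty the oldest
    shard and move its validity window one duration past the newest
    shard, so the rotation stays one shard ahead of where it writes.
    Each entry is a dict: {"name", "valid_from", "valid_to", "rows"}."""
    oldest = min(schedule, key=lambda s: s["valid_from"])
    newest = max(schedule, key=lambda s: s["valid_to"])
    oldest["rows"] = 0  # stands in for a TRUNCATE of the physical shard
    oldest["valid_from"] = newest["valid_to"]
    oldest["valid_to"] = newest["valid_to"] + timedelta(days=duration_days)
    return oldest["name"]
```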
Now that we know how rotations physically work, let's talk about the potential benefits:
Query performance: Table rotation's use of shards means that data will be logically distributed across a number of physical tables - this can give significant benefits
- Lets consider that our syslog table contains 80m records (a whole lot of data) - a single query running across all of these records might therefore be expected to be slow.
- Each shard, however, will likely contain around 10m records (a much smaller set of data) meaning that query performance is potentially much improved assuming SQL queries only target a single shard
Data deletion: As touched on previously, deletion of individual rows of data has various pitfalls. Table rotation, however, only ever deletes a whole shard's (table's) worth of data at once
- It doesn't need to do row by row deletion and instead can just truncate the corresponding shard
- This is a far better approach and avoids a number of the concerns associated with row based deletion
As with everything, however, there are some trade-offs - the biggest is SQL queries which have to traverse multiple shards. This behaviour is fully supported - the platform will simply UNION results from each shard - however this does have an implied performance hit. For example, if a list of all records from the syslog table is loaded via the user interface, a query which UNIONs every single shard together will be executed - ouch!
Queries running against a table using rotation should always define a relatively short window of time based on sys_created_on (as this is the field governing rotation) - this is known as 'time boxing' and ensures that the query will only hit a small number of shards (hopefully only a single shard), giving much improved query performance.
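To illustrate the difference, here is a minimal sketch using SQLite standing in for the instance database (shard names and data are hypothetical): an un-boxed query has to UNION every shard together, while a time-boxed query only needs the single shard whose validity window covers the requested dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three illustrative shards, each created from the same template structure.
shards = ["syslog0001", "syslog0002", "syslog0003"]
for s in shards:
    conn.execute(
        f"CREATE TABLE {s} (sys_id TEXT, sys_created_on TEXT, message TEXT)"
    )
conn.execute("INSERT INTO syslog0001 VALUES ('a', '2020-10-20', 'old message')")
conn.execute("INSERT INTO syslog0003 VALUES ('b', '2020-11-05', 'recent message')")

# Without time boxing, every shard has to be UNIONed together - expensive:
union_sql = " UNION ALL ".join(f"SELECT * FROM {s}" for s in shards)
all_rows = conn.execute(union_sql).fetchall()

# A query time boxed on sys_created_on only reads the one covering shard:
boxed_rows = conn.execute(
    "SELECT * FROM syslog0003 WHERE sys_created_on >= '2020-11-01'"
).fetchall()
```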
How table extensions differ:
Let's move on to table extensions. At the top of this article it was mentioned that customers might need to retain certain data sets for many years or even indefinitely - at a glance table rotation doesn't help with this as that strategy is designed to constantly recycle tables used for short lived/transient data. This is where extensions come in.
Fundamentally extensions are extremely similar to rotations:
- They are defined in exactly the same place as table rotations (the sys_table_rotation table) - in this case, however, the rotation type is set to 'Extension'
- They are based around a base table and a number of shards
- They are given a duration which determines the period of time across which each shard is written to
Note, however, that the shards in a table extension do not rotate - instead of the oldest shard being truncated and re-used, an extension simply creates new shards indefinitely, so data remains logically separated across shards by creation date and no data is ever deleted. The audit (sys_audit) table is an extension in many customer instances, which means that, over time, the logical audit table is backed by a whole number of physical shards.
NOTE: The sys_audit table is not extended out-of-the-box and, in general, it is not recommended by ServiceNow to enable extension on the sys_audit table. Having the sys_audit table on extension can increase the time it takes to build a "History Set" (see KB0791687).
As with table rotations, any newly created shard will use the base table as its template. In addition, it's still necessary to execute SQL queries which are time boxed - imagine what a SQL query across all of the above instance's sys_audit shards would look/perform like!
The only other point to consider with table extensions is that no data will ever be deleted from any of the shards in an extension - every extension you configure will grow forever! Also, deletion of data from an extension can be tricky (covered further below) - if you are considering configuring an extension, make sure that:
- Business requirements really justify an extension (i.e. data in the extension must be kept indefinitely)
- The only data written to the extension is data which really must be kept (as otherwise the size of your instance will continually grow at an elevated rate)
Regardless of the approach which is used table rotations and extensions can be an extremely powerful tool in handling large data sets.
Choosing between rotation or extension:
- Rotation or extension?
  - If the data being written is transient/only needs to be kept for a relatively short period of time, use a rotation
  - Extensions should only ever be used for data which must never be deleted
- Length of duration?
  - Determine the average rate at which data will be written to the rotation/extension and the largest data set which is likely to give reasonable SQL query performance
  - Let's say your application writes 1m records per day and a reasonable data set across which to query is 10m records
  - Duration should therefore be set to 10 days (or lower) to maintain good SQL query performance
- Number of rotations?
  - Obviously this only applies to table rotations and really depends on how long data needs to be kept/queried.
  - Remember that one shard will always be empty (as it's the next shard to be written to) and another shard will always be offline (cannot be queried via the user interface).
  - As an example, let's say shard duration is 10 days and data must be available to be queried for 50 days - in this case 7 rotations would be required (5 shards which can be queried, 1 offline shard, and 1 truncated shard)
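The sizing rule above can be expressed as a small helper (a sketch of the arithmetic only, not a platform API):

```python
import math

def rotations_needed(retention_days, shard_duration_days):
    """Shards required so data stays queryable for retention_days:
    enough shards to cover the retention window, plus one offline
    shard and one empty (pre-truncated) shard."""
    queryable = math.ceil(retention_days / shard_duration_days)
    return queryable + 2  # + offline shard + next (empty) shard
```

With 10-day shards and a 50-day retention requirement this gives the 7 rotations from the example above.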
How table cleaner fits into the puzzle:
Let's start with some important notes:
- Table cleaner is not supported with tables using extension / rotation - table cleaner rules can be configured against such tables but they will be silently skipped
- There is no benefit in configuring table cleaner against a rotated table anyway as, due to the way in which rotation works, any data stored in a rotated table is by definition transient
- There might be scenarios where old data needs to be cleaned up from an extended table, however this is not possible with table cleaner. In this scenario some other approach (such as using a script) will be required. Note that even if data is removed from an extended table (or any other table) space on disk will not be reclaimed. Instead free space will be created within the table for use by future changes to data (i.e. inserts). In the case of an old extension shard there will never be any new inserts, so this free space is useless. To physically reclaim space on disk the table must be rebuilt, so a case should be opened with ServiceNow support after data has been deleted.
- Table cleaner deletes with 'setWorkflow(false)' meaning that business rules, workflows, and flows which might be expected to trigger on record deletion will not in the context of table cleaner. This is important if you have business logic which depends on this type of functionality
Let's talk more about how table cleaner operates:
- Table cleaner is a scheduled job which runs, by default, once per hour - its purpose is to delete older records from tables in the instance thereby keeping table sizes (and query performance) manageable
- Tables against which to run/records to delete are governed by table cleaner rules (defined in the sys_auto_flush table) - each rule specifies:
  - The target table (i.e. which table to delete from)
  - A 'matchfield' - this must be a date/time field which can be compared to the current date/time
  - Age in seconds - the delta between the value in a record's 'matchfield' and the current date/time above which the record becomes a candidate for deletion
  - Conditions - a filter which allows further definition of exactly which records should be deleted (for example 'active = false AND state = closed')
  - Cascade delete - this determines whether records which reference the table / records being deleted are also considered for deletion (depending on their cascade delete setting). By default this is set to false so cascade deletes will not take place. Enabling cascade deletes may avoid dangling references being left in other tables, however it may negatively impact the throughput of table cleaner deletions
An out of box example is the table cleaner rule against the sp_log table - this deletes any records where sys_created_on is more than 90 days old (note that no condition is provided so there is no further filtering of records to remove).
Each time the table cleaner job starts it iterates over active table cleaner rules, deleting records from corresponding tables. Note that for each table it will generally:
- Get a count of all records in the table
- Get a count of records which are candidates for deletion
- Start to delete the records (this is generally done in 'batches' of sys_ids for optimal performance)
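The steps above can be sketched as follows (a simplified model using SQLite; the real job also honours conditions, cascade delete settings, and the per-table time box, and all names here are illustrative):

```python
import sqlite3
from datetime import datetime, timedelta

def clean_table(conn, table, matchfield, age_seconds, now, batch_size=1000):
    """Simplified sketch of one table cleaner pass over a single table:
    count all rows, count deletion candidates, then delete in batches."""
    cutoff = (now - timedelta(seconds=age_seconds)).isoformat()
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    candidates = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {matchfield} < ?", (cutoff,)
    ).fetchone()[0]
    deleted = 0
    while True:
        # Delete in batches of sys_ids rather than one enormous DELETE.
        ids = [row[0] for row in conn.execute(
            f"SELECT sys_id FROM {table} WHERE {matchfield} < ? LIMIT ?",
            (cutoff, batch_size),
        )]
        if not ids:
            break
        marks = ",".join("?" for _ in ids)
        conn.execute(f"DELETE FROM {table} WHERE sys_id IN ({marks})", ids)
        deleted += len(ids)
    return total, candidates, deleted
```

Note how both the candidate count and the batched sys_id lookups filter on the matchfield - this is exactly why a supporting index on that column matters so much (as discussed below).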
This strategy does have a number of potential pitfalls, e.g.:
- If the table is extremely large, obtaining the count of all records may take an excessive amount of time. Note, however, that in recent releases the SQL query to get this record count has improved (it now estimates total rows in the table), so this is less of a concern than it used to be
- If the table does not have a supporting index on the matchfield column, obtaining a count of records which are candidates for deletion or obtaining sys_ids for deletion can take an excessive amount of time
This means that table cleaner does not perform well against extremely large tables (i.e. it's maybe not a good candidate to use against a large sys_audit table) and it is an absolute requirement to have a supporting database index on the physical column used as matchfield. If these conditions are not met, table cleaner will still run, however:
- It might trigger extremely slow/expensive database queries which can have the potential to impact underlying database server CPU
- It may spend a huge amount of time determining/deleting records - by default table cleaner will only spend 20 minutes maximum deleting from a single table so if queries are slow the volume of records deleted in 20 minutes may be small
That being said, table cleaner is an extremely good strategy for regularly deleting relatively small numbers of records from a table to keep overall table size under control.
One final point is that, by default, table cleaner will track the volume of data removed from a table. If this is greater than 50% of all rows then table cleaner will trigger a compaction of the table. In short, this reorganises the physical storage of table/index data and removes fragmentation, again helping to keep performance of SQL queries against the table as expected. If, for some reason, compaction is not required, the sys_dictionary collection record for the table can have the 'no_optimize=true' attribute added. Note that with compaction disabled table cleaner will skip getting a total count of records in the table - in some cases (where the table is extremely large) this might help avoid an expensive/slow SQL query, but it does mean that table cleaner will never compact the table.
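The compaction decision amounts to a simple threshold check (a sketch only; the actual threshold is controlled via system properties):

```python
def should_compact(total_rows, deleted_rows, threshold=0.5):
    """Table cleaner triggers a table compaction when more than the
    threshold fraction of rows was removed (0.5, i.e. 50%, by default)."""
    return total_rows > 0 and (deleted_rows / total_rows) > threshold
```

For example, deleting 60k of 100k rows would trigger a rebuild, while deleting 40k would not.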
Note that table cleaner is relatively configurable - there are various system properties controlling aspects such as the percentage of records which have to be removed before triggering table compaction, how long it will run against a single table, and so on. All combinations are too lengthy to list here - generally the out of box defaults work well, however if there is a need to tune them, support should be your first port of call.
A closing comment:
Hopefully this article helps to at least explain the purpose of table rotation, extension, and table cleaner. Unfortunately these are complex topics so, as a final note, if changes are being made in these areas always test them first in a sub production instance and if there is any doubt call support first. Support engineers are generally much happier discussing the best way to achieve a good outcome than dealing with the aftermath of unintended data loss!
https://www.servicenow.com/community/now-platform-articles/demystifying-table-rotation-extension-and-table-cleaner/ta-p/2323658