Tables without Clustered Indexes?

Today, we experienced performance issues with some of the SSRS reports that were deployed as part of the latest application/database release. While investigating this performance problem, I realized that the underlying tables for these report queries do not have clustered index. I was even more surprised when I realized that some of these tables are huge and have few million rows in them. A good practice is to have a clustered index on all tables in SQL Server, as it helps to improve query performance. This is due to the fact that clustered indexes affect a table’s physical sort order, and a table that doesn’t have a clustered index is stored in a set of data pages called a heap where:

  • Data is stored in the order in which it is entered.
  • Rows are not stored in any particular order.
  • Pages aren’t sequenced in any particular order.
  • There is not a linked list linking the data pages.

When a table has clustered index, SQL Server physically sorts table rows in clustered index order based on clustered index key column values. In short, leaf node of clustered index contains data pages, and scanning them will return the actual data rows. Therefore, table can have only one clustered index.

When to have a clustered index on table?

Although it is not mandatory to have a clustered index per table, but, according to the MSDN article (Clustered Index Design Guidelines), with few exceptions, every table should have a clustered index defined on the column or columns that used as follows:

  • The table is large and does not have nonclustered index. Having clustered index improves performance, because without it, all rows of the table should be read to find any row.
  • Column or columns are frequently queried and data is returned in sorted ordered. Having clustered index on the sorting column or columns prevents sorting operation and returns the data in sorted order.
  • Column or columns are frequently queried and data is grouped together. As data must be sorted before it is grouped, having clustered index on the sorting column or columns prevents sorting operation.
  • Column or columns data that are frequently used in queries to search data ranges from the table. Having clustered indexes on the range column will avoid sorting entire table data.

So in order to resolve these performance issues, I re-wrote these queries and created clustered indexes on tables where appropriate. Moreover, I analyse further, and used the following two queries, to find out which tables in other databases do not have a clustered index defined.

The first query return names of all tables with row count greater than specified threshold, and do not have a clustered index defined. This query inner joins sys.tables system catalog to sys.dm_db_partition_stats dynamic management view to obtain this information (See below):

DECLARE @MinTableRowsThreshold [int];

SET @MinTableRowsThreshold = 5000;

;WITH    [TablesWithoutClusteredIndexes] ( [db_name], [table_name], [table_schema], [row_count] )
          AS ( SELECT   DB_NAME() ,
                        t.[name] ,
                        SCHEMA_NAME(t.[schema_id]) ,
                        SUM(ps.[row_count])
               FROM     [sys].[tables] t
                        INNER JOIN [sys].[dm_db_partition_stats] ps
						ON ps.[object_id] = t.[object_id]
               WHERE    OBJECTPROPERTY(t.[object_id], N'TableHasClustIndex') = 0
                        AND ps.[index_id] < 2
               GROUP BY t.[name] ,
                        t.[schema_id] )
    SELECT  *
    FROM    [TablesWithoutClusteredIndexes]
    WHERE   [row_count] > @MinTableRowsThreshold;

The second query is slightly a modified version of first query and returns the names of actively queried tables with row count greater than specified threshold, and do not have a clustered index defined. This query inner joins sys.dm_db_index_usage_stats to the first query to identify actively queried tables (See below):

DECLARE @MinTableRowsThreshold [int];

SET @MinTableRowsThreshold = 5000;

;WITH    [TablesWithoutClusteredIndexes] ( [db_name], [table_name], [table_schema], [row_count] )
          AS ( SELECT   DB_NAME() ,
                        t.[name] ,
                        SCHEMA_NAME(t.[schema_id]) ,
                        SUM(ps.[row_count])
               FROM     [sys].[tables] t
                        INNER JOIN [sys].[dm_db_partition_stats] ps
						ON ps.[object_id] = t.[object_id]
			            INNER JOIN [sys].[dm_db_index_usage_stats] us
						ON ps.[object_id] = us.[object_id]
               WHERE    OBJECTPROPERTY(t.[object_id], N'TableHasClustIndex') = 0
                        AND ps.[index_id] < 2
			AND COALESCE(us.[user_seeks] ,
				         us.[user_scans] ,
				         us.[user_lookups] ,
				         us.[user_updates]) IS NOT NULL
               GROUP BY t.[name] ,
                        t.[schema_id] )
    SELECT  *
    FROM    [TablesWithoutClusteredIndexes]
    WHERE   [row_count] > @MinTableRowsThreshold;

I hope you will find this post informative. For further information about clustered index design guideline, see MSDN resource here.

Further Reading:

Advertisement

Different approaches of counting number of rows in a table

Today, I received an email from the developer asking if there is a better way instead of the COUNT (*) Transact-SQL statement, to count the number of records in a table. My reply to his question is yes, there are several methods to get this information from SQL Server. However, none of these approaches are perfect, and has its own disadvantages. In this blog post, I will show different methods of count number of rows in a table.

Before discussing the different approaches, I must emphasize that the COUNT (*) statement gives you the true count of the total number of rows in a table. The COUNT (*) statement performs the full table scan on heap table and cluster index scan on tables with clustered index, to get the exact count of the records in a table. Because of this, it can get slower as the table gets bigger, as effectively it is counting each row separately, which includes the rows that contains null values. For more information about COUNT (*), see MSDN resource here.

 Let’s take a look at different approaches of counting number of records in a table:

Approach 1: Counting rows using sys.partitions catalog view

As we know that we can use sys.partitions catalog view to check the structure of the table. This catalog returns one row for each partition of all tables and most types of indexes (except Full-Text, Spatial, and XML are not included in this view) in the database. We can join the sys.partitions catalog view with sys.tables catalog view to quickly get row count for all tables in the database. Here is the sample sys.partitions query, which I use to get the row count for all tables in the database:

SET NOCOUNT ON;
SET STATISTICS IO ON;

-- Ensure a USE [databasename] statement has been executed first.
SELECT SCHEMA_NAME(t.[schema_id]) AS [table_schema]
      ,OBJECT_NAME(p.[object_id]) AS [table_name]
      ,SUM(p.[rows]) AS [row_count]
FROM [sys].[partitions] p
INNER JOIN [sys].[tables] t ON p.[object_id] = t.[object_id]
WHERE p.[index_id] < 2
GROUP BY p.[object_id]
	,t.[schema_id]
ORDER BY 1, 2 ASC
OPTION (RECOMPILE);

Here is output when I run it against AdventureWork2012 system database:

01

Although it is one of the fastest ways to count the number of rows in a table, however, according to sys.partitions documentation, the count is not always accurate. I personally could not find anything on the MSDN that tells me the cases where the count will not be accurate. The only situation in which I found count is not accurate is when I run this query while DML operations are in progress against the table for which I am counting.

Approach 2: Counting table rows using sys.dm_db_partition_stats dynamic management view

Like sys.partitions, we can use sys.dm_db_partition_stats dynamic management view to count the number of rows in a table. This dynamic management view contains row-count information for every partition and displays the information about the space used to store and manage different data allocation unit types.  According to MSDN, the row_count column of sys.dm_db_partition_stats dynamic management view is approximate value, and Microsoft never reveals cases where counts are not accurate. However, similar to sys.partitions catalog view, the only situation in which I found count is not accurate is when I use it while DML operations are in progress against the table for which I am counting. Here is my query, which combines sys.dm_db_partition_stats dynamic management view and sys.tables catalog view, to get the row count for all tables in the database:

SET NOCOUNT ON;
SET STATISTICS IO ON;

-- Ensure a USE [databasename] statement has been executed first.
SELECT SCHEMA_NAME(t.[schema_id]) AS [table_schema]
      ,t.[name] AS [table_name]
      ,SUM(ps.[row_count]) AS [row_count]
FROM [sys].[tables] t
INNER JOIN [sys].[dm_db_partition_stats] ps
     ON ps.[object_id] = t.[object_id]
WHERE [index_id] < 2
GROUP BY t.[name]
	,t.[schema_id]
ORDER BY 1, 2 ASC
OPTION (RECOMPILE);

Here is output when I run it against AdventureWork2012 system database:

02

When I compared STATISTICS IO output of both queries, I noticed that the sys.dm_db_partition_stats dynamic management view query version is slightly faster than sys.partitions catalog view query version. Examine the STATISTICS IO output of both queries that shows that SQL Server is doing a less work to retrieve the row count information using sys.dm_db_partition_stats dynamic management view query version.

STATISTICS IO output for sys.partitions version of query:

Table 'syssingleobjrefs'. Scan count 75, logical reads 151, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysidxstats'. Scan count 77, logical reads 175, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 0, logical reads 768, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysrowsets'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

STATISTICS IO output for sys.dm_db_partition_stats version of query:

Table 'syssingleobjrefs'. Scan count 75, logical reads 151, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysidxstats'. Scan count 76, logical reads 164, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'sysschobjs'. Scan count 0, logical reads 384, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Approach 3: Counting table rows using sp_spaceused system stored procedure

We can use the sp_spaceused system-stored procedure to return space usage information about a database or a table within a database. The syntax for this procedure is:

sp_spaceused [@objectname=['objectname']]
[,[@updateusage=['TRUE|FASLE']]]

You use the objectname parameter to enter the name of a table, indexed view, or Service Broker queue to retrieve space usage information specific to that object. If you set the updateusage to TRUE, DBCC UPDATEUSAGE updates usage information. The default value for this parameter is FALSE. You can run the command without any parameters to retrieve information about the current database:

EXEC [sp_spaceused]

This returns the following information:

  • Current database name
  • Current database size
  • Unallocated space
  • Reserved space
  • Space used by data
  • Space used by indexes
  • Unused space

To view information for a table in the current database, you must include the table name. For example, to look at information for the Person.Address table in the AdventureWorks2012 database, you would run:

EXEC [sp_spaceused] 'Person.Address'

This would return:

  • Table name
  • Number of rows in the table
  • Reserved space
  • Space used by data
  • Space used by indexes
  • Unused space

Here is the sample output in SSMS:

03

It is not necessary to include the parameter names (@parametername =) when running this command.

Finally, here is my script that uses sp_spaceused system-stored procedure to return the table usage information for all database tables, which includes count of rows in the table:

SET NOCOUNT ON;

-- Ensure a USE [databasename] statement has been executed first.
DECLARE @Database		            [nvarchar] (256)
       ,@TSQLCommand01              [nvarchar] (MAX)

SET @Database = DB_NAME()

IF OBJECT_ID(N'TempDb.dbo.#Table_Size_Info') IS NOT NULL
	DROP TABLE #Table_Size_Info

CREATE TABLE #Table_Size_Info (
	 [ID] [int] IDENTITY(1, 1) PRIMARY KEY
	,[ObjectName] [sysname]
	,[NumRows] [bigint]
	,[Reserved] [varchar](30)
	,[Data] [varchar](30)
	,[IndexSize] [varchar](30)
	,[Unused] [varchar](30)
	,[ObjectType] [char](1)
	)

SET @TSQLCommand01 = N''
SET @TSQLCommand01 = N'USE' + SPACE(1) + QUOTENAME(@Database) + N';' + SPACE(1) + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'DECLARE @SQLStatementID02 [smallint] ,' + SPACE(1) + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'@CurrentObjectSchema         [sysname] ,' + SPACE(1) + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'@CurrentObjectName           [sysname] ,' + SPACE(1) + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'@CurrentObjectFullName [sysname] ,' + SPACE(1) + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'@CurrentObjectType           [char](1)' + SPACE(1) + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'DECLARE @AllObjects TABLE ( [ID] [int] IDENTITY(1, 1) PRIMARY KEY , [ObjectSchema] [sysname] , [ObjectName] [sysname] , [ObjectType] [char](1) , [Completed] [bit] );' + SPACE(1) + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'INSERT INTO @AllObjects ([ObjectSchema], [ObjectName], [ObjectType], [Completed])' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'SELECT  [TABLE_SCHEMA] , [TABLE_NAME] , N''T'' , 0' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'FROM    [INFORMATION_SCHEMA].[TABLES]' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'WHERE   [TABLE_TYPE] = N''BASE TABLE''' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'AND CHARINDEX(N'''''''' , [TABLE_NAME]) = 0' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'ORDER BY [TABLE_SCHEMA], [TABLE_NAME]' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'SELECT @SQLStatementID02 = MIN([ID]) FROM @AllObjects WHERE [Completed] = 0' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'WHILE @SQLStatementID02 IS NOT NULL' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'BEGIN' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + CHAR(9) + N'SELECT @CurrentObjectSchema = [ObjectSchema] , @CurrentObjectName = [ObjectName] , @CurrentObjectType = [ObjectType]' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + CHAR(9) + N'FROM @AllObjects WHERE [ID] = @SQLStatementID02' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + CHAR(9) + N'SET @CurrentObjectFullName = QUOTENAME(@CurrentObjectSchema) + ''.'' + QUOTENAME(@CurrentObjectName)' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + CHAR(9) + N'INSERT INTO #Table_Size_Info ([ObjectName] , [NumRows] , [Reserved] , [Data] , [IndexSize] , [Unused] )' + CHAR(13) + N'EXEC [sp_spaceused] @CurrentObjectFullName' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + CHAR(9) + N'UPDATE #Table_Size_Info SET [ObjectName] = @CurrentObjectFullName , [ObjectType] = @CurrentObjectType WHERE [ID] = SCOPE_IDENTITY();' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + CHAR(9) + N'UPDATE @AllObjects' + CHAR(13) + N'SET [Completed] = 1' + CHAR(13) + N'WHERE [ID] = @SQLStatementID02' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + CHAR(9) + N'SELECT @SQLStatementID02 = MIN([ID]) FROM @AllObjects WHERE [Completed] = 0' + CHAR(13)
SET @TSQLCommand01 = @TSQLCommand01 + N'END' + CHAR(13)

EXEC [sp_executesql] @TSQLCommand01

SELECT *
FROM #Table_Size_Info
GO

Here is the output when I executed against AdventureWorks2012 database:

04

I have also compared STATISTICS IO output of this query with above two queries and noticed that it is much slower than other two queries that I shared in this blog post, but still much faster than equivalent SELECT COUNT(*) version.

I hope you liked this post 🙂 . Please feel free to use comment section, If you have any question regarding this post. You can also drop me question via twitter (@BasitAali).

Transact-SQL query to obtain basic information about partitioned tables

Here is the simple query that returns basic information about all tables in a database that are partitioned:

SELECT SCHEMA_NAME([schema_id]) AS [schema_name]
      ,t.[name] AS [table_name]
      ,i.[name] AS [index_name]
      ,i.[type_desc] AS [index_type]
      ,ps.[name] AS [partition_scheme]
      ,pf.[name] AS [partition_function]
      ,p.[partition_number]
      ,r.[value] AS [current_partition_range_boundary_value]
      ,p.[rows] AS [partition_rows]
      ,p.[data_compression_desc]
FROM sys.tables t
INNER JOIN sys.partitions p ON p.[object_id] = t.[object_id]
INNER JOIN sys.indexes i ON p.[object_id] = i.[object_id]
                           AND p.[index_id] = i.[index_id]
INNER JOIN sys.data_spaces ds ON i.[data_space_id] = ds.[data_space_id]
INNER JOIN sys.partition_schemes ps ON ds.[data_space_id] = ps.[data_space_id]
INNER JOIN sys.partition_functions pf ON ps.[function_id] = pf.[function_id]
LEFT JOIN sys.partition_range_values AS r ON pf.[function_id] = r.[function_id]
    AND r.[boundary_id] = p.[partition_number]
GROUP BY SCHEMA_NAME([schema_id])
        ,t.[name]
        ,i.[name]
        ,i.[type_desc]
        ,ps.[name]
        ,pf.[name]
        ,p.[partition_number]
        ,r.[value]
        ,p.[rows]
        ,p.[data_compression_desc]
ORDER BY SCHEMA_NAME([schema_id])
        ,t.[name]
        ,i.[name]
        ,p.[partition_number];

The query returns the following columns:

  • schema_name – name of the partitioned table schema.
  • table_name – name of the partitioned table.
  • index_name – name of the partitioned index.
  • index_type – type of the partitioned index.
  • partition_scheme – name of the partition scheme.
  • partition_function – name of the partition function.
  • partition_number – indicate the partition number.
  • current_partition_range_boundary_value – partition actual boundary value.
  • partition_rows – indicates approximate number of rows in current partition.
  • data_compression_desc – Indicates the state of compression for each partition.

This example query uses following system views: sys.tables, sys.partitions, sys.indexes, sys.partition_schemes, sys.partition_functions and sys.partition_range_values.

For more information on table partitioning, see “Partitioned Tables and Indexes” on MSDN website.