Sniffing out SQL Code Smells: Inconsistent use of Symbolic names and Datatypes

Posted by Phil Factor on Simple Talk See other posts from Simple Talk or by Phil Factor
Published on Fri, 10 Dec 2010 01:01:00 GMT Indexed on 2010/12/16 4:16 UTC
Read the original article Hit count: 534

Filed under:

It is an awkward feeling. You’ve just delivered a database application that seems to be working fine in production, and you just run a few checks on it. You discover that there is a potential bug that, out of sheer good chance, hasn’t kicked in to produce an error; but it lurks, like a smoking bomb. Worse, maybe you find that the bug has started its evil work of corrupting the data, but in ways that nobody has, so far detected. You investigate, and find the damage. You are somehow going to have to repair it.

Yes, it still very occasionally happens to me. It is not a nice feeling, and I do anything I can to prevent it happening. That’s why I’m interested in SQL code smells. SQL Code Smells aren’t necessarily bad practices, but just show you where to focus your attention when checking an application.

Sometimes with databases the bugs can be subtle. SQL is rather like HTML: the language does its best to try to carry out your wishes, rather than to be picky about your bugs. Most of the time, this is a great benefit, but not always. One particular place where this can be detrimental is where you have implicit conversion between different data types. Most of the time it is completely harmless but we’re concerned about the occasional time it isn’t. Let’s give an example: String truncation. Let’s give another even more frightening one, rounding errors on assignment to a number of different precision. Each requires a blog-post to explain in detail and I’m not now going to try. Just remember that it is not always a good idea to assign data to variables, parameters or even columns when they aren’t the same datatype, especially if you are relying on implicit conversion to work its magic.For details of the problem and the consequences, see here: SR0014: Data loss might occur when casting from {Type1} to {Type2} . For any experienced Database Developer, this is a more frightening read than a Vampire Story.

This is why one of the SQL Code Smells that makes me edgy, in my own or other peoples’ code, is to see parameters, variables and columns that have the same names and different datatypes. Whereas quite a lot of this is perfectly normal and natural, you need to check in case one of two things have gone wrong. Either sloppy naming, or mixed datatypes. Sure it is hard to remember whether you decided that the length of a log entry was 80 or 100 characters long, or the precision of a number. That is why a little check like this I’m going to show you is excellent for tidying up your code before you check it back into source Control!

1/ Checking Parameters only

If you were just going to check parameters, you might just do this. It simply groups all the parameters, either input or output, of all the routines (e.g. stored procedures or functions) by their name and checks to see, in the HAVING clause, whether their data types are all the same. If not, it lists all the examples and their origin (the routine)

Even this little check can occasionally be scarily revealing.

;WITH userParameter AS
( SELECT
  c.NAME AS ParameterName,
  OBJECT_SCHEMA_NAME(c.object_ID) + '.' + OBJECT_NAME(c.object_ID) AS ObjectName,
  t.name + ' '
    + CASE
    --we may have to put in the length
         WHEN t.name IN ('char', 'varchar', 'nchar', 'nvarchar')
             THEN '('
              + CASE WHEN c.max_length = -1 THEN 'MAX'
                ELSE CONVERT(VARCHAR(4),
                    CASE WHEN t.name IN ('nchar', 'nvarchar')
                      THEN c.max_length / 2 ELSE c.max_length
                    END)
                END + ')'
         WHEN t.name IN ('decimal', 'numeric')
             THEN '(' + CONVERT(VARCHAR(4), c.precision)
                  + ',' + CONVERT(VARCHAR(4), c.Scale) + ')'
         ELSE ''
      END  --we've done with putting in the length
      + CASE WHEN XML_collection_ID <> 0
         THEN --deal with object schema names
             '(' + CASE WHEN is_XML_Document = 1
                    THEN 'DOCUMENT '
                    ELSE 'CONTENT '
                   END
             + COALESCE(
               (SELECT QUOTENAME(ss.name) + '.' + QUOTENAME(sc.name)
                FROM sys.xml_schema_collections sc
                INNER JOIN Sys.Schemas ss ON sc.schema_ID = ss.schema_ID
                WHERE sc.xml_collection_ID = c.XML_collection_ID),'NULL') + ')'
          ELSE ''
         END
       AS [DataType]
  FROM sys.parameters c
  INNER JOIN sys.types t ON c.user_Type_ID = t.user_Type_ID
  WHERE OBJECT_SCHEMA_NAME(c.object_ID) <> 'sys'
   AND parameter_id>0)
SELECT CONVERT(CHAR(80),objectName+'.'+ParameterName),DataType FROM UserParameter
WHERE ParameterName IN
  (SELECT ParameterName FROM UserParameter
   GROUP BY ParameterName
   HAVING MIN(Datatype)<>MAX(DataType))
ORDER BY ParameterName

so, in a very small example here, we have a @ClosingDelimiter variable that is only CHAR(1) when, by the looks of it, it should be up to ten characters long, or even worse, a function that should be a char(1) and seems to let in a string of ten characters. Worth investigating. Then we have a @Comment variable that can't decide whether it is a VARCHAR(2000) or a VARCHAR(MAX)

2/ Columns and Parameters

Actually, once we’ve cleared up the mess we’ve made of our parameter-naming in the database we’re inspecting, we’re going to be more interested in listing both columns and parameters. We can do this by modifying the routine to list columns as well as parameters. Because of the slight complexity of creating the string version of the datatypes, we will create a fake table of both columns and parameters so that they can both be processed the same way. After all, we want the datatypes to match

Unfortunately, parameters do not expose all the attributes we are interested in, such as whether they are nullable (oh yes, subtle bugs happen if this isn’t consistent for a datatype). We’ll have to leave them out for this check.

Voila! A slight modification of the first routine

;WITH userObject AS
( SELECT
  Name AS DataName,--the actual name of the parameter or column ('@' removed)
  --and the qualified object name of the routine
  OBJECT_SCHEMA_NAME(ObjectID) + '.' + OBJECT_NAME(ObjectID) AS ObjectName,
  --now the harder bit: the definition of the datatype.
  TypeName + ' '
    + CASE
    --we may have to put in the length. e.g. CHAR (10)
         WHEN TypeName IN ('char', 'varchar', 'nchar', 'nvarchar')
             THEN '('
              + CASE WHEN MaxLength = -1 THEN 'MAX'
                ELSE CONVERT(VARCHAR(4),
                    CASE WHEN TypeName IN ('nchar', 'nvarchar')
                      THEN MaxLength / 2 ELSE MaxLength
                    END)
                END + ')'
         WHEN TypeName IN ('decimal', 'numeric')--a BCD number!
             THEN '(' + CONVERT(VARCHAR(4), Precision)
                  + ',' + CONVERT(VARCHAR(4), Scale) + ')'
         ELSE ''
      END  --we've done with putting in the length
      + CASE WHEN XML_collection_ID <> 0 --tush tush. XML
         THEN --deal with object schema names
             '(' + CASE WHEN is_XML_Document = 1
                    THEN 'DOCUMENT '
                    ELSE 'CONTENT '
                   END
             + COALESCE(
               (SELECT TOP 1 QUOTENAME(ss.name) + '.' + QUOTENAME(sc.Name)
                FROM sys.xml_schema_collections sc
                INNER JOIN Sys.Schemas ss ON sc.schema_ID = ss.schema_ID
                WHERE sc.xml_collection_ID = XML_collection_ID),'NULL') + ')'
          ELSE ''
         END
       AS [DataType],
       DataObjectType
  FROM
  (Select t.name AS TypeName, REPLACE(c.name,'@','') AS Name,
          c.max_length AS MaxLength, c.precision AS [Precision],
          c.scale AS [Scale], c.[Object_id] AS ObjectID, XML_collection_ID,
          is_XML_Document,'P' AS DataobjectType
  FROM sys.parameters c
  INNER JOIN sys.types t ON c.user_Type_ID = t.user_Type_ID
  AND parameter_id>0
  UNION all
  Select t.name AS TypeName, c.name AS Name, c.max_length AS MaxLength,
          c.precision AS [Precision], c.scale AS [Scale],
          c.[Object_id] AS ObjectID, XML_collection_ID,is_XML_Document,
          'C' AS DataobjectType

  FROM sys.columns c
  INNER JOIN sys.types t ON c.user_Type_ID = t.user_Type_ID
  WHERE OBJECT_SCHEMA_NAME(c.object_ID) <> 'sys'
  )f)
SELECT CONVERT(CHAR(80),objectName+'.'
   + CASE WHEN DataobjectType ='P' THEN '@' ELSE '' END + DataName),DataType
FROM UserObject
WHERE DataName IN
  (SELECT DataName FROM UserObject
   GROUP BY DataName
   HAVING MIN(Datatype)<>MAX(DataType))
ORDER BY DataName

Hmm. I can tell you I found quite a few minor issues with the various tabases I tested this on, and found some potential bugs that really leap out at you from the results. Here is the start of the result for AdventureWorks. Yes, AccountNumber is, for some reason, a Varchar(10) in the Customer table. Hmm. odd. Why is a city fifty characters long in that view? The idea of the description of a colour being 256 characters long seems over-ambitious. Go down the list and you'll spot other mistakes. There are no bugs, but just mess.

We started out with a listing to examine parameters, then we mixed parameters and columns. Our last listing is for a slightly more in-depth look at table columns. You’ll notice that we’ve delibarately removed the indication of whether a column is persisted, or is an identity column because that gives us false positives for our code smells. If you just want to browse your metadata for other reasons (and it can quite help in some circumstances) then uncomment them!

;WITH userColumns AS
( SELECT
  c.NAME AS columnName,
  OBJECT_SCHEMA_NAME(c.object_ID) + '.' + OBJECT_NAME(c.object_ID) AS ObjectName,
  REPLACE(t.name + ' '
  + CASE WHEN is_computed = 1 THEN ' AS ' + --do DDL for a computed column
          (SELECT definition FROM sys.computed_columns cc
           WHERE cc.object_id = c.object_id AND cc.column_ID = c.column_ID)
    --we may have to put in the length
         WHEN t.Name IN ('char', 'varchar', 'nchar', 'nvarchar')
             THEN '('
              + CASE WHEN c.Max_Length = -1 THEN 'MAX'
                ELSE CONVERT(VARCHAR(4),
                    CASE WHEN t.Name IN ('nchar', 'nvarchar')
                      THEN c.Max_Length / 2 ELSE c.Max_Length
                    END)
                END + ')'
       WHEN t.name IN ('decimal', 'numeric')
       THEN '(' + CONVERT(VARCHAR(4), c.precision) + ',' + CONVERT(VARCHAR(4), c.Scale) + ')'
       ELSE ''
      END + CASE WHEN c.is_rowguidcol = 1
          THEN ' ROWGUIDCOL'
          ELSE ''
         END + CASE WHEN XML_collection_ID <> 0
            THEN --deal with object schema names
             '(' + CASE WHEN is_XML_Document = 1
                THEN 'DOCUMENT '
                ELSE 'CONTENT '
               END + COALESCE((SELECT
                QUOTENAME(ss.name) + '.' + QUOTENAME(sc.name)
                FROM
                sys.xml_schema_collections sc
                INNER JOIN Sys.Schemas ss ON sc.schema_ID = ss.schema_ID
                WHERE
                sc.xml_collection_ID = c.XML_collection_ID),
                'NULL') + ')'
            ELSE ''
           END + CASE WHEN is_identity = 1
             THEN CASE WHEN OBJECTPROPERTY(object_id,
                'IsUserTable') = 1 AND COLUMNPROPERTY(object_id,
                c.name,
                'IsIDNotForRepl') = 0 AND OBJECTPROPERTY(object_id,
                'IsMSShipped') = 0
                THEN ''
                ELSE ' NOT FOR REPLICATION '
               END
             ELSE ''
            END + CASE WHEN c.is_nullable = 0
               THEN ' NOT NULL'
               ELSE ' NULL'
              END + CASE
                WHEN c.default_object_id <> 0
                THEN ' DEFAULT ' + object_Definition(c.default_object_id)
                ELSE ''
               END + CASE
                WHEN c.collation_name IS NULL
                THEN ''
                WHEN c.collation_name <> (SELECT
                collation_name
                FROM
                sys.databases
                WHERE
                name = DB_NAME()) COLLATE Latin1_General_CI_AS
                THEN COALESCE(' COLLATE ' + c.collation_name,
                '')
                ELSE ''
                END,'  ',' ') AS [DataType]
FROM sys.columns c
  INNER JOIN sys.types t ON c.user_Type_ID = t.user_Type_ID
  WHERE OBJECT_SCHEMA_NAME(c.object_ID) <> 'sys'

)
SELECT CONVERT(CHAR(80),objectName+'.'+columnName),DataType FROM UserColumns
WHERE columnName IN
(SELECT columnName FROM UserColumns
GROUP BY columnName
HAVING MIN(Datatype)<>MAX(DataType))
ORDER BY columnName

If you take a look down the results against Adventureworks, you'll see once again that there are things to investigate, mostly, in the illustration, discrepancies between null and non-null datatypes

So I here you ask, what about temporary variables within routines? If ever there was a source of elusive bugs, you'll find it there. Sadly, these temporary variables are not stored in the metadata so we'll have to find a more subtle way of flushing these out, and that will, I'm afraid, have to wait!

Developer IT