While I Compile

… I compile my thoughts about programming

An Abstract Data Model

This is post 3 of a 7-part series entitled Technical Achievements in my Last Project.

[Illustration: Usual N-Tier Application Model]

Overview
Normally, when I build a new system, I design the new data model based on the requirements, and build my business objects and data access primarily on that data model*. The rest of the application is built on the components beneath it, so when you change something at the bottom, like the data model, changes ripple throughout the application. The data model serves as the foundation of my application.

Now, as far as this project went, one of the important requirements was to deliver the new system incrementally, while leaving the old system running in parallel until it was completely replaced.

Parallel Data Models
This presented a bit of a dilemma for me, since the current database was … well … lacking, and I was planning to refactor it enough to make it a very unstable foundation for the old system. I wanted to refactor it for a number of reasons, including: missing primary keys, no foreign keys, no constraints, data fields which were required but not there, data fields which were there but not used, data fields containing two or more pieces of information, and tables which should have been multiple tables. Not to mention the desire to achieve a consistent naming convention without the insane column names using characters like '/' and '?' … seriously.

However, the parallel-systems requirement raised a real question: how do you manage parallel systems when one needs a stable foundation and the other is so temperamental that you don't want to touch it?

[Illustration: Parallel databases]

My options as I saw them were something like:

  1. Scrap the data model refactoring.
    This really didn't get much thought. Well, it did, but the thought was: is this the best route for the client? And if so, should I offer to help them find my replacement, or just leave? I definitely wasn't up for replacing one unmaintainable piece of junk with another.
  2. New data model and re-factor the existing app.
    The existing application was a total nightmare built in classic Access spaghetti code fashion. Just touching that looked like going down a rabbit hole of certain doom.
  3. New application on the old data model and refactor the data model later.
    This would have caused a real disconnect between the data model and the application. I'm not sure the data model and the application ever would have lined up properly. Not to mention the client's probable later decision not to complete that part of the project, since everything worked. This seemed like a very bad idea.
  4. Parallel databases with synchronization.
    Build a parallel data model for the new system, while leaving the old system as is. From a development point of view, this seemed like the best alternative, but keeping an active database in sync presented a serious, possibly unconquerable, challenge.

The final option of refactoring the data model immediately and basing all construction on a solid foundation was definitely the most appealing. But how do we keep the two databases in sync? I'm sure there are tools out there for that, but with a possibly dramatically different data model? With active live data? Even if such a tool existed, I doubt the price would have been within the project's budget**. And how would we bring concurrency conflicts back to the users who caused them?

Abstract Data Model
That's when I had the idea: why not just build an abstraction layer on the database? Why not manage the data all in one database while abstracting out the other data model? Why not build a simulated data model? Why not just redirect all my views and procs to the other database?

This was so bloody simple. Why hadn’t I ever heard of anybody else doing this before?

[Illustration: Parallel databases with abstraction layer]

So the plan was to refactor the data model and build a concrete database, but instead of having its stored procedures and views point at its own tables, as intended, they would point at the tables in the old database. All changes would proceed as usual; for example, if the client had a change request which required a new field in a table, it would be added to the physical table, the views and stored procedures would be updated, and the applications would change to accommodate it. And when the old system was completely replaced, all that would need to be done is redirect the DML to the new system's own tables. Even the data transition would be easier, since we'd already have views aggregating data in the expected format!
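
To make that concrete, here's a minimal sketch of the kind of redirection involved. All database, table, and column names below are hypothetical, not from the actual project; the point is simply that a view in the new database presents an old table in the new model's shape.

create view dbo.vw_order
as
	select	o.OrderNo		as order_id,
			o.CustRef		as customer_id,
			o.[Order/Date]	as order_date	-- old columns really did use characters like '/'
	from	OldSystemDb.dbo.Orders as o;

When the old system is finally retired, only the FROM clause and column mapping need to change to point at the new database's own tables; the applications built on the view never notice.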

I was pretty excited about this when I designed it and told a few developer friends, who thought it was either a stupid idea, problem-ridden, or pointless at best. Now, I do have a lot of stupid and pointless ideas, but I didn't feel like this was one of them.

Implementation Challenges
So how did I implement it?

Well, once the new data model was finished, I wrote the views and stored procedures as you might expect, but at that point you run into the following challenges:

  1. Required data missing from the existing database
    For example: a create date for products, so the business knows when a product was added to the system.
  2. Existing data in the old system requires new values.
    For example: an order has a boolean status field for 'pending' and 'completed', but the business requires statuses of 'pending', 'ordered', and 'shipped'.
  3. Non-existent data tables need to be simulated
    For example: let's say the business wants the user to be able to request product literature on an order alongside regular products; you'll need to simulate orders for the product literature ordered via the old system.

The non-existent data tables were easily simulated with a view. However, these often came with a performance penalty. This is one of the few cases where the new application needed minor modifications to work around it: different views were created for different situations, and the data access component would select the most appropriate view for the circumstance.
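
As a hedged illustration of the "simulated table" case, the product literature example might look something like the view below. Every name here is hypothetical; the idea is just that requests recorded somewhere in the old database are exposed as if they were rows in a table the old system never had.

create view dbo.vw_literature_order
as
	select	lr.RequestId	as literature_order_id,
			lr.OrderNo		as order_id,
			'LITERATURE'	as product_code,
			1				as quantity
	from	OldSystemDb.dbo.LiteratureRequests as lr;

Views like this are exactly where the performance penalty mentioned above tends to show up.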

Extension Tables
The missing required data and data changes (like status codes) were handled with extension tables.

So if I had a table named 'order', for example, I would create a new table called 'order_x' with a matching primary key column, plus columns for data that was required but missing and data which required changing. Then insert, update, and delete triggers would be added to the 'order' table so changes from the old system would keep the extension table up to date. And procs and views on the new system would join the two tables to present them as a cohesive unit.

If an existing field required new or changed values, the new values would be stored in a field in the extension table, and the update trigger on the main table would update the status when it changed from the old system. In situations where the data did not sync up one-to-one, column mapping rules were used. To extend the order status example: 'pending' in the old system is the same as 'pending' in the new system, but what about 'completed'? Is that 'ordered' or 'shipped'? It might be mapped so that if the old system updated the order to 'completed', the trigger would change the extension table to 'ordered', and if the new system updated the status to either 'ordered' or 'shipped', the 'order' table status would be updated to 'completed'.
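
Putting the extension table and trigger together, a minimal sketch might look like the following. The names, the boolean 'completed' flag, and the exact mapping are assumptions for illustration only; in practice the old table lived in the other database, the trigger would reference the extension table by its cross-database name, and matching insert and delete triggers would be needed as well.

create table dbo.order_x
(
	order_id	int			not null primary key,		-- matches [order]'s primary key
	create_date	datetime	not null default getdate(),	-- required by the new model, missing from the old one
	status		varchar(10)	not null default 'pending'	-- 'pending' / 'ordered' / 'shipped'
);
go

create trigger trg_order_u on dbo.[order]
after update
as
begin
	-- map the old system's boolean status onto the new status codes;
	-- 'completed' in the old system becomes 'ordered' in the new one
	update	x
	set		x.status = case when i.completed = 1 then 'ordered' else 'pending' end
	from	dbo.order_x as x
	join	inserted as i on i.order_id = x.order_id;
end
go

The new system's views and procs then simply join [order] to order_x and treat the pair as one logical table.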

The Dirty Data Problem
But the biggest problem was dirty data. This was a killer! This is the one challenge which plagued us throughout the entire project and knocked us off our schedule continuously. Because the old system, which imposed absolutely no restrictions on its users, was still being used, we were getting situations which never could have been predicted. This caused the application to act in unexpected ways, and even after making changes to accommodate the dirty data, we received endless support inquiries about unexpected behavior caused by null data and unexpected values.

There were changes to the application because of this as well. We actually had to change our business objects to set default enum values and make most properties nullable types, even though in the new data model they were not nullable. This doesn't affect input, but anywhere data was being read from the database, we had to accommodate it. These nullable types will not require changing when the old system is completely replaced, but they do represent a smell which I hope somebody will eventually eliminate.

Conclusion
Overall though, I'd say this strategy was an overwhelming success. Other than the dirty data issue, which still rears its head every now and again, there have been no problems since it was first deployed.

If you can avoid a parallel deployment, I would recommend doing so, but if you can't, I really think this strategy is a good one.

EDIT: After I posted this, it occurred to me that this strategy really cost almost nothing, since the biggest cost was in setting up the views to extract the data out of the old system in the expected format, which would have needed to be done when the data was moved to the new system anyway. The only real extra work was the extension tables and abstract procs, neither of which were very difficult once the mapping was established in the views. My colleague Ben Alabaster also pointed out that even if we had bought an overpriced sync tool, configuring it would have taken longer than setting up my solution.

This is post 3 of a 7-part series entitled Technical Achievements in my Last Project.

Credit: Thank you, Ben Alabaster, for the illustrations.


* I need a pretty good reason to build a data model and object model that are different. I have done it, but it's rare to have a compelling enough reason.

** At the time, I wasn't aware of any tools to do this. Karen Lopez was kind enough to let me know that TIBCO and Informatica may have done the job, but they are expensive. From what I can tell, these tools would have cost more than the strategy I implemented. Thanks, Karen.

Copyright © John MacIntyre 2010, All rights reserved

April 1, 2010 | Programming, SQL, SQL Server

6 simple steps to a stress free database change deployment

A few years ago, I was contracted to work on an internal web application in a large Fortune 500 company. Our app was constantly in use, had thousands of users, was fairly active, and had a rather large database (one table had over 300 million records).

I was fortunate enough to work beside a DBA who had, among other tasks, the responsibility of running deployment scripts against production. We talked occasionally, and he'd tell me about some of the deployment scripts he ran and the programmers he ran them for. He'd get programmers whose scripts never failed or reported errors, and who would actually request the output text to confirm the DBA had done his job. Then there were the programmers who would work against development, and when it was time to deploy, they'd attempt to remember all the subtle database changes made along the way and whack together an SQL script to repeat them. My friend was often rejecting scripts which had obviously never been run before, and sometimes he'd have to run expedited follow-up scripts because the first one was missing a necessary column or two.

Deployments can be a real headache at the best of times, but especially when schema updates to a production database are involved. Don't get me wrong, you usually have a backup to fall back on, but how long will that take to restore? … Really, you don't want to resort to the restore, have the database offline for that long, or have your name associated with it.

So gradually I evolved a process which has kept me sane and confident when deploying schema changes to production servers, even on large, sensitive, and active databases.

Here's my simple 6 step process for deployment sanity. (FWIW, I've never called it my '6 step process' before writing this post. 😉)

Step 1 – Create a central shared directory for all scripts

The first thing you do is store all database changes as scripts in a central shared directory.

I usually store the scripts in a directory structure like
\projects\ProjX\DeploymentScripts\dbname.production\
\projects\ProjX\DeploymentScripts\dbname.archive\deployed.YYMMDD\

Where the 'dbname.production' folder stores the currently pending scripts, and 'dbname.archive' stores previously deployed scripts, each deployment in its own dated subdirectory.

Step 2 – Create scripts for all changes

Any and all changes are scripted and stored. If you use a tool to adjust your database schema, then you export the generated SQL and run that exported SQL instead of running the tool directly against your database.

Keep scripts in order and don't change old scripts. If … ok, when you make a mistake, don't try to go back to an older script and change it; just write a new one to undo the mistake.

Each script is saved in the ‘dbname.production’ folder with the following naming convention:

dbname.##.user.descr.sql

where:

dbname : the database to deploy it to (mostly so you can see it in your Query Analyzer title bar)
## : the sequence number to deploy it in (just increment the highest number)
user : initials of the programmer who wrote the script
descr : a brief description of what the script does

Here are some examples:
HR.01.jrm.AdjustedTblEmployees_AddedStartDateColumn.sql
HR.02.jrm.AdjustedProc_sp_GetEmployees.sql
HR.03.jrm.AdjustedProc_sp_ActiveEmployees.sql
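
For what it's worth, the contents of a script like HR.01.jrm.AdjustedTblEmployees_AddedStartDateColumn.sql might be nothing more than the sketch below (the table and column names are made up for the example). The point isn't that the change is complicated; it's that even trivial changes get their own script so they can be replayed in order on every environment.

alter table dbo.Employees
	add StartDate datetime null;
go

-- backfill existing rows so the column can be tightened to NOT NULL in a later script
update	dbo.Employees
set		StartDate = HireDate
where	StartDate is null;
go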

Step 3 – Number your scripts

As you may have noticed in step 2, scripts are numbered in the order they are created, so object dependencies already exist when each script is run. Do you like dependent objects to exist? Yeah? … me too.

In case you're wondering, single digits are prefixed with '0' so they sort properly in Windows Explorer. And there's only room for two digits, since you will rarely, if ever, create more than 99 scripts before deploying.

Step 4 – Keep track of what has been deployed to where

Ok, so you’ve got your ‘dbname.production’ directory with 10 SQL files in there, you’ve got 3 developers each with their own copy of the database, a testing and/or business validation environment, and production.

Sooo … what’s been deployed to where? It’s pretty easy to lose track of what got deployed to validation last week. … Sorry … did we deploy 5? Or did we stop at 4?

I toyed with a few different ways of doing this, but finally settled on a very obvious and simple solution: keep an empty text file for each database, and update the number in its name each time scripts are run. Since I want them to sort intermingled with the script files, I use a similar naming convention:
dbname.##.zzz–DEPLOYED-TEST

Here are some examples mixed in with the previous examples:
HR.00.zzz–DEPLOYED-DevBox2
HR.00.zzz–DEPLOYED-DevBox3
HR.01.jrm.AdjustedTblEmployees_AddedStartDateColumn.sql
HR.02.jrm.AdjustedProc_sp_GetEmployees.sql
HR.02.zzz–DEPLOYED-TEST
HR.03.jrm.AdjustedProc_sp_ActiveEmployees.sql
HR.03.zzz–DEPLOYED-DevBox1

Notice how they’re sorted amongst the scripts and it’s obvious what got deployed? I could probably delete the entire text of this post, other than that sample, and you could figure out my deployment process. … But since I enjoy typing … I’ll continue. 😉

When you finish executing your scripts, you would increment the number to reflect the last script run on that database.

Step 5 – Execute scripts in order

Obviously.

Step 6 – Post deployment clean up

When the scripts are deployed to production, create a new deployed.YYMMDD subdirectory (with today's date) in the dbname.archive directory and move all the SQL files there from the 'dbname.production' directory. Then renumber the deployment marker files back to 00.

This can be incorporated into almost any database schema change deployment process, and isn't specific to SQL Server. After all, I started evolving this process while working on an Oracle database.

It has occurred to me that this may be an old school idea, and modern database schema diff tools may provide a way to avoid this kind of preparation. But unless you are working as a sole developer, I can't imagine a diff tool making your life easier than what I've outlined above.

I hope your next deployment is less of a nail biter.

Copyright © John MacIntyre 2009, All rights reserved

WARNING – All source code is written to demonstrate the current concept. It may be unsafe and not exactly optimal.

May 1, 2009 | Programming

How To Get The Most Frequently Used Column Values

Whenever I import external data, integrate with another database, or am new to a project, I need to get familiar with the database. The table schemas, relational integrity, and constraints are the first things I look at, and they take me a long way, but soon I need to know what the data looks like.

In an ideal world, relational integrity and database constraints would define and control this, and all I'd really need to do is look at those. But the reality is, in 15 years of working in this industry, most of the databases I've worked on that I didn't design have barely used constraints, and some haven't even used relational integrity fully!

The need to get a good feel for the data is even more prevalent when working with dirty data, or when refactoring poorly written applications to ensure the refactoring doesn't introduce other issues. I usually wind up writing the following query repeatedly:

Select	column_name, count(*)
From	table_name
Group by column_name
Order by count(*) desc, column_name

This little query often reveals inconsistencies between the data and the application: the application might set column X to possible values of 'A', 'B', 'C', 'D', or 'E', but in reality there may be zero 'C' and 'E' values in that column, while there are 6 'X's, 1 'Q', and a '?'. Or I may find that there are only 6 rows with data in that column, out of almost 3 million rows, indicating the column / application feature is unused.
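
For instance, plugging the AdventureWorks Production.Product table and its Color column into that template gives:

Select	Color, count(*)
From	Production.Product
Group by Color
Order by count(*) desc, Color

That's exactly the pattern the stored procedure below automates for every column at once.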

Anyway, yesterday I finally wrote a little stored procedure which will print out the most frequent N values in each column for a specified table.

/*
Purpose : Retrieves the top N most frequent values in 
        each column of the specified data table.
Copyright © Jaten Systems Inc. 2009, All rights reserved
http://www.jatensystems.com/
https://whileicompile.wordpress.com/
http://www.johnmacintyre.ca/
*/
create proc jaten_diag_GetTableContents
(@tblName nvarchar(200),
@rowCount int)
as
begin
	declare @colId int;
	declare @colName nvarchar(100);
	declare @sql nvarchar(2048);
	declare @tblRowCount int;

	-- cursor to get columns
	DECLARE curColumns CURSOR FAST_FORWARD 
			FOR	SELECT	colid, [name] 
					FROM	syscolumns
					where	id=object_id(@tblName)
					order by colid
	open curColumns;

	-- get table row count
	set	@sql = 'select @retCount=count(*) from ' + @tblName;
	exec sp_executeSQL @sql, N'@retCount int OUTPUT', @retCount = @tblRowCount OUTPUT;

	-- print table header
	print '';
	print '---------------------------------';
	print '--- ' + @tblName;
	print '--- Row count : ' + cast(@tblRowCount as nvarchar)
	print '---------------------------------';

	-- get info for each column
	fetch next from curColumns into @colId, @colName;
	while 0 = @@fetch_status
	begin
			-- print column header
			print '';
			print '---------------------------------';
			print '--- Column [' + cast(@colId as nvarchar) + '] - ' + @colName + ' ---';

			-- compile & execute grouping sql
			select	@sql = 'select	top ' + cast(@rowCount as nvarchar) 
							+ '		count(*) as [count], '
							+ '		cast(((count(*) * 100)/' + cast( @tblRowCount as nvarchar) + ') as nvarchar) + ''%'' as [Percentage], ' 
							+ '		[' + @colName + '] as [col_value] ' 
							+ 'from	' + @tblName + ' ' 
							+ 'group by [' + @colName + '] ' 
							+ 'order by count(*) desc, [' + @colName + ']';
			exec sp_executeSQL @sql;
			--print @sql;

			-- next
			fetch next from curColumns into @colId, @colName;
	end

	-- clean up
	close curColumns;
	deallocate curColumns;
end

Please note 2 things:
1. You need to run it with the 'Results to Text' or 'Results to File' setting.
2. The table name parameter will need square brackets if the table name uses unconventional characters.

If you create it and run it in AdventureWorks on the ‘Production.Product’ table

exec jaten_diag_GetTableContents 'Production.Product', 5

… you will get these results:

---------------------------------
--- Production.Product
--- Row count : 504
---------------------------------

---------------------------------
--- Column [1] - ProductID ---
count       Percentage  col_value
----------- ----------- -----------
1           0%          1
1           0%          2
1           0%          3
1           0%          4
1           0%          316

(5 row(s) affected)

....

---------------------------------
--- Column [6] - Color ---
count       Percentage  col_value
----------- ----------- -----------
248         49%         NULL
93          18%         Black
43          8%          Silver
38          7%          Red
36          7%          Yellow

(5 row(s) affected)

....

Notice how the Color column reveals that almost half of the products do not have a color set? This could mean the column isn't relevant, or that this data has a problem being maintained. But also notice how unique columns, like ProductID, will obviously return meaningless results.

The AdventureWorks database is a very clean database, so this example is a bit contrived, but in the real world, there are plenty of databases where this little procedure will allow you to get some insight into the data.

How do you get familiar with new data?

Copyright © John MacIntyre 2009, All rights reserved

WARNING – All source code is written to demonstrate the current concept. It may be unsafe and not exactly optimal.

February 8, 2009 | Code, SQL Server

How To Write Dynamic SQL AND Prevent SQL Injection Attacks

One of my pet peeves is when general rules are taken as gospel, and declared as the only acceptable practice regardless of the circumstance.

One of the big ones is dynamic SQL. There's a heck of a good reason for the rule, and it's called a SQL injection attack; if you are not familiar with it, I would strongly urge you to leave this post right now and read up on it.

Anyway, dynamic SQL is not inherently evil; it's the appending of user-entered text that is evil. Appending user-entered text is just lazy and can be easily avoided with parameterization.

The trick is to create dynamic SQL with parameters.

Interestingly, I've never seen anybody else do this. I am constantly hearing people recommend stored procedures … even when they are clearly not flexible enough to meet the required functionality. Don't get me wrong, stored procedures have a lot of benefits, but flexibility isn't one they're known for.

And now for some code …

I created a console app which queries the SQL Server sample database AdventureWorks. The following static method was added to the Program class.

public static int GetOrderCount(string productPrefix, SqlConnection cn)
{
	// initialize SQL
	string starterSql = "SELECT count(*) FROM Production.Product";
	StringBuilder sbSql = new StringBuilder(starterSql);

	// add parameters
	if( !String.IsNullOrEmpty(productPrefix))
		sbSql.Append( " where [name] like @namePrefix");

	// initialize the command
	SqlCommand cmd = new SqlCommand(sbSql.ToString(), cn);
	if (cmd.CommandText.Contains("@namePrefix"))
		cmd.Parameters.AddWithValue("@namePrefix", productPrefix + "%");

	// get count
	return Convert.ToInt32( cmd.ExecuteScalar());
}

Basically, the function returns the number of products whose name starts with a given prefix.

The strength of doing this via dynamic SQL is we only need to filter on the product name when a valid prefix parameter is passed in. So, if the optional parameter (productPrefix) exists and is valid, the filter condition is added to the SQL and the parameter is added to the SqlCommand object.

In this overly simplified example, we could manage the same thing by just setting the productPrefix variable to the '%' wildcard, but then we'd be filtering for nothing. Not to mention things might be a little more difficult if the operator were 'equals' instead of 'like', or if there were multiple optional parameters. Creating SQL dynamically means we don't need to write some funky kludge, and our SQL stays nice, simple, and does minimal work.
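
Incidentally, the same pattern isn't limited to client code. If you ever need to build the statement inside the database itself, T-SQL's sp_executesql accepts a parameter list too. Here's a rough sketch of the equivalent (this isn't from the original app; the variable names and the LIKE filter are just illustrative):

declare @sql nvarchar(max);
declare @namePrefix nvarchar(100);

set @namePrefix = N'a';		-- would normally come from the caller; may be null or empty
set @sql = N'SELECT count(*) FROM Production.Product';

if @namePrefix is not null and @namePrefix <> N''
begin
	set @sql = @sql + N' WHERE [name] LIKE @namePrefix';
	set @namePrefix = @namePrefix + N'%';
end

-- the user-supplied value always travels as a parameter, never as appended text
exec sp_executesql @sql, N'@namePrefix nvarchar(100)', @namePrefix = @namePrefix;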

To execute my function, I added the following code to the Main(…) method.

// get total count
Console.WriteLine( "There are {0} products in total.", 
			Program.GetOrderCount( null, cn));

// get totals for various prefixes
string[] prefixes = new string[6] { "a", "b", "c", 
						"'; drop table Production.Product;--", 
						"d", "e" };
foreach(string prefix in prefixes)
{
	Console.WriteLine("There are {0} products"
			+ " prefixed with '{1}'.",
			Program.GetOrderCount(prefix, cn), prefix);
}

First we call GetOrderCount(…) without a name prefix to test it without the parameter, then we traverse the array of possible prefixes (this would be the user-entered data in a real app). Notice the fourth item? Pretty menacing, eh? Don't worry, it's safe.

Here are the results:

There are 504 products in total.

There are 3 products prefixed with 'a'.
There are 4 products prefixed with 'b'.
There are 12 products prefixed with 'c'.
There are 0 products prefixed with ''; drop table Production.Product;--'.
There are 3 products prefixed with 'd'.
There are 9 products prefixed with 'e'.

Notice the ‘d’ and ‘e’ prefixes were searched, and items found, proving the ‘drop table’ statement was not injected into the command.

You'd be surprised how much I use this. Many of my objects have a static GetList(…) method, and this method usually has multiple overloads. Keeping with the DRY principle, I prefer to keep all my logic in one place, so there will usually be one overload with every possible filter parameter, and all the other overloads will just call that one. Surprisingly, the overload with the code isn't overly complex; it's actually pretty simple.

What do you think? Will you use parameterized dynamic SQL in the future?

Copyright © John MacIntyre 2009, All rights reserved

WARNING – All source code is written to demonstrate the current concept. It may be unsafe and not exactly optimal.

February 5, 2009 | C#, Code, Security, SQL
