This is post 3 from a 7 part series entitled Technical Achievements in my Last Project.
Normally, when I build a new system, I design the new data model based on the requirements, and build my business objects and data access, based primarily on a that data model*. The remainder of the application is built on the components beneath it, so when you change something at the bottom, like the data model, changes ripple throughout the application. The data model serves as the foundation of my application.
Now as far as this project goes, one of the important requirements was to deliver the new system incrementally, while leaving the older system to run in parallel until completely replaced.
Parallel Data Models
This presented a bit of a dilemma for me since the current database was … well … lacking, and I was planning to refactor it enough to make it a very unstable foundation for the old system. I wanted to refactor it for a number of reasons including; missing primary keys, no foreign keys, no constraints, data fields which were required but not there, data fields which were there but not used, data fields containing 2 or more pieces of information, and tables which should have been multiple tables. Not to mention the desire to achieve a consistent naming convention without the insane column names using characters like ‘/’ and ‘?’ … seriously.
However the parallel systems requirement caused a bit of a dilemma. I mean, how do you manage parallel systems, one of which needs a stable foundation, and the other is so temperamental that you don’t want to touch it.
- Scrap the data model refactoring.
This really didn’t get much thought. Well it did, but the thought was, is this the best route for the client? And if so, should I offer to help them find my replacement or just leave? I definitely wasn’t up for replacing one unmaintainable piece of junk for another.
- New data model and re-factor the existing app.
The existing application was a total nightmare built in classic Access spaghetti code fashion. Just touching that looked like going down a rabbit hole of certain doom.
- New application on the old data model and refactor the data model later.
This would have caused a real disconnect between the data model and the application. I’m not sure if the data model and application ever would have lined up properly. Not to mention the clients probable later decision of not completing that part of the project since everything worked. This seemed like a very bad idea.
- Build a parallel data model for the new system, while leaving the old system as is.
From a development point of view, this seemed like the best alternative, but keeping an active database in sync presented a serious, possibly unconquerable, challenge.
The final option of refactoring the data model immediately and basing all construction on a solid foundation was definitely the most appealing. But how do we keep it in synch? I’m sure there are tools out there for that, but with a possibly dramatically different data model? With active live data? Even if there are tools, I doubt the price would have been within the project’s budget**. And if it did exist how would we bring concurrency issues back to the users who caused the conflict?
Abstract Data Model
That’s when I had the idea; Why not just build an abstraction layer on the database? Why not manage the data all in one database while abstracting out the other data model? Why not build a simulated data model? Why not just redirect all my views and procs to the other database?
This was so bloody simple. Why hadn’t I ever heard of anybody else doing this before?
So the plan was to refactor the data model, build a concrete database, and instead of having stored procedures and views pointing to the tables like it was meant to, they would point to the tables in the other database. All changes would proceed as usual, for example; if the client had a change request which required a new field in a table, it would be added to the physical table, views and stored procedures would be updated, and the applications would change to accommodate. And when the old system was completely replaced, all that would need to be done, is to rewrite the DML to direct to the current system. Even the data transition would be easier since we’d already have views aggregating data in the expected format!
I was pretty excited about this when I designed it and told a few developer friends, who thought it was either stupid idea, problem ridden, or pointless at best. Now I do have a lot of stupid and pointless ideas, but didn’t feel like this was one of them.
So how did I implement it?
Well once the new data model was finished, I wrote the views and stored procedures, as you might expect, but at this point you run into the following challenges:
- Required data missing from the existing database
For example; A create date for products so business knows when a product was added to the system.
- Existing data in old system requires new values.
For example; An order has a boolean status field for ‘pending’ & ‘completed’, but business requires status’s to be changed to ‘pending’, ‘ordered’, ‘shipped’
- Non-existing data tables need to be simulated
For example; Lets say business wants the user to be able to request product literature on the order with regular products, you’ll need to simulate orders for product literature ordered via the old system.
The non-existing data tables were easily simulated with a view. However, these often came with a performance penalty. This is one of the few cases where the new application needed minor modifications to get around. Basically, different views were created for different situations, and the data access component would select the most appropriate view based on the circumstance.
The missing required data and data changes (like status codes) were handled with extension tables.
So if I had a table named ‘order’ for example, I would create a new table called ‘order_x’, with a matching primary key column, plus columns for data that was required but missing, and data which required changing. Then insert, update, and delete triggers would be added to the ‘order’ table so changes from the old system would keep the extension table up to date. And procs and views on the new system would join the 2 tables to represent it as a cohesive unit.
If the current fields required value changes and/or new values, the new values would be stored in a field in the extension tables, and the update trigger on the main table would update the status when it changed from the old system. In situations where the data did not synch up 1 to 1, certain column mapping rules would be used. To extend on the order status example; ‘pending’ in the old system is the same as ‘pending’ in the new system, but what about ‘completed’? Is that ‘ordered’ or ‘shipped’? It might be mapped so if the old system updates the order to ‘completed’, it would change the extension table to ‘ordered’, and if the new system updated the status to either ‘ordered’ or ‘shipped’, the ‘order’ table status would be updated to ‘completed’.
The Dirty Data Problem
But the biggest problem was dirty data. This was a killer! This is the one challenge which plagued us throughout the entire project and knocked us off our schedule continuously. Because the old system was still being used, which offered the users absolutely no restrictions; we were getting situations which never could have been predicted. This was causing the application to act in unexpected ways, and even after making changes to accommodate the dirty data, we received endless support inquiries on unexpected behavior caused by null data and unexpected values.
There were changes to the application based on this as well. We actually had to change our business objects to set default enum values and make most properties nullable types, even though in the new data model, they were not nullable. This doesn’t effect input, but anywhere that data was being read from the database, we had to accommodate it. These nullable types will not require changing when the old system is completely replaced, but they do represent a smell which I hope somebody will eventually eliminate.
Overall though, I’d say this strategy was an overwhelming success. Other than the dirty data issue, which still rears it’s head every now and again, there have been no problems since it was first deployed.
If you can get away from a parallel deployment, I would recommend doing so, but if you can’t, I really think this strategy is a good one.
EDIT: After I posted this, it occurred to me that this strategy really cost almost nothing, since the biggest costs were in the setting up the views to extract the data out of the system in the expected format, which would have needed to be done when the data was moved to the new system anyway. The only real extra work was the extension tables and abstract procs, neither of which were very difficult once the mapping was established in the views. My colleague Ben Alabaster also pointed out that even if we bought an overpriced synch tool; configuration of the tool would have taken longer to setup than my solution.
This is post 3 from a 7 part series entitled Technical Achievements in my Last Project.
Credit-Thank you Ben Alabaster for the illustrations.
* I need a pretty good reason to build a data model and object model that are different. I’ve have done it, but its rare to have a compelling enough reason.
** At the time I wasn’t aware of any tools to do this. Karen Lopez was kind enough to let me know that TIBCO & Informatica may have done the job, but are expensive. From what I can tell, these tools would have been more expensive than the strategy I implemented. Thanks Karen.
Copyright © John MacIntyre 2010, All rights reserved
When I finish modeling my database, I like to just let it sit a couple days, then spend some time just reviewing it to check for inconsistencies. The kind of inconsistencies you never notice when you’re up to your eyeballs in details, but drive you up the wall after 5 years of maintenance.
When I finished my database design a few days ago, I jotted a few things down to remember to do. But when I transcribed them into my bug tracker, where I manage all my tasks, 3 things turned into 5, then 7 and I realized if I could just add a few more, I’d have an infamous ‘Top 10′ list.
Anyway here it is. It’s not exactly comprehensive, but it’s a start.
10 things to review after finishing your data model:
- Naming consistencies
- Column default consistencies
- Identity / auto-number technique exists
- Constraint consistencies
- Foreign Key relationships exist
- Indexes on Foreign Key columns
- Abbreviations are consistent
- Abbreviations are documented
- Data type and size consistencies. (For example; TableA.UserName is nvarchar(N), so TableB.UserName should be nvarchar(N) as well, not nvarchar(N±X))
- Review all requirements again to reconfirm everything was covered
I am building a web app with Ben Alabaster, and one of the requirements is for the user to be able to flag items for moderators. So the user can flag entity A, entity B, entity C, etc…
So I created a single flag table.
Which I then tried to tie it to the entity tables, hoping for something like
Where all the foreign key relationships were from [flag].[entity_id] to [EntityX].[id]
Then when I wanted the top 10 flags from a particular entity (B in this case), I could run a query like
select top 20 e.[name], count(*) "count"
from entityB as e
left join flag as f
on f.entity_id = e.id
group by e.[name]
order by count(*) desc
Unfortunately, if you were to create the above table relationship, and run the following inserts
insert into EntityA( id, name) values (1, 'EntityA');
insert into EntityB( id, name) values (2, 'EntityB');
insert into EntityC( id, name) values (3, 'EntityC');
insert into EntityD( id, name) values (4, 'EntityD');
The following statement
insert into flag(entity_id, flag_reason) values(5, 'Testing without a valid FK value.');
would fail as expected, as expected, with the following error. “The INSERT statement conflicted with the FOREIGN KEY constraint “FK_flag_EntityA”. The conflict occurred in database “test”, table “dbo.EntityA”, column ‘id’.”
insert into flag(entity_id, flag_reason) values(1, 'Testing the FK to entity A.');
would also fail, which was undesired, with the following error: “The INSERT statement conflicted with the FOREIGN KEY constraint “FK_flag_EntityB”. The conflict occurred in database “test”, table “dbo.EntityB”, column ‘id’.” +
So, my options with regards to referential integrity are :
- Ditch the referential integrity, which I am vehemently opposed to. ++
- Create multiple flag tables, each with the exact same schema, but a different Foreign Key relationship, which just seems wrong.
- Managing referential integrity via triggers.
While I’m not a big fan of triggers, the ‘Managing referential integrity via triggers.’ option seems like the only tolerable one. So I added the [entity_type] column to my flag table.
Removed the relationships
And wrote the following trigger to manage the foreign key relationship.
-- Description: maintain referential integrity on
-- a column which is a FK for different tables
CREATE TRIGGER flag_entity_id_fk
declare @entity_type char(1);
declare @entity_id int;
declare @cnt int;
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- get info
-- check if records exist
if 'A' = @entity_type
else if 'B' = @entity_type
else if 'C' = @entity_type
else if 'D' = @entity_type
-- records exist? exit
if 0 < @cnt
-- no? error
raiserror( 'Unable to find foriegn key match on entity type ''%s'', id ''%d''.', 16, 1, @entity_type, @entity_id);
Now, when you run
insert into flag(entity_type, entity_id, flag_reason) values('B', 5, 'Testing without a valid FK value.');
The trigger doesn’t find a match in the appropriate table, rolls back the insert, and gives you a descriptive error message.
Unable to find foriegn key match on entity type ‘B’, id ’5′.
However, a good value is accepted.
insert into flag(entity_type, entity_id, flag_reason) values('B', 2, 'Testing without a valid FK value.');
I’m still not happy with this approach, but it does seem to be the lesser of all the evils. Please let me know with a comment if there is another option I’ve overlooked. Thanks.
* Frankly I was surprised it even compiled.
+ Unless of course you were unfortunate enough to test this in a coincidental situation where all tables happened to contain the id of every test you ran.
++ Yes ‘vehemently’
EDIT (11/10/2009) : It just occurred to me that this article does not take into account what would happen if the entity tables were to delete a row which this table was pointing to. When I designed my tables this was taken into account, but since we are not planning to allow actual deletions, it was left out. However, if you were to implement this strategy, where entities could be deleted, a delete trigger would need to be created for each entity table.
Whenever I import external data, integrate to another database, or am new to a project, I need to get familiar with the database. The table schemas, relational integrity, and constraints are the first thing I look at and take me a long way, but soon I need to know what the data looks like.
In an ideal world, relational integrity and database constraints would define control this, and all I’d really need to do is look at those. But the reality is, in 15 years of working in this industry, most of the databases I’ve worked on, that I didn’t design, have barely used constraints and some haven’t even used relation integrity fully!
The need to get a good feel of the data is even more prevalent when working with dirty data, or when refactoring poorly written applications to ensure any refactoring doesn’t introduce other issues. I will usually wind up writing the following query repeatedly:
Select column_name, count(*) From table_name Group by column_name Order by count(*) desc, column_name
This little query often reveals; inconsistencies between data and the application, where an application sets column X to possible values of ‘A’, ’B’, ‘C’, ‘D’, or ‘E’, but in reality, there may be zero ‘C’ and ‘E’ values in that column, but there is 6 ‘X’s, 1 ‘Q’, and an ‘?’. Or I may find that there are only 6 rows with data in that column, out of almost 3 million rows, indicating the column / application feature is unused.
Anyway, yesterday I finally wrote a little stored procedure which will print out the most frequent N values in each column for a specified table.
/* Purpose : Retrieves the top N most frequent values in each column of the specified data table. Copyright © Jaten Systems Inc. 2009, All rights reserved http://www.jatensystems.com/ http://whileicompile.wordpress.com/ http://www.johnmacintyre.ca/ */ create proc jaten_diag_GetTableContents (@tblName nvarchar(200), @rowCount int) as begin declare @colId int; declare @colName nvarchar(100); declare @sql nvarchar(2048); declare @tblRowCount int; -- cursor to get columns DECLARE curColumns CURSOR FAST_FORWARD FOR SELECT colid, [name] FROM syscolumns where id=object_id(@tblName) order by colid open curColumns; -- get table row count set @sql = 'select @retCount=count(*) from ' + @tblName; exec sp_executeSQL @sql, N'@retCount int OUTPUT', @retCount = @tblRowCount OUTPUT; -- print table header print ''; print '---------------------------------'; print '--- ' + @tblName; print '--- Row count : ' + cast(@tblRowCount as nvarchar) print '---------------------------------'; -- get info for each column fetch next from curColumns into @colId, @colName; while 0 = @@fetch_status begin -- print column header print ''; print '---------------------------------'; print '--- Column [' + cast(@colId as nvarchar) + '] - ' + @colName + ' ---'; -- compile & execute grouping sql select @sql = 'select top ' + cast(@rowCount as nvarchar) + ' count(*) as [count], ' + ' cast(((count(*) * 100)/' + cast( @tblRowCount as nvarchar) + ') as nvarchar) + ''%'' as [Percentage], ' + ' [' + @colName + '] as [col_value] ' + 'from ' + @tblName + ' ' + 'group by [' + @colName + '] ' + 'order by count(*) desc, [' + @colName + ']'; exec sp_executeSQL @sql; --print @sql; -- next fetch next from curColumns into @colId, @colName; end -- clean up close curColumns; deallocate curColumns; end
Please note 2 things :
1. You need to run it with ‘Results to Text’ or ‘Results to File’ setting.
2. The table parameter will need square brackets if the table name uses unconventional characters.
If you create it and run it in AdventureWorks on the ‘Production.Product’ table
exec jaten_diag_GetTableContents 'Production.Product', 5
… you will get these results
— Row count : 504
— Column  – ProductID —
count Percentage col_value
———– ——————————- ———–
1 0% 1
1 0% 2
1 0% 3
1 0% 4
1 0% 316
(5 row(s) affected)
— Column  – Color —
count Percentage col_value
———– ——————————- —————
248 49% NULL
93 18% Black
43 8% Silver
38 7% Red
36 7% Yellow
(5 row(s) affected)
Notice how the Color column reveals that almost half of the products do not have a color setting? This could imply relevancy or this data possibly has a problem being maintained. But also, notice how unique columns will obviously provide meaningless data.
The AdventureWorks database is a very clean database, so this example is a bit contrived, but in the real world, there are plenty of databases where this little procedure will allow you to get some insight into the data.
How do you get familiar with new data?
Copyright © John MacIntyre 2009, All rights reserved
WARNING – All source code is written to demonstrate the current concept. It may be unsafe and not exactly optimal.