shannonlowder
Shannon Lowder
213 posts
Look no further for expertise in: Business Analysis to gather the business requirements for the database; Database Architecture to produce the logical design of the database; Database Development to build the objects needed by the business logic; and finally, Database Administration to keep the database running in top form and to make sure there is a disaster recovery plan.
shannonlowder · 13 years ago
Text
Changing Job and Database Owners to SA
One of the services we're trying to flesh out a bit more over at Hive Mind is our standard health check and resolution script.  Every DBA has his or her own set of scripts they carry around, either in a file, or a link to where to find it online.  Merging the best of all these scripts is difficult when you consider all the great resources within the SQL community.  One of the tasks addressed during a SQL Server health check is ownership of objects.
Whenever you have jobs owned by users, you run the risk of that account being removed, disabled, or corrupted in some way.  When that happens, your job is going to start failing you.  In a similar vein, database ownership gives the owner a great deal of permission, probably more than is really needed to get the job done.  We can address both of these issues with some simple scripting.
In general, I'll change job owners to SA without too much worry.  I still generate a rollback script and hold onto it until well after the change, just in case.  Database ownership is a bit touchier.  Users could have built code that will break if you suddenly remove their ownership and don't give the affected users explicit permission to do their jobs!
My suggestion: work with your developers to determine the minimum level of permission needed, for both the users and any code built and running against the database.  Generate scripts to grant those permissions and apply them first.  Once you've worked out all the bugs in the permission change, change the database owners to SA.
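As a rough sketch of what that first pass might look like, here's the shape of the grants you might apply before the ownership change. The database name and the [AppService] login are made-up placeholders, and the real list of grants should come out of that conversation with your developers.
USE YourDatabase;  -- hypothetical database name
GO
-- [AppService] is a placeholder for a login the code or users connect with;
-- it is not the current database owner.
CREATE USER [AppService] FOR LOGIN [AppService];
EXEC sp_addrolemember 'db_datareader', 'AppService';
EXEC sp_addrolemember 'db_datawriter', 'AppService';
GRANT EXECUTE TO [AppService];  -- only if the code calls stored procedures
GO
-- Once the explicit grants have been tested, transfer ownership.
ALTER AUTHORIZATION ON DATABASE::YourDatabase TO sa;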
Now for the scripts!
Changing Job Owners
/******************************************************************************
Description: Script that change the Jobs Owner to SA. This Script will identify
             the non SA jobs, then create a rollback and a roll forward script
             that will change their owner as SA
Compatibility : 2005 +
Input: Folder where rollback and roll forward scripts can be stored
*******************************************************************************/
--internal variables
DECLARE @ScriptList TABLE (
      ScriptType VARCHAR(20)
    , Script VARCHAR(MAX))

--rollback scripts
INSERT INTO @ScriptList
SELECT 'rollback' as ScriptType
     , '-- Revert job: [' + j.name + '] to owned by: [' + sl.name + ']'
       + CHAR(13) + CHAR(10)
       + 'EXEC msdb..sp_update_job @job_id = ''' + CONVERT(VARCHAR(36), j.job_id) + ''''
       + ', @owner_login_name = ''' + sl.name + '''' as Script
FROM msdb.dbo.sysjobs j
LEFT OUTER JOIN sys.syslogins sl ON j.owner_sid = sl.sid
WHERE j.enabled = 1
  AND sl.name <> SUSER_SNAME(0x01);

--roll forward scripts
INSERT INTO @ScriptList
SELECT 'rollforward' as ScriptType
     , 'EXEC msdb..sp_update_job @job_id = ''' + CONVERT(VARCHAR(36), j.job_id) + ''''
       + ', @owner_login_name = ''SA''' as Script
FROM msdb.dbo.sysjobs j
LEFT OUTER JOIN sys.syslogins sl ON j.owner_sid = sl.sid
WHERE j.enabled = 1
  AND sl.name <> SUSER_SNAME(0x01);

--you can either BCP the following results to a rollback file
--or you can run the script in SSMS with ctrl+shift+F and save the results to a file
SELECT * FROM @ScriptList WHERE ScriptType = 'rollback';

--Once you've saved your rollback script, save your roll-forward script
SELECT * FROM @ScriptList WHERE ScriptType = 'rollforward';
Changing Database Owners
/******************************************************************************
Description: Script that change the Database Owner to SA. This Script will
             identify the non SA owned databases, then create a rollback and
             a roll forward script that will change their owner as SA
Compatibility : 2005 +
Input: Folder where rollback and roll forward scripts can be stored
*******************************************************************************/
--internal variables
DECLARE @ScriptList TABLE (
      ScriptType VARCHAR(20)
    , Script VARCHAR(MAX))

INSERT INTO @ScriptList
SELECT 'rollback' as ScriptType
     , '-- Revert database: [' + d.name + '] to owned by: [' + SUSER_SNAME(owner_sid) + ']'
       + CHAR(13) + CHAR(10)
       + 'ALTER AUTHORIZATION ON DATABASE::' + d.name + ' to [' + SUSER_SNAME(owner_sid) + '];'
FROM sys.databases d
WHERE SUSER_SNAME(owner_sid) <> SUSER_SNAME(0x01);

INSERT INTO @ScriptList
SELECT 'rollforward' as ScriptType
     , 'ALTER AUTHORIZATION ON DATABASE::' + d.name + ' to SA;'
FROM sys.databases d
WHERE SUSER_SNAME(owner_sid) <> SUSER_SNAME(0x01);

--you can either BCP the following results to a rollback file
--or you can run the script in SSMS with ctrl+shift+F and save the results to a file
SELECT * FROM @ScriptList WHERE ScriptType = 'rollback';

--Once you've saved your rollback script, save your roll-forward script
SELECT * FROM @ScriptList WHERE ScriptType = 'rollforward';
Conclusion
One of the things we're looking to do with this script is bundle it up with either PowerShell, SQLDiag, or some other tool, so we can simply deploy this script against a new server, and collect all these scripts automatically, with very little fuss or muss.
Do you have any suggestions for improvements to this script? Do you have a standard script you'd like to share? If so, let us know!
Original Article
shannonlowder · 13 years ago
Text
Getting Back to Blogging
I'm sure you've noticed by now my job posting blogs have taken over. They were supposed to fill in on Tuesdays and Thursdays while I was supposed to continue blogging on Monday, Wednesday, and Friday. That hasn't worked out. Late last year I began building Hive Mind Consulting. That's taken up a lot of time!
Well, after a much needed vacation, I'm reading over several bloggers' "standard schedule" posts and trying to figure out a new schedule for myself. I'm working on finding time to fit in a full time day job, building a consulting business, blogging, studying more, and giving back to the community more than I have been the first half of this year. I know some things will have to be a smaller commitment than I gave previously. But I want to find that balance and give 100% all the time.
That's just who I am.
Now, let's work on this schedule! If you have any requests for a blog entry, send them in. I'm always looking for more ideas on what to cover here and on the new Hive Mind blog too.
Original Article
shannonlowder · 13 years ago
Text
Why You Should Use a Remote DBA
Many companies are looking for database administrators today. A quick scan of a few job boards turned up nearly 800 DBA roles listed. After working with several smaller companies that didn't know they needed a DBA, I think the number of companies needing a DBA is significantly higher than that. While all of these companies have a need for DBA services, few of them actually need a permanent DBA on staff.
You aren't in the business of running a database
These companies are in the business of building widgets, or counting sales figures, or filling out TPS reports. Basically, they make money by doing things other than managing that database.
If they bring on a full time DBA, that will force them to take their focus off their core business, and focus on managing the new employee. That time and energy would be better spent focusing on what makes them money. To be honest, few of these companies know how to manage their databases, let alone a database administrator.
At Hive Mind Consulting, I've worked to build a team that does nothing more than concentrate on making SQL Server work more effectively for companies. That's our business. It's what drives us. Let us help you manage your databases. We can share our years of experience with you, so you don't have to dilute your time learning to manage your databases on your own!
That frees you up to drive your business.
  You want lower costs
Let's say you hire the median DBA: just average in every way, not a rockstar or an Ozar, but an average DBA. That'll cost you about $87,490. But then you have to cover Social Security, part of their health insurance, time off, and retirement. You have to buy him or her a computer to work with. You have to buy space for them to work in. You're looking at more than $125,000 to hire that median DBA.
Working with a remote or outsourced DBA saves you all those extra costs, right off the bat, an instant $37,000 in savings. We already have our machines and processes in place to help you maintain your SQL Servers. You don't have to buy those machines, pay for the real estate for our cubicles, and you don't have the additional costs of benefits and taxes!
We can also work with you to figure out how much, or little you need to manage your servers. At Hive Mind Consulting, we can provide you multiple tiers of service, each one tailored to how your business works, and how you want your database servers to run for you! These packages provide you access to our team of database administrators. That's right, working with Hive Mind means you're getting a whole team of administrators, not just one! Our basic level of support provides you a team of DBAs for less than the cost of a single DBA on-site!
You want the best talent
One of the core values we have at Hive Mind Consulting is improvement. We're always pushing ourselves to be better than we were the day before. We do that by studying SQL Server constantly. We go out of our way to learn new things about what SQL Server can do, and how to apply these techniques to new situations. This time and effort can cost money. While a majority of our training is free, some of it can be very costly.
At Hive Mind Consulting, we make that investment - our clients don't have to. As a result, our clients get access to the best talent without having to cover those costs. Our talent is already senior level, we're pushing them all into complete mastery. Why would you want anything less?
If you're considering hiring a database administrator, and you want to make sure you're getting the most for your money, call us at 704.hive.911, or email us at [email protected]. We can craft a solution from full service DBA, to on-demand consulting. Let us help you find the best solution for your company.
Original Article
shannonlowder · 13 years ago
Text
Why should we migrate our ETL from DTS to SSIS?
I know I just threw a bunch of technical acronyms at you at once, but this is a topic that comes up more and more as people finally move off of SQL 2000 and onto a currently supported version.  Companies making this move will have a lot of old Data Transformation Services (DTS) packages that move data into or out of their databases (Extract, Transform, and Load, or ETL).  These packages have been around for years, and no one wants to fix something that isn't broken.
But as soon as you upgrade to 2008 or later, DTS packages can break.
Limited DTS Functionality on 64-bit Operating Systems
When is the last time you bought a server with a 32-bit processor?  It's been a while, hasn't it?  Beginning with SQL 2008, you can only run DTS packages (or SSIS packages that are just wrappers for DTS packages) in 32-bit mode.  While not a show stopper, limiting your ETL processes to 32-bit mode has performance implications: you will be limited in the amount of memory your package can access, and when dealing with a lot of data, the more memory you have, the better your package will respond.
Design Support for DTS packages is non-existent in SQL 2008+
If you need to maintain your DTS packages after installing SQL 2008 or later, you're going to have to install SQL Server 2005 on a development machine, and then install an extra feature pack for that instance so you'll be able to work on those packages.  Kind of odd that you'd need to keep SQL Server 2005 around in order to support a later edition of SQL, wouldn't you say?
Speaking of designing and maintaining DTS packages going forward…
There are fewer database professionals with DTS than there were last year
One of the cool analytics LinkedIn provides is insight into skills that are becoming more or less popular in the work force.  At the beginning of this year, LinkedIn announced their findings for 2011: a 19% decrease in the number of professionals listing DTS as a skill.  There are currently just over 18,000 professionals on LinkedIn who can work on your DTS packages.  Now that Microsoft has officially ended support for DTS packages as of SQL 2012, that number can only drop.  As those numbers drop, supply and demand dictates you're going to have to pay more and more for the professionals who remain.
Why not let Hive Mind Consulting help you convert your ETL processes to SSIS? We can help you realize performance gains by using your servers’ full capacities.  We can help you design and maintain newer SSIS packages to replace your current DTS packages.  We can provide training to your staff on how to support and develop these new packages.  We also can provide you a framework to use to simplify your conversion process!  Call us today… we can help!
Original Article
shannonlowder · 13 years ago
Text
HELP! My Log Drive is Full!
A few weeks back I received a call from a client who isn't running one of our monitoring solutions.  "Help!  ServerX is not responding!  I can't even connect to it via RDP."  This machine happens to be a virtual machine, so I ask if he can log in to the host machine, and check for any errors on that guest.
"Yeah, drive D is showing -1 MB free."
Wow, negative drive space!  That's pretty cool, I think to myself.  We run through a couple options to try and recover gracefully.  Those fail.  So, I drop back an extra 5 yards... I'm going for a Hail Mary.
"OK, I don't like this option but here's how we're going to fix it." I begin. 
I then walk him through forcing the guest machine to power off.  We add additional drive space to D, and reboot the machine.  As soon as it boots, I RDP onto the box and look at the wreckage.  Turns out the log file had grown out of control.  I dig a little deeper and find that the database I'd set to full recovery mode for them almost a year back was still in full recovery mode.
Just one little problem...The maintenance plan to take a transaction log backup and full backup was disabled.
I walk the customer through a full backup.  I then resize the log file to an appropriate size, and we re-enable both backup maintenance plans.  Once they're back up and running, and the threat of any additional failure is past, we start digging into the post mortem.  We want to find where the failure came from.
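In broad strokes, the recovery sequence looked something like the sketch below. The database name, file paths, logical log file name, and target size are all placeholders; the right log size depends entirely on the workload.
-- 1. Take a full backup so we have a valid starting point again.
BACKUP DATABASE ClientDB
    TO DISK = N'D:\Backups\ClientDB_full.bak'
    WITH INIT, STATS = 10;

-- 2. Back up the log so the inactive portion can be cleared and reused.
BACKUP LOG ClientDB
    TO DISK = N'D:\Backups\ClientDB_log.trn'
    WITH INIT, STATS = 10;

-- 3. Shrink the bloated log file back down to a size the normal
--    workload actually needs (ClientDB_log is the logical file name).
USE ClientDB;
DBCC SHRINKFILE (ClientDB_log, 8192);  -- target size in MB

-- 4. Re-enable the full and transaction log backup jobs so this
--    doesn't happen again.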
I dig through the agent log and find out when the last successful backups were run.  I then work with the client to figure out what changed on that day.  Turns out a newer technician, who had very little SQL experience, took over the maintenance of the server.
Long story short, I ask for a private training session with the new technician.  We work through the different recovery types available in SQL Server.  We go over the pros and cons of each, as well as the costs of each.  I then run through a few scenarios of how we can have failures, and how we can recover from each.  I want the technician to be a little better equipped to handle the responsibilities he has, without calling him out in front of his bosses. 
In the end, I pitch the idea of setting up some monitoring to help catch this sort of event before it becomes critical.  Hopefully they choose to go with that monitoring.
There are two take-aways from this story.  One: know your recovery strategies in SQL Server, or contact us and we can help you determine which strategy is best for your company and your data.  Second: be on the look out for problems before they can do any real damage.  Prevention really is the best medicine.  We can help with that too!
Original Article
shannonlowder · 13 years ago
Text
DATEPART(WK, DateColumn)
It's been a busy month since my last post.  Moving from contracting to consulting requires more work than I'd thought at first.  In looking over the past few months, I've built a few new queries that I hadn't built before.  This first one comes from a report built for a client.  The request came in to break down aggregates into weeks.  Breaking down aggregates by year and month is simple; SQL has functions that do that no problem, you simply call YEAR(dateColumn) or MONTH(dateColumn).
Without thinking too much about it, I suggested using DATEPART, but I didn't know the exact parameter I'd use to be able to group my data by weeks.  Turns out I wanted to use DATEPART(WK, DateColumn).
SELECT DATEPART(wk, dateColumn) as WeekNumber
     , COUNT(*) as AggregateCount
FROM sourceTable
GROUP BY DATEPART(wk, dateColumn)
But this brought up the question, what's the week? Sunday to Saturday? Monday to Sunday? By default the first day of the week is Sunday. How can you see that? Well, let's check what day of the week this past Sunday (4/29/2012) was.
SELECT DATEPART(dw, '4/29/2012') AS 'This Past Sunday';
That's the way I choose my weeks, and the SQL Developers evidently agree with me, but our business users did not. So I needed to change the first day of the week to Monday. Fortunately, the SQL developers made it dead simple to change the first day of the week from Sunday to Monday. They gave us SET DATEFIRST.
All you have to do is change the day of the week to the number of the day of the week you want to be the first day of the week.
Value                         First day of the week is
1                             Monday
2                             Tuesday
3                             Wednesday
4                             Thursday
5                             Friday
6                             Saturday
7 (default, U.S. English)     Sunday
*from msdn.
Simply call SET DATEFIRST before your query, and you can change the first day of the week for all queries using that connection, until it's closed.  You can also SELECT @@DATEFIRST to see what the current first day of the week is, in case you forget.
SET DATEFIRST 1;
SELECT @@DATEFIRST AS 'First Day'
     , DATEPART(dw, '4/29/2012') AS 'This Past Sunday'
     , DATEPART(dw, '4/30/2012') AS 'This Past Monday';
Results in:
First Day This Past Sunday This Past Monday --------- ---------------- ---------------- 1 7 1 (1 row(s) affected)
Now that you have a week number, you can pretty up the results by converting your week numbers into week endings, week beginnings, or the full range (first through last day of the week), as in the sketch below.  The sky's the limit when it comes to presenting the data.  Once we presented this option to the programmers, they took off with the idea, and presented several new reports now capable of breaking down their counts by week. It's good to be able to present a solution to a client and have them embrace it so completely!  It's what makes the job fun.  If you're looking for SQL consulting help, feel free to contact me; we're here to help!
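As a small, hedged example of that last idea: one low-tech way to label each week is to show the earliest and latest dates that actually fall into it, and to group by year as well so week 1 of different years doesn't get lumped together. The table and column names are the same placeholders used above.
SELECT YEAR(dateColumn)         AS SalesYear
     , DATEPART(wk, dateColumn) AS WeekNumber
     , MIN(dateColumn)          AS FirstDateSeen  -- earliest date present in that week
     , MAX(dateColumn)          AS LastDateSeen   -- latest date present in that week
     , COUNT(*)                 AS AggregateCount
FROM sourceTable
GROUP BY YEAR(dateColumn), DATEPART(wk, dateColumn)
ORDER BY SalesYear, WeekNumber;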
Original Article
shannonlowder · 13 years ago
Text
Upgrade SQL Server
One of the most difficult projects to undertake as a DBA is upgrading your SQL Server from one version to another.  The biggest reason it's so difficult is the lack of planning and testing before, during, and after the migration.  With proper planning, everyone involved in the migration, from the DBA, to the programmers, to the business users, can understand what is involved in the migration and what part they will play in it.
I'm going to walk you through the basic steps of the migration.  If you're a DBA, you can use these steps to help educate your coworkers on what it will take to make the migration happen successfully.  If you're a business user, this article will help you learn how much can be involved in a migration.  If you'd like help moving from one version of SQL Server to another, please feel free to contact me through my new consulting company, HiveMind.
Run the Best Practices Analyzer
For SQL Server 2000 through 2012, there are programs written by Microsoft to help determine if your SQL Server instance is configured according to the certified best practices.  It will help determine if there are any problems with your current server.  If there are, it can recommend fixes you may want to take care of before, or during the migration process. It can also help you determine if it will be safe to proceed with your migration.
If there are any issues preventing the migration, address those at this time, and then re-run the analyzer until it suggests it's safe to proceed.
Check for any queries that will break
When Microsoft updates the engine, they make changes that make SQL more efficient and robust.  As a result, some features of earlier versions of SQL Server might not work in later versions.  You need to check for this code before you try to upgrade; otherwise, you might find your system no longer works after the migration.
I suggest running a server side trace to determine if anything will break during the migration.  This trace will need to run long enough to cover a representative sample of your workload.  If your business cycle is weekly, you'll have to run the trace for a week.  If your cycle is monthly, then run the trace for a month; if your cycle is yearly, you might want to develop an alternative test plan to handle that.  Again, contact me if you need help with this testing!
If you determine something will break during the migration, you can either address it now, or you can address it during the test migration.  Some changes will need to be made beforehand; others will only be possible after the upgrade.
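One quick, low-effort check I'd pair with the trace (assuming the source instance is SQL Server 2008 or later) is the deprecated-features performance counter. It won't catch everything the trace will, but it gives you a fast list of deprecated syntax the instance has actually seen since its last restart.
-- Cumulative counts of deprecated feature usage since the instance last started.
-- Any non-zero row points at code worth reviewing before the upgrade.
SELECT instance_name AS deprecated_feature
     , cntr_value    AS times_used
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Deprecated Features%'
  AND cntr_value > 0
ORDER BY cntr_value DESC;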
Build a test environment
This is the step I see most migrations skip, and as a result, this is where most of the pain comes from.  You need to set up an environment where you can safely test your migration, and any changes you have to make in order to make the migration successful.  Without this testing, you simply upgrade to the new version and deal with problems in a live environment.  That usually leads to downtime and upset users in the end.
By setting up an environment that mirrors the backend (SQL), the middleware (any code that touches the database), and the client side programs, you can see exactly how the migration will affect all your moving parts.  During this test period, you can address any problems that come up from the migration.
Make sure you test everything, trust me.  Anything you miss will turn up to give you headaches later!
Spin up the new servers and test again
I do not recommend in-place upgrades for SQL Server; rolling back to a previous version in place is fraught with peril (yeah, I've been reading too much Game of Thrones).  If you've spun up your test environment with sufficient hardware, you can often migrate your test environment into production, and go live!
By keeping your existing systems online, you keep the option available to roll back, should any issues come up that you can't immediately address.  After you're running your new servers in production, repeat your tests and verify everything is truly up and running.  If all your test cases pass, keep your previous servers for some period (usually 30-90 days) as a fallback solution.  But you will need to address how to migrate data from the newer version back to the older version of SQL.
With proper planning, you'll address this before rolling to the new servers.
Success!
If you've followed all these steps, you're finished.  You've had a successful migration.  Now it's time to consider what to do with your old production servers.  I'd suggest considering them for Disaster Recovery Servers, or perhaps an additional test environment for your developers.  No matter what, I'm sure you'll find a good use for the older machines.
As always if you have any questions, let me know.  I'm here to help!
    Original Article
shannonlowder · 13 years ago
Text
World Backup Day
This past Saturday was World Backup Day.  It's not an actual government-recognized holiday, but as a DBA you should start celebrating it.  Maybe celebrate isn't the right word for what you might want to do on WBD.  If you have a copy of the SQL PASS Summit 2011 DVDs, check out Grant Fritchey's lightning talk rant, Backup Testing.  It's a quick 5 minute rant on some of the things you're going to want to do to celebrate the day.
Backup databases you don't want to lose
It's as simple as that.  Pick the restore strategy that matches how much data you're willing to lose.  If you're OK with only getting back to the specific point when you made your backup, choose simple.  If you want a little more control, and want to be able to recover to a point in time, choose full.  If you want to get really fancy, and set up faster backups and restores, let's talk.  We can get really fancy with these backups.
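For the full-recovery case, the basic pattern is a scheduled full backup plus frequent log backups, which also keeps the log file from growing without bound. The names, UNC path, and schedule below are only placeholders; in a real job each backup would get a uniquely named file.
-- Full backup, typically nightly or weekly, written somewhere off the server itself.
BACKUP DATABASE ImportantDB
    TO DISK = N'\\BackupServer\SQLBackups\ImportantDB_full.bak'
    WITH CHECKSUM, STATS = 10;

-- Log backups, typically every 15-60 minutes, which let you restore
-- to a point in time and keep the log's inactive portion reusable.
BACKUP LOG ImportantDB
    TO DISK = N'\\BackupServer\SQLBackups\ImportantDB_log.trn'
    WITH CHECKSUM, STATS = 10;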
Now that you have your backups...
Store them some place safe
Yeah, you're probably backing up to a disk drive in or attached to your server, right? That's alright for creating the backups, but you need to get them off the machine used to create them. Your goal in creating backups has to be to protect against failure. Are you protected from a machine failure if your backups are stored on the machine you're trying to protect?
The two ways I protect my backups are shipping them off to another machine on my network, or shipping them off to a cloud storage solution.  The cloud solution will cost you based on the amount of storage you use, but with the first option, if you were to lose the data center that houses both your SQL Server and the backup server, you would still lose your backups.  That's why I like the cloud solution.  Your backup files are generally stored on several servers, rather than just one.
You do have the trade-off of restore time.  You'll have to download those files from the cloud service provider in order to restore them.  You'll have to consider the costs when deciding where to store your backups.
Test your backups
Like Grant mentioned in his Lightning rant: You can apply all sorts of settings to your backups to make sure they're written properly, but the only way you can be sure your backups are successful is to test them.  The best way to do that is to run a restore of your backups to another server, and make sure the databases come back online with all the right data on the other server.
So let's make World Backup Day world backup testing day, too.  Pull your latest backups and try to restore them.  That way, if any fail to restore, you've got a chance to fix the problem before you need those backups.
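A rough sketch of what that restore test can look like on a scratch server follows. The file paths, the throwaway database name, and the logical file names after MOVE are assumptions; RESTORE FILELISTONLY against the backup will tell you the real logical names.
-- Quick check that the backup media is readable (not a substitute for a real restore).
RESTORE VERIFYONLY FROM DISK = N'D:\BackupTests\ImportantDB_full.bak';

-- The real test: restore under a throwaway name on a test server.
RESTORE DATABASE ImportantDB_RestoreTest
    FROM DISK = N'D:\BackupTests\ImportantDB_full.bak'
    WITH MOVE 'ImportantDB'     TO N'D:\BackupTests\ImportantDB_RestoreTest.mdf'
       , MOVE 'ImportantDB_log' TO N'D:\BackupTests\ImportantDB_RestoreTest.ldf'
       , STATS = 10;

-- Then make sure what came back is healthy.
DBCC CHECKDB ('ImportantDB_RestoreTest') WITH NO_INFOMSGS;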
It could mean the difference between being able to recover from a failure, and not recovering from that failure.
If you need help, or have any questions about planning for disaster recovery, or testing your backup strategy, let me know!  I'm here to help!
    Original Article
shannonlowder · 13 years ago
Text
SQL203 -- Indexes can help|hurt
Yeah, I used a pipe in my title.  That’s because indexes can help some things, but hurt others.  The brief of it is indexing helps reads, but comes at a cost: your inserts, updates and deletes are going to take longer.  That’s because as you change your data, you’re going to have to update the index too.  Let’s dive into a scenario I was working on a few weeks ago.
Our “Bad” Table
So we had this table in our database that supports a high profile product from a company we all know.  We’re currently rolling out the largest instance of this product in the world.  So we’re pushing a lot more data into the system than any other users of this product.  So we expected to find places where the use of the table was greater than the table designers anticipated.
During several of our test deploys we discovered deadlocking on one process.  So we began to dig in.  We found that this one table had a lock that was being held whenever a user or process was accessing the system during a deploy.  This lock was caused by the fact our IO, although on a high end SAN, still couldn’t keep up with our usage (or misuse as we would soon discover).
After a little digging, we narrowed the problem down to this one table.  Whenever we deploy, we have to read this table, find the object IDs that are being updated during the deploy, and write some changes to this table.
Straightforward, right?
Well, not so much.  It turns out that this table has over 100 columns, and 42 non clustered indexes.
I was shocked.  How could a product from this company have a table like this?  So I started looking at the index statistics.
SELECT o.name Object_Name
     , i.name Index_name
     , i.Type_Desc
     , s.*
FROM sys.objects AS o
JOIN sys.indexes AS i
    ON o.object_id = i.object_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats AS s
    ON i.object_id = s.object_id
   AND i.index_id = s.index_id
WHERE o.name = 'badTable'
Yeah, we weren’t using most of these indexes.  We used the clustered for our bookmark lookups, and a couple of the non-clustered indexes for seeks, but most of them had no reads, seeks, or scans.  But they all had remarkable update costs.
So during one of our meetings with the vendor about our deploy problem we disclose our findings.  They listen to us patiently and take notes on all we’ve found.  They take it back to their engineers and do some research of their own.  After a week, they present us with a fix.  Add 14 more indexes.
Really!?
We’re holding off on implementing these indexes until we can have another discussion about this issue, but it does bring up something you really must know if you’re going to architect database solutions.  You really have to understand that indexes will only speed up your reads.  They can change table scans into index seeks.  If you add a covering index you can even eliminate lookups.  But they come at a cost.
Your writes are going to take longer because you’re also going to have to write to the indexes.  So in our case, rather than updating just the table, you’re updating up to 42 non clustered indexes.  It depends on what columns you’re changing in the update.  You have to update each index that covers the value you changed.  This cost can become so high, you end up with deadlocks.
When you start designing indexes, look at the workload.  Don't just throw indexes at a problem.  And for all that is good and just in the world, please take anything the Database Engine Tuning Advisor says and test it before you send it out as a good suggestion.  It only looks at the one query you send it, not at your entire workload.  The DMVs, like the missing-index views sketched below, will help you look at that entire workload.
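For what it's worth, here's the sort of missing-index query I mean. It gives a workload-wide view of what the engine thinks it's lacking, but every suggestion still deserves the same scrutiny you'd give the DTA; the benefit column is a rough heuristic, not gospel.
SELECT d.statement AS table_name
     , d.equality_columns
     , d.inequality_columns
     , d.included_columns
     , gs.user_seeks
     , gs.avg_total_user_cost * gs.avg_user_impact * gs.user_seeks AS rough_benefit
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g
    ON d.index_handle = g.index_handle
JOIN sys.dm_db_missing_index_group_stats gs
    ON g.index_group_handle = gs.group_handle
ORDER BY rough_benefit DESC;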
---
If you’re struggling with indexing, or anything else in SQL Server, let me know.  I’m here to help!
Original Article
shannonlowder · 13 years ago
Text
SQL 201 - APPLY
Let's finish up the SQL 201 series today with the APPLY operator. After you learn how to create your own functions, you're going to want to use them in new and interesting ways. Learning to use APPLY is like learning to use joins for functions. It lets you run a function against multiple rows at a time.
Just like joins, you've got a couple different APPLY functions.
CROSS APPLY
The CROSS APPLY is the INNER JOIN for functions. It only returns rows from your source table where you have a result in your function. Let's take an easy query that shows you sales by sales person.
SELECT c.FirstName, c.MiddleName, c.LastName
     , SUM(TotalDue) AS TotalSales
FROM sales.SalesPerson sp
INNER JOIN HumanResources.Employee e on sp.SalesPersonID = e.EmployeeID
INNER JOIN Person.Contact c on e.ContactID = c.ContactID
INNER JOIN Sales.SalesOrderHeader soh on sp.SalesPersonID = soh.SalesPersonID
GROUP BY c.FirstName, c.MiddleName, c.LastName
  Now, let's turn that SalesOrderHeader lookup into a function.
CREATE FUNCTION f_getTotalSales (@SalesPersonID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(TotalDue) as TotalSales
    FROM Sales.SalesOrderHeader soh
    WHERE SalesPersonID = @SalesPersonID
)
  Again, easy.
Now, let's replace the join to SalesOrderHeader into a function call.
SELECT c.FirstName, c.MiddleName, c.LastName
     , f.TotalSales
FROM sales.SalesPerson sp
INNER JOIN HumanResources.Employee e on sp.SalesPersonID = e.EmployeeID
INNER JOIN Person.Contact c on e.ContactID = c.ContactID
CROSS APPLY f_getTotalSales(sp.SalesPersonID) f
You simply replace the INNER JOIN to SalesOrderHeader with the CROSS APPLY. The difference in syntax is that you pass in the SalesPersonID you're trying to find the sales total for, rather than joining ON a criteria.
OUTER APPLY
The OUTER APPLY is the OUTER JOIN for functions. It will show you all the rows from your source table, even if they don't have a result in your function. In our case we're going to need to create a new sales person. Let's add someone with a sales title, who isn't already a sales person.
INSERT INTO sales.SalesPerson (SalesPersonID)
SELECT EmployeeID
FROM HumanResources.Employee e
LEFT JOIN sales.SalesPerson sp on sp.SalesPersonID = e.EmployeeID
WHERE title like '%sales%'
  AND sp.SalesPersonID IS NULL
Now we know we have a sales person without any sales yet. So when we run our OUTER APPLY, I see that Brian Welcker needs to get busy making some sales; he's got a lot of ground to cover to catch up with the rest of the pack.
SELECT c.FirstName, c.MiddleName, c.LastName
     , f.TotalSales
FROM sales.SalesPerson sp
INNER JOIN HumanResources.Employee e on sp.SalesPersonID = e.EmployeeID
INNER JOIN Person.Contact c on e.ContactID = c.ContactID
OUTER APPLY f_getTotalSales(sp.SalesPersonID) f
Summary
Learning to apply functions to queries is a very useful skill to have, especially when you want to start playing with dynamic management views. A couple of the really cool views are actually functions, and if you want to get useful data out, you're going to have to use the APPLY operator. It's not that hard to learn; just draw the analogy to joins in your mind, and it'll click.
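As one common example of what I mean (sketched against the plan cache; the column choices are just a starting point), sys.dm_exec_sql_text is a function, so getting query text next to the stats means APPLYing it to each row's sql_handle:
-- Top 10 cached queries by total CPU time, with their text pulled in
-- by applying the dm_exec_sql_text function to each row's sql_handle.
SELECT TOP 10
       qs.execution_count
     , qs.total_worker_time AS total_cpu_time
     , st.text AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;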
Well, that's it for my SQL 201 series. For the next few months, I'm going to be studying more for the MCM, and playing with SQL 2012. As always I'll be sharing my findings here. If you have any questions on learning more SQL, just send them in. I'm here to help!
Original Article
shannonlowder · 13 years ago
Text
New Job Postings for 02/09
Here are the latest jobs sent in. If you're interested in any of these jobs, let me know! There's an application on each job page.
#DEVELOPER in Toronto, Ontario
Access #DEVELOPER in Flint, MI
Software #DEVELOPER in Boise, ID
Software #DEVELOPER in San Francisco, CA
Software #DEVELOPER in Seattle, WA
Software #DEVELOPER in Portland, OR
Software #DEVELOPER in Bend, OR
Web #DEVELOPER in Moorestown, NJ
#dotNET #DEVELOPER Corporate Pleasanton, CA
#SQL #DEVELOPER in Charlotte, NC
Original Article
shannonlowder · 13 years ago
Text
Loading Hierarchical Data Using a WHILE EXISTS
Last time, we discussed three methods for inserting data into a destination table that has a recursive relationship.  In this article we're going to cover how to insert data into that table without violating the foreign key constraint by using a WHILE EXISTS loop.
If you want to follow along with the examples, you're going to need my Staging table and my Destination table.
CREATE TABLE Staging (
    AccountID UNIQUEIDENTIFIER NOT NULL
    , Name NVARCHAR(160) NOT NULL
    , ParentAccountId UNIQUEIDENTIFIER NULL
    , ParentName NVARCHAR(160) NULL
    , BatchID INT
    , ConversionStatus INT
)

--truncate table AccountBaseDestination
CREATE TABLE Destination (
    AccountID UNIQUEIDENTIFIER NOT NULL
    , Name NVARCHAR(160) NOT NULL
    , ParentAccountId UNIQUEIDENTIFIER NULL
    , ParentName NVARCHAR(160) NULL
    , CONSTRAINT [cndx_PrimaryKey_Account] PRIMARY KEY CLUSTERED ([AccountId] ASC)
)

ALTER TABLE Destination
    ADD CONSTRAINT fk_Destination_Destination__ParentAccountID_AccountID
        FOREIGN KEY (ParentAccountId) REFERENCES Destination ([AccountId])
Now let's put some simple hierarchical data into our staging table. Basically, I insert a row, grab the GUID created, and use that for the ParentAccountID of the next row. That way each child will have the right GUID assigned for its parent. Otherwise, the demo breaks!
--demo: several layers deep
--first: empty the staging table
TRUNCATE TABLE Staging

--we need temp storage for an inserted AccountID to use as a parent account id
DECLARE @ParentAccountID UNIQUEIDENTIFIER

--next we need a table variable to trap inserted uniqueidentifers
DECLARE @InsertedAccount TABLE ( AccountID UNIQUEIDENTIFIER )

--Let's insert a root node, and output the inserted Account ID
INSERT INTO Staging (AccountID, Name)
OUTPUT inserted.AccountID INTO @InsertedAccount
SELECT NEWID() as accountID
     , 'Root Node' as Name

--I know we'll just have one for this demo, so TOP 1
SELECT TOP 1 @ParentAccountID = AccountID FROM @InsertedAccount

--clear out the @InsertedAccount before the next insert
--otherwise you'll have Multiple ParentAccountIDs to deal with.
DELETE FROM @InsertedAccount

INSERT INTO Staging (AccountID, Name, ParentAccountID, parentName)
OUTPUT inserted.AccountID INTO @InsertedAccount
SELECT NEWID() as accountID
     , 'first child' as name
     , @ParentAccountID as ParentAccountID
     , 'Root Node' as parentName

--pull out the accountid again
SELECT TOP 1 @ParentAccountID = AccountID FROM @InsertedAccount
DELETE FROM @InsertedAccount

INSERT INTO Staging (AccountID, Name, ParentAccountID, parentName)
OUTPUT inserted.AccountID INTO @InsertedAccount
SELECT NEWID() as accountID
     , 'grandchild' as name
     , @ParentAccountID as ParentAccountID
     , 'first child' as parentName

--pull out the accountid again
SELECT TOP 1 @ParentAccountID = AccountID FROM @InsertedAccount
DELETE FROM @InsertedAccount

INSERT INTO Staging (AccountID, Name, ParentAccountID, parentName)
OUTPUT inserted.AccountID INTO @InsertedAccount
SELECT NEWID() as accountID
     , 'great grandchild' as name
     , @ParentAccountID as ParentAccountID
     , 'grandchild' as parentName

--don't have to clear the table variable this time, we're done.
SELECT * FROM Staging
In order to use a WHILE EXISTS loop, you have to be able to define the records you want to work on. For us we want to insert any records from Staging that either have their ParentAccountID value already in the Destination table in the AccountID column, or do not have a ParentAccountID value (NULL).
Here's my query that shows me the records we could insert.
SELECT *
FROM Staging staging
LEFT JOIN Destination destInsertedCheck
    ON staging.AccountID = destInsertedCheck.AccountID
LEFT JOIN Destination destParentInsertedCheck
    ON staging.parentAccountID = destParentInsertedCheck.AccountID
WHERE destInsertedCheck.accountID IS NULL --not already inserted
  AND (destParentInsertedCheck.AccountID IS NOT NULL OR staging.parentAccountID IS NULL) --parent is defined or NULL
Notice only our Root Node record is shown. The next thing we need for a WHILE EXISTS loop is the query to do the work, in our case we need to insert these records into the Destination table.
INSERT INTO Destination (AccountID, Name, ParentAccountID, ParentName)
SELECT staging.AccountID, staging.Name, staging.ParentAccountID, staging.ParentName
FROM Staging staging
LEFT JOIN Destination destInsertedCheck
    ON staging.AccountID = destInsertedCheck.AccountID
LEFT JOIN Destination destParentInsertedCheck
    ON staging.parentAccountID = destParentInsertedCheck.AccountID
WHERE destInsertedCheck.accountID IS NULL --not already inserted
  AND (destParentInsertedCheck.AccountID IS NOT NULL OR staging.parentAccountID IS NULL) --parent is defined or NULL
Now, we just need to put it all together. I would suggest changing the SELECT * to SELECT somecolumn in the WHILE EXISTS check. There's no need to pull back the whole row, one non-nullable column would be fine. Even SELECT 1 would be fine.
Take a look at the full code now.
WHILE EXISTS (
    SELECT staging.AccountID
    FROM Staging staging
    LEFT JOIN Destination destInsertedCheck
        ON staging.AccountID = destInsertedCheck.AccountID
    LEFT JOIN Destination destParentInsertedCheck
        ON staging.parentAccountID = destParentInsertedCheck.AccountID
    WHERE destInsertedCheck.accountID IS NULL --not already inserted
      AND (destParentInsertedCheck.AccountID IS NOT NULL OR staging.parentAccountID IS NULL) --parent is defined or NULL
)
BEGIN
    INSERT INTO Destination (AccountID, Name, ParentAccountID, ParentName)
    SELECT staging.AccountID, staging.Name, staging.ParentAccountID, staging.ParentName
    FROM Staging staging
    LEFT JOIN Destination destInsertedCheck
        ON staging.AccountID = destInsertedCheck.AccountID
    LEFT JOIN Destination destParentInsertedCheck
        ON staging.parentAccountID = destParentInsertedCheck.AccountID
    WHERE destInsertedCheck.accountID IS NULL --not already inserted
      AND (destParentInsertedCheck.AccountID IS NOT NULL OR staging.parentAccountID IS NULL) --parent is defined or NULL
END
So now, if you take a look at your source and destination tables, you'll see all the records from your source are in the destination, and you didn't get any foreign key constraint violations like we saw in the last article. Pretty easy, right? Next time, I'll show you how you can use a CTE to define what "level" a record is on, and process them from root level to leaf level. Until then, if you have any questions, let me know! I'm here to help.
Original Article
shannonlowder · 13 years ago
Text
Loading Hierarchical Data
Earlier this week the question came up of how we can insert data into a table that has a hierarchical data structure.  If you're not familiar with hierarchical data structures, here's a quick explanation.
Consider a staging table used to load new accounts into our system.  Each account has a unique identifier and a name.  Since one company can own another, we have to be able to show that in our system.  In this system we have parent companies and child companies.  The child companies' accounts are related to a parent account.
So in our destination table, this relationship would be enforced by adding a primary key on the AccountID, and then adding a foreign key on ParentAccountID to relate it to AccountID.
Fairly straightforward, right?
Well, when you're pushing thousands of rows into the table at a time, you can't control the order of the inserts.  As soon as you try to insert a record that has a ParentAccountID that hasn't yet been defined with a row using that AccountID, you're going to get an error message.
Msg 547, Level 16, State 0, Line 1 The INSERT statement conflicted with the FOREIGN KEY SAME TABLE constraint "fk_AccountBaseDestination_AccountBaseDestination__ParentAccountID_AccountID". The conflict occurred in database "RV_DataConversion_Stage", table "dbo.AccountBaseDestination", column 'AccountID'. The statement has been terminated.
So now we're to the problem at hand.  How can we make sure we don't try to insert a row with a parentID that's not defined?
Use A Cursor
It was proposed we use a cursor to run through the rows, grab a parent and all its children, insert the parent first, then insert all its children, then return to the beginning of the set and repeat.  While that would avoid the foreign key violation, it would cost a lot of I/O (disk resources) to do it that way.
BTW: while this option was being presented, the presenter said we won't use cursors to do this.  But at the same time he was advocating a row-by-row operation.
If you've known me for any length of time, you know my stance on row-by-row operations in SQL.  I'm not a fan.
  Insert all the Records, Then Update the ParentAccountID
Another proposal was to insert all the records, and not pass in the ParentAccountID.  Then we could run an update statement against the Destination table where that parent had been defined.  If there were any rows that needed a ParentAccountID where that AccountID was not defined, we simply wouldn't update that row.
Basically the update would be an update with an INNER JOIN, so we'd avoid the violation that way.
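To make that concrete, here's a rough sketch of the pattern, borrowing the Staging and Destination tables defined in the WHILE EXISTS article; the column list is an assumption about what the real load would carry.
-- Pass 1: land every account with no parent reference at all.
INSERT INTO Destination (AccountID, Name)
SELECT AccountID, Name
FROM Staging;

-- Pass 2: fill in ParentAccountID only where the parent row now exists,
-- so the INNER JOIN keeps us clear of the foreign key violation.
UPDATE d
SET d.ParentAccountID = s.ParentAccountID
  , d.ParentName = s.ParentName
FROM Destination d
INNER JOIN Staging s
    ON d.AccountID = s.AccountID
INNER JOIN Destination parent
    ON s.ParentAccountID = parent.AccountID
WHERE s.ParentAccountID IS NOT NULL;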
Not bad, but if the destination table was being used, users might see a ton of new accounts at the root level, not at whatever level they're supposed to be at.  You would also require locks on the destination table during the update that might escalate from row locks, to page locks, to table locks.  You'd have to test a load where you were adding 25-50% new records.  So if your destination table had 100 records, test a load of 25 new records.  Then test a load of 50 records.  During the load, watch what locks are being taken.
If you see it escalate to a table lock, you may not want to do that to your users, if they are on the destination table regularly.
Use a CTE
A rather interesting idea that came up was to write a CTE that would select the staging data and add on a LevelNumber column.  This column would indicate what level in the hierarchy the account was on.  For example, level 1 would be the root node, level 2 would be children of accounts on level 1, level 3 would be children of level 2, and so on.
Once we know what level each account is on, we could load all the level 1 accounts, then all the level 2 accounts, and so on.  Not a bad idea.  Especially if your insert has to go through an SSIS package or web service to do the actual insert into the database, you can be sure you're only adding records in an integrity-safe way!  The level calculation itself is sketched below.
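Here's roughly what that level calculation looks like, written as a recursive CTE against the hypothetical Staging table from the WHILE EXISTS article:
WITH AccountLevels AS
(
    -- Anchor: root accounts have no parent, so call them level 1.
    SELECT AccountID, Name, ParentAccountID, 1 AS LevelNumber
    FROM Staging
    WHERE ParentAccountID IS NULL

    UNION ALL

    -- Recursive member: a child sits one level below its parent.
    SELECT s.AccountID, s.Name, s.ParentAccountID, al.LevelNumber + 1
    FROM Staging s
    INNER JOIN AccountLevels al
        ON s.ParentAccountID = al.AccountID
)
SELECT AccountID, Name, ParentAccountID, LevelNumber
FROM AccountLevels
ORDER BY LevelNumber;
From there, the load runs level by level, so every insert only references parents that are already in the destination.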
This is the method we ended up going with, due to the fact the insert was going through a webservice.  But... there is another way.
WHILE EXISTS Loop
The last method we discussed was to first insert all the root nodes (level 1 from the previous section), and then use a WHILE EXISTS loop to insert children where their parent was defined.  The end result is very similar to the CTE solution; in fact the costs are almost identical.  You have to "loop" through the code once for every level.  The difference here is that in order to use the WHILE EXISTS method, you have to be able to run the actual INSERT statements.  And that's why we didn't choose this method this time.
In my next training article I'm going to go through this method, so you can see the design pattern.
Conclusion
In the end we went with the method that used the least impact on the server, hits all of the requirements, and gets the job done with the least effort.  There's always a lot of back and forth in these design meetings.  It's all a part of the process of designing good code.  You can't get too invested in "my way is best, I won't do it any other way."  That's just counter-productive.
You have to open your ideas up to scrutiny and participate in the give and take.  In the end, you have to work with these people long after the process has been implemented.  You're going to have to go through another design meeting.  You don't want to make the next one even harder because you had to have your way...right?
Original Article
shannonlowder · 13 years ago
Text
Disaster Recovery and High Availability Planning
How long could your company go if your database server simply stopped?
Many of my clients have never even thought about this question. It's a question you need to think about, and consider carefully.
Once you know how long you're able to work without that server, you need to build plans on how to recover from failures. They're going to happen. Everything breaks. Being prepared to handle failures when they happen can be what separates your company from the rest.
That's where I come in. I can work with you to determine how much downtime you can afford. I can then work with you to offer a suite of solutions designed to bring your services back online with the least impact to your business.
Database Integrity Checks
Checking your database for corruption should be a regular part of your company's maintenance schedule. Once you've implemented those checks, you have to design an action plan on how to recover from many of the issues that can be uncovered by these checks. Are you prepared for corrupted databases?
Backup strategy
Making backups, storing them securely, and testing them regularly will prepare you for some of the most difficult failures. With a proper backup strategy you can plan for drive failures, cyber-terrorism, or critical server failures. Many of these could end the unprepared company. Let me help you design a strategy that protects you and your business today!
Server Redundancy Planning and Implementation
Servers require updates to keep running smoothly. Many of these updates require you to restart the server. As systems grow more and more complex, these restarts can begin to take five to ten minutes. I've seen several servers that would take more than an hour to come back online after a scheduled reboot. Can your business handle the database server being offline for an hour or more at a time? If not, let me help you build a solution that provides you with greater up-time!
Site Redundancy Planning
Does your business support geographically diverse clients? Is your company in the southeastern United States? Could your business handle it if a hurricane were to destroy power lines, leaving your database server without power for a week? Many companies support worldwide clients and have no redundancy for their data center. If a natural disaster were to disrupt or destroy their data center, they have no way of resuming work until the disruption is cleared or the data center is rebuilt. Let me help you plan for this scenario. Cloud computing offers a very cost-effective solution for it!
When talking about Disaster Recovery and High Availability my favorite quote is from Winston Churchill. "He who fails to plan is planning to fail." My business is helping you plan, and then take action.
Original Article
shannonlowder · 13 years ago
Text
Improve Performance by Replacing Cursors
This is the last article for my upcoming design patterns session. Today we're going to look into speeding up your queries by replacing cursors, a row-by-row operation, with a table value function. Before you start composing that hate mail on how I'm wrong for not understanding cursors, let me say this: there is a time and a place for everything, even cursors (think running totals). The problem is I still see them doing row-by-row operations in SQL, and SQL is not optimized for row-by-row. It is optimized for set-based operations.
If you can rethink your T-SQL logic and consider the whole set of data you'll be working on, rather than what you have to do to each row, you'll find you're rewarded by the SQL Engine. Your results come back in record time!
Let's look at an example I created to illustrate a row-by-row operation. This cursor takes in a column containing a user's full name, and breaks it into first and last name using a space as the break point.
If you want to follow along, and you have my OriginalTroubledTable, use the following code to set up the NameSplitTable.
IF OBJECT_ID('adventureworks.dbo.NameSplitTable') IS NOT NULL
    DROP TABLE NameSplitTable
GO

CREATE TABLE [dbo].[NameSplitTable](
    [NameID] INT IDENTITY(1,1)
    , [FullName] VARCHAR(100) NULL
    , [FirstName] [varchar](50) NULL
    , [LastName] [varchar](50) NULL
) ON [PRIMARY]
GO

--I want to have more than 10k rows so repeat inserts until we get more!
WHILE (SELECT COUNT(*) FROM [NameSplitTable]) < 10000
BEGIN
    INSERT INTO [NameSplitTable] ([FullName])
    SELECT firstname + ' ' + LastName
    FROM OriginalTroubledTable
END
If we were doing a single row, the following code wouldn't be that bad.
DECLARE @fullname varchar(100)
DECLARE @NameID INT
DECLARE @splitPos INT

SELECT TOP 1
       @NameID = NameID
     , @fullName = FullName
FROM [NameSplitTable]

SET @splitPos = CHARINDEX(' ', @fullname)

IF (@splitPos > 0)
BEGIN
    UPDATE [NameSplitTable]
    SET FirstName = SUBSTRING(@fullName, 1, @splitPos)
      , LastName = SUBSTRING(@fullName, @splitPos, LEN(@fullName))
    WHERE NameID = @NameID
END
Look at the execution plan; the cost of the UPDATE isn't horrible. In my test environment the execution took less than one second.
The real cost of this query comes in when you run this code once for every row of the table. In my test case, I have 11,760 rows. That means that this table scan is going to scan through the table 11,760 times. That will add up quickly! Let's see the Cursor version of the code.
DECLARE @fullname varchar(100)
DECLARE @NameID INT
DECLARE @splitPos INT

DECLARE FullNames CURSOR FOR
    SELECT NameID, FullName
    FROM [NameSplitTable];

OPEN FullNames;
FETCH NEXT FROM FullNames INTO @NameID, @fullname;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @splitPos = CHARINDEX(' ', @fullname)

    IF (@splitPos > 0)
    BEGIN
        UPDATE [NameSplitTable]
        SET FirstName = SUBSTRING(@fullName, 1, @splitPos)
          , LastName = SUBSTRING(@fullName, @splitPos, LEN(@fullName))
        WHERE NameID = @NameID
    END

    FETCH NEXT FROM FullNames INTO @NameID, @fullname;
END

CLOSE FullNames
DEALLOCATE FullNames
You can run this on your machine if you have a lot of patience, but I'm going to run it for 60 seconds, and see how many rows I can update in 60 seconds. I was able to update 5,734 rows. And that's the big problem with cursors, each operation could be very quick, but the cumulative effect of running the code so many times adds up!
In this case it's the cumulative effect of doing that many table scans and table updates to do the work. How much faster would it be if I could update all the rows at once?
Let's reset our test table and find out.
update [NameSplitTable] set FirstName = NULL, LastName = NULL
First, in order to be able to do this string manipulation across multiple rows we're going to need a table value function to do the same substring we did in the cursor.
CREATE FUNCTION tvf_splitNames ( @fullName VARCHAR(100) )
RETURNS @output TABLE (
      [FirstName] [varchar](50) NULL
    , [LastName] [varchar](50) NULL
)
AS
BEGIN
    DECLARE @splitPos INT
    SET @splitPos = CHARINDEX(' ', @fullname)

    IF (@splitPos > 0)
        INSERT INTO @output
        SELECT SUBSTRING(@fullName, 1, @splitPos) FirstName
             , SUBSTRING(@fullName, @splitPos, LEN(@fullName)) AS LastName
    ELSE
        INSERT INTO @output
        SELECT @fullname as FirstName, NULL as LastName

    RETURN
END
Now that we have a table value function we can use the CROSS APPLY technique to apply this function to all the rows in our NameSplitTable. We simply map in The full name, and pull out the resulting first and last names.
SELECT nst.FullName
     , tvf.FirstName
     , tvf.LastName
FROM [NameSplitTable] nst
CROSS APPLY tvf_splitNames (nst.FullName) tvf
Yes, this is only a SELECT, but I wanted to show you how the 84% cost of the table scan changed once we moved from row-by-row to set-based processing.
We reduced the cost of the table scan, by adding a table valued function and a nested loop. Yes, this is apples and oranges to the original function, but run the code. We were able to select back 11,760 rows, getting their first and last names in 9 seconds.
The cursor took 2:35 to actually update all the rows.
UPDATE nst
SET nst.FirstName = tvf.firstName
  , nst.LastName = tvf.LastName
FROM [NameSplitTable] nst
CROSS APPLY tvf_splitNames (nst.FullName) tvf
The cross apply updated it in 1 second.
Now that is apples to apples.
One of the harder things to do in SQL is knowing how well your code is going to scale. If you write code thinking your workload will be a handful of records once a day, and that code is actually run several hundred thousand times per minute, you may not be happy with how quickly the work is performed.
The trick is always learning what SQL does well, and what it doesn't. Once you understand that, you'll begin to know what you have to change to speed things up!
That's it for my design patterns session. As always, if you have any questions, please let me know!
Original Article
shannonlowder · 13 years ago
Text
Upserts
Up to this point in our design patterns sessions we've concentrated on removing duplicates from our troubled table.  Now, let's look into preventing those duplicates from being inserted in the first place.  If you haven't heard the phrase upsert before, it's the idea of doing an INSERT and UPDATE in the same statement; that way you handle changing data and inserting it in one go.  It's very helpful for preventing duplicate data from being inserted.
For this article we're going to use T-SQL to prevent these duplicates, rather than establishing a constraint, and then building an error handler when we try to insert duplicate data.  Let's dig in.
CREATE TABLE [dbo].[TroubledTable](
    [FirstName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [email] [varchar](255) NULL
)
Consider our simple troubled table.  This little table has done a good job of collecting contact information so far.  We decide we have no good reason to change the structure; we just want to make it so that if a user tries to enter the same row twice, it won't actually INSERT into the table.  If we use an IF NOT EXISTS check, we can make sure that row doesn't already exist.  Once we know the record doesn't exist, we can run the INSERT statement.  Here's what that would look like.
-- @FirstName, @LastName, and @email stand in for the values you're trying to insert
IF NOT EXISTS (
    SELECT FirstName
    FROM TroubledTable
    WHERE FirstName = @FirstName
      AND LastName = @LastName
      AND email = @email )
    INSERT INTO TroubledTable (FirstName, LastName, email)
    VALUES (@FirstName, @LastName, @email)
Pretty easy, right?  Now, this solution doesn't really allow us to update information.  If we wanted to allow updates, we could add a primary key to the table so we can uniquely identify each row, then change the construct to search for that primary key: when it's found we UPDATE, otherwise we INSERT.  In this new scenario we could have two people with the same name, but no two users are allowed to have the same email address.
First we set up a variable to hold the primary key in the case we want to do an update.
DECLARE @PKToUpdate INT
Next, we're going to look for a value for @PKToUpdate.
SELECT @PKToUpdate = ttPrimaryKey
FROM TroubledTableWithPK
WHERE email = @email
If we find no rows, then @PKToUpdate will be NULL.  So we can use that knowledge to build our upsert.
IF (@PKToUpdate IS NOT NULL)
    UPDATE TroubledTableWithPK SET
          FirstName = @FirstName
        , LastName = @LastName
        , email = @email
    WHERE ttPrimaryKey = @PKToUpdate
ELSE
    -- inserting into the same TroubledTableWithPK table the UPDATE targets
    INSERT INTO TroubledTableWithPK (FirstName, LastName, email)
    VALUES (@FirstName, @LastName, @email)
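If it helps to see those pieces put together, here's a minimal sketch of the whole thing as a stored procedure. The procedure name usp_UpsertContact is just a placeholder of mine, and I'm assuming TroubledTableWithPK has an integer key column named ttPrimaryKey, as described above.

-- Sketch only: usp_UpsertContact is a hypothetical name; adjust it and the
-- column sizes to match your own TroubledTableWithPK.
CREATE PROCEDURE usp_UpsertContact
      @FirstName VARCHAR(50)
    , @LastName VARCHAR(50)
    , @email VARCHAR(255)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @PKToUpdate INT;

    -- look for an existing row with this email address
    SELECT @PKToUpdate = ttPrimaryKey
    FROM TroubledTableWithPK
    WHERE email = @email;

    IF (@PKToUpdate IS NOT NULL)
    BEGIN
        -- the email already exists, so just refresh the name columns
        UPDATE TroubledTableWithPK SET
              FirstName = @FirstName
            , LastName = @LastName
        WHERE ttPrimaryKey = @PKToUpdate;
    END
    ELSE
    BEGIN
        -- brand new email address, so insert it
        INSERT INTO TroubledTableWithPK (FirstName, LastName, email)
        VALUES (@FirstName, @LastName, @email);
    END
END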
Easy, right?  Either of these methods works really well for a single-row-at-a-time system, such as a web form built for one user at a time to come in and register.  But what if you had a system that allowed bulk inserts?
This is where the MERGE statement shines.  It's a much more elegant way to handle all three operations in a single statement: INSERT, UPDATE, and DELETE.  If you haven't seen it before, don't worry.  It's all concepts you already know, just put together in a new way!  Check it out.
We first have to think about our data as having two tables: a source table and a target.
In our example we're going to use my OriginalTroubledTable, a backup I made of TroubledTable, as the source. It has the same structure as TroubledTable, so it makes things a bit easier. Now that we have a source and a target, we have to decide how we'll match rows between the two, just like in a JOIN. When doing upserts, you're generally going to want the JOIN (or match) criteria to be the information that makes each row unique. In our case we don't want duplicate email addresses, so we're going to join on email from each table.
So far we have this much of our MERGE statement written by coming up with the source, target and JOIN criteria.
MERGE INTO TroubledTable AS [target]
USING OriginalTroubledTable AS [source]
    ON [target].email = [source].email
Next, we have to decide what we're going to do when we have a match, when we don't have a match in the target, and what we want to do if there's no match in the source. All three of these are things you can handle with the MERGE statement.
So, in the case where we have a match on our join criteria, we will want to do an UPDATE. If we don't have a match in the target, we're going to do an INSERT.  If we don't have a match in the source, we're not going to do anything, so at least that's easy!
Here's how we construct that MERGE statement:
MERGE INTO TroubledTable AS [target]
USING OriginalTroubledTable AS [source]
    ON [target].email = [source].email
WHEN MATCHED THEN
    UPDATE SET
          [target].FirstName = [source].FirstName
        , [target].LastName = [source].LastName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (FirstName, LastName, email)
    VALUES ([source].FirstName, [source].LastName, [source].email);
So when we run this, any records missing from the target are added. Any records that already exist will have their first and last names updated. And we're all good right?
Msg 8672, Level 16, State 1, Line 1 The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
That's right: MERGE doesn't handle it if your source table has duplicates in it! You might want to check out my de-duping articles. Either that, or you could use a subquery for the source.
MERGE INTO TroubledTable AS [target]
USING (
    SELECT DISTINCT FirstName, LastName, email
    FROM OriginalTroubledTable ) AS [source] -- the source rows are distinct now
    ON [target].email = [source].email
WHEN MATCHED THEN
    UPDATE SET
          [target].FirstName = [source].FirstName
        , [target].LastName = [source].LastName
        -- no need to update email, since that's our join criterion
WHEN NOT MATCHED BY TARGET THEN
    INSERT (FirstName, LastName, email)
    VALUES ([source].FirstName, [source].LastName, [source].email);
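One caveat on that subquery: SELECT DISTINCT only collapses rows that match on all three columns. If the same email address shows up with two different spellings of the name, you'll hit error 8672 again. Here's a hedged sketch of one way around that; it reuses the ROW_NUMBER trick from the de-duping articles to keep exactly one source row per email (which copy wins is arbitrary here, so swap in a real ORDER BY if you have a way to pick the best one).

MERGE INTO TroubledTable AS [target]
USING (
    SELECT FirstName, LastName, email
    FROM (
        SELECT FirstName, LastName, email
             , ROW_NUMBER() OVER (PARTITION BY email ORDER BY ( SELECT 0)) AS RN
        FROM OriginalTroubledTable ) dedupe
    WHERE dedupe.RN = 1 ) AS [source] -- exactly one row per email
    ON [target].email = [source].email
WHEN MATCHED THEN
    UPDATE SET
          [target].FirstName = [source].FirstName
        , [target].LastName = [source].LastName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (FirstName, LastName, email)
    VALUES ([source].FirstName, [source].LastName, [source].email);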
Learning to read between the lines when you get a SQL error will go a long way in your career! This is just one chance to do that.
These are my two patterns for dealing with upserts. If you have another way, please let me know! I'm always interested in seeing new ways to do things.
As always if you have any questions about this, or anything else in SQL, let me know! I'm here to help.
Original Article
shannonlowder · 13 years ago
Text
De-duping by CTE
Ok, today we wrap up our design patterns for removing duplicates from tables by turning to common table expressions (CTE's). When I start talking CTE's some people begin to cringe in fear. I'll admit, I still have to refer to a reference when doing something I haven't done before with a CTE, which is pretty much de-duping and building hierarchical lists. But they're useful, and they can have a light load on your SQL server in some situations, and de-duping is one of those situations, so let's dive right in!
First, we're going to make sure our troubled table exists. If you don't have it, grab my script here to create it. Then take a look at the duplicates you have.
SELECT email, COUNT(*)
FROM TroubledTable
GROUP BY email
HAVING COUNT(*) > 1
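If you don't have the troubled table from the earlier posts handy, here's a minimal stand-in you can build to follow along. The sample rows below are made up for illustration; the script linked above builds a much larger data set, which is where the row counts in this post come from.

-- A small stand-in for the TroubledTable used throughout the de-duping posts.
CREATE TABLE [dbo].[TroubledTable](
    [FirstName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [email] [varchar](255) NULL
)

-- a few deliberate duplicates to play with
INSERT INTO TroubledTable (FirstName, LastName, email) VALUES ('Abra', 'Romero', '[email protected]')
INSERT INTO TroubledTable (FirstName, LastName, email) VALUES ('Abra', 'Romero', '[email protected]')
INSERT INTO TroubledTable (FirstName, LastName, email) VALUES ('Abra', 'Romero', '[email protected]')
INSERT INTO TroubledTable (FirstName, LastName, email) VALUES ('Maria', 'Lopez', '[email protected]')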
We're going to look at an email address that's been duplicated in our table.
SELECT * FROM troubledtable WHERE email = '[email protected]'
I'm looking at 8 rows with that email address: the original plus 7 extra copies. Now we're going to create a CTE that gives every copy of an email address a distinct row number. Then we can use that row number to identify the "extra" copies, in our case any record that has a row number greater than 1.
Let's first look at the query that gives us the row numbers for each copy.
SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email
                          ORDER BY ( SELECT 0)) RN
     , Firstname, lastname, email
FROM troubledtable
WHERE email = '[email protected]'

--RESULTS:
/*
RN  Firstname  lastname  email
--- ---------  --------  -----------------------------
1   Abra       Romero    [email protected]
2   Abra       Romero    [email protected]
3   Abra       Romero    [email protected]
4   Abra       Romero    [email protected]
5   Abra       Romero    [email protected]
6   Abra       Romero    [email protected]
7   Abra       Romero    [email protected]
8   Abra       Romero    [email protected]
*/
Now, if we define that query as our CTE, we can add another WHERE clause that will show us only the "extra" rows. Please notice I start with a semicolon. Do that so SQL Server knows the previous statement is terminated; the WITH clause requires it, and if you miss it, the previous statement gets tangled up with your CTE and the batch throws a syntax error.
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email
                              ORDER BY ( SELECT 0)) RN
         , Firstname, lastname, email
    FROM troubledtable
    WHERE email = '[email protected]' )
SELECT *
FROM cte
WHERE RN > 1

--RESULTS:
/*
RN  Firstname  lastname  email
--- ---------  --------  -----------------------------
2   Abra       Romero    [email protected]
3   Abra       Romero    [email protected]
4   Abra       Romero    [email protected]
5   Abra       Romero    [email protected]
6   Abra       Romero    [email protected]
7   Abra       Romero    [email protected]
8   Abra       Romero    [email protected]
*/
Those are the records we want to delete. So if we change our SELECT from the CTE to a DELETE from the CTE, we're good to go!
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email
                              ORDER BY ( SELECT 0)) RN
         , Firstname, lastname, email
    FROM troubledtable
    WHERE email = '[email protected]' )
DELETE FROM cte
WHERE RN > 1

--(7 row(s) affected)
Here's the really cool part of using a CTE: the delete was actually performed against the base table. Check out the execution plan for proof!
SELECT email, COUNT(*)
FROM TroubledTable
WHERE email = '[email protected]'
GROUP BY email
HAVING COUNT(*) > 1

--(0 row(s) affected)
Pretty awesome, right? Now, let's remove the where clause and de-dupe the whole table.
;WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email
                              ORDER BY ( SELECT 0)) RN
    --, Firstname, lastname, email -- not needed for the delete, just to show the data
    FROM troubledtable )
DELETE FROM cte
WHERE RN > 1

--see they're all gone:
SELECT email, COUNT(*)
FROM TroubledTable
GROUP BY email
HAVING COUNT(*) > 1
Using CTE's to remove dupes can be a pretty cool way of doing the job. If you're only removing a small percentage of the rows in your table, the CTE can actually reduce your locking from a table lock to a page or row lock.
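One more variation worth a quick sketch, not covered above: if you do have to clean up a table where a large share of the rows are duplicates, you can run the same CTE delete in batches with DELETE TOP, so each transaction, and therefore each set of locks, stays small.

-- Batched version of the CTE delete, assuming the same TroubledTable structure.
-- Loops until a pass deletes no rows, i.e. no duplicates remain.
DECLARE @rowsDeleted INT
SET @rowsDeleted = 1

WHILE (@rowsDeleted > 0)
BEGIN
    ;WITH cte AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email
                                  ORDER BY ( SELECT 0)) RN
        FROM troubledtable )
    DELETE TOP (5000) FROM cte
    WHERE RN > 1;

    SET @rowsDeleted = @@ROWCOUNT;
END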
As with everything, you need to choose the method that works best for your situation. You'll have to make the judgement call on when each solution will work best for you.
That's it for the design patterns for removing duplicates. Next time we're going to cover a way of preventing duplicates from being inserted in the first place by performing upserts, instead of just inserting new data!
Until then, if you have any questions, please let me know! I'm here to help.
Original Article