Sunday, 30 November 2008

File fragmentation

Here is something administrators and system architects often overlook: file fragmentation. Many believe it doesn't impact their systems enough to be worth factoring into the design, but the long-term effect can have a high impact on performance and can even prevent service.

To prevent this we first have to understand how it happens. Luckily this isn't rocket science; in fact it's quite simple. Imagine that your hard disk is a bookshelf. When all is working well the books are in alphabetical order, but if there isn't room in the place where a book should go, it takes the first available space on the shelf. After a few hundred read and write operations you end up with many books in places that make them harder to find, which pushes seek times up. With large files, which can end up spread over many parts of the drive, you can even hit I/O timeouts on file read operations.

On SQL Server it can lead to tasks and operations such as backups failing; on file servers it can lead to files not opening on the first request and general complaints from users about the performance of the servers.

One of the quickest ways to prevent this from happening is not to mix the data types and applications that have access to a volume.

If you have many applications placing data on a volume you can quickly end up with a fragmented drive. One of the best examples I have come across was a SQL backup kept on the same volume as the database itself. This doesn't sound like much of an issue at first, but let me explain.
The database is growing and the backup is recreated every day. When the database grows and there is not enough space to extend the data at the end of its file, the new extent is created just after the end of the backup file. Later that day the backup is deleted and recreated; it is also larger, so it fills the same space plus some more just past the end of the database file again.
This results in what I call the leap-frog effect, as data is always being added at the end. In this case the volume quickly became 98% fragmented and SQL Server threw I/O errors, with operations timing out because they were taking longer than 15 seconds.
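
If you suspect this is happening to you, the built-in defrag tool can report how bad things are before you commit to a full defragmentation run; a quick sketch, where the drive letter is just an example:

rem analyse the volume only, don't defragment it yet
defrag e: -a -v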

I know what you're thinking: this was some old hardware and that's why it made such a big difference... well, you'd be wrong. This was new, sixth-generation hardware, and the data was on SAN drives, which as you know can be up to twenty times faster than local drives. We are talking about a database that was only 120GB, and bearing in mind I've seen 2.8TB SQL databases on the same hardware without issue, this really begins to show how much fragmentation can impact a server.

SQL Server has encountered 21 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\mssql\data\testdb_log.ldf] in database [testdb]

On file servers the same thing can happen, and it appears as SCSI I/O errors where Windows is not getting the data as quickly as it expects. On file servers the problem is more complicated: a file share is a file share, so there isn't a lot that can be done. Or is there?

Well, you can take some steps. Apart from regular defragmentation of the volume, you can also split the shares so that application shares and user shares are on separate volumes.

So, some quick bullet points for you.

Separate System, Data, Log and Backup volumes for SQL servers.

Separate file shares by type of share, and schedule regular defragmentation of those volumes.

Create volumes based on use, and consider how the data will behave under many write operations.
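
On the SQL side you can also blunt the leap-frog effect by pre-sizing files, so the database doesn't auto-grow in small steps between ever-larger backup files. A minimal sketch; the database and logical file names here are hypothetical:

-- grow the data file once, in one large chunk, instead of many small auto-grow steps
ALTER DATABASE [testdb] MODIFY FILE (NAME = N'testdb_data', SIZE = 20480MB)

-- and set a fixed, reasonably large growth increment for when it does need to grow
ALTER DATABASE [testdb] MODIFY FILE (NAME = N'testdb_data', FILEGROWTH = 512MB)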

Tuesday, 11 November 2008

Maintenance task fails on reorganizing indexes

The most common problem with in-house or bespoke databases is developers forgetting to turn page locks on. Now this doesn't sound like the largest problem, and it can easily be fixed with a simple query like this.

ALTER INDEX [name_of_index] ON [dbo].[name_of_table] REBUILD WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, SORT_IN_TEMPDB = OFF, ONLINE = OFF )
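
If you don't want the cost of a full rebuild just to change the locking option, ALTER INDEX on SQL Server 2005 and later also supports a SET clause that changes only the option itself, something like:

ALTER INDEX [name_of_index] ON [dbo].[name_of_table] SET ( ALLOW_PAGE_LOCKS = ON )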

However, the error output from the job may not clearly tell you which database it came from. If you have a large number of databases with similar or identical table names in them, or simply too many to check one by one, then you'll be happy to know there is a quicker way.

So here is a way around the problem: I have created a query that will list all the databases, tables and indexes that have page locks disabled, making the task of correcting the problem much easier.


SET NOCOUNT ON

DECLARE @CurrentDB NVARCHAR(128)

IF OBJECT_ID('tempdb..#tblServerDatabases', 'U') IS NOT NULL
DROP TABLE #tblServerDatabases
-- create temp table
CREATE TABLE #tblServerDatabases (DBName NVARCHAR(1024))

IF OBJECT_ID('tempdb..#tblDBindexes', 'U') IS NOT NULL
DROP TABLE #tblDBindexes
-- create temp table
CREATE TABLE #tblDBindexes (
DBName NVARCHAR(1024),
[index_name] NVARCHAR(1024),
[index_id] NVARCHAR(1024))

IF OBJECT_ID('tempdb..#tblDBtables', 'U') IS NOT NULL
DROP TABLE #tblDBtables

-- create temp table for finding object id in tables
CREATE TABLE #tblDBtables (
DBName1 NVARCHAR(1024),
[table_name] NVARCHAR(1024),
[table_id] NVARCHAR(1024))


-- list the server's databases, excluding any you don't want to check
INSERT INTO #tblServerDatabases (DBName)
SELECT [name]
FROM master.dbo.sysdatabases
WHERE [name] NOT IN ('model', 'tempdb')

SELECT @CurrentDB = min(DBName)
FROM #tblServerDatabases

WHILE @CurrentDB IS NOT NULL
BEGIN
INSERT INTO #tblDBindexes(
index_name, index_id)
EXEC ('USE [' + @CurrentDB + '] select [name] as index_name, [object_id] as index_id from sys.indexes where allow_page_locks = 0')

UPDATE #tblDBindexes
SET DBName = @CurrentDB
WHERE DBName IS NULL

SELECT @CurrentDB = min(DBName)
FROM #tblServerDatabases WITH (NOLOCK)
WHERE DBName > @CurrentDB
END

-- fill temp table with results of query for table object id
SELECT @CurrentDB = min(DBName)
FROM #tblServerDatabases

WHILE @CurrentDB IS NOT NULL
BEGIN
INSERT INTO #tblDBtables(
table_name, table_id)
EXEC ('USE [' + @CurrentDB + '] select [name] as table_name, [object_id] as table_id from sys.tables')

UPDATE #tblDBtables
SET DBName1 = @CurrentDB
WHERE DBName1 IS NULL

SELECT @CurrentDB = min(DBName)
FROM #tblServerDatabases WITH (NOLOCK)
WHERE DBName > @CurrentDB
END

SET NOCOUNT OFF

-- join temp tables and show results
-- (match on database name as well as object id, since object ids can repeat across databases)
SELECT #tblDBindexes.DBName, index_name, table_name
FROM #tblDBindexes
JOIN #tblDBtables
ON #tblDBtables.[table_id] = #tblDBindexes.[index_id]
AND #tblDBtables.DBName1 = #tblDBindexes.DBName

-- Cleanup
IF OBJECT_ID('tempdb..#tblServerDatabases', 'U') IS NOT NULL
DROP TABLE #tblServerDatabases

IF OBJECT_ID('tempdb..#tblDBindexes', 'U') IS NOT NULL
DROP TABLE #tblDBindexes

IF OBJECT_ID('tempdb..#tblDBtables', 'U') IS NOT NULL
DROP TABLE #tblDBtables

This script has already filtered out the system indexes, leaving only the ones created by the designer. From this list you should be able to find the database containing the index that is causing the error during the reorganize.

Sunday, 2 November 2008

Physical Security

Following on from the last posting we are going to talk about security, starting with physical security... now most of you think of keys and locked rooms when someone says physical security, or worse yet, mace spray.

Sadly I'm not talking about your personal security, although you should take some steps there as well.

We are in fact talking about access to your servers, and consequently your information. The majority of you will have taken some steps to lock your servers and switches away to prevent Joe Public from touching them, but will this be enough to stop an experienced data thief?

Truthfully, no: you have locked away the hardware but not what it's connected to. As long as an RJ45 port comes out into the office somewhere, any device can be connected to it.

So today we are going to talk about using 802.1x to stop unauthenticated PCs being connected, first of all, and then move on to port security for the second part, to reduce the chance of a Man In The Middle (MITM) attack.

To begin with you'll need to set up IAS on your server and enter the IP address of the switch along with the passkey that the switch and the IAS service will use to authenticate with one another; we will assume you have done this already. If you need help with this you can go to http://technet.microsoft.com/en-us/network/bb643123.aspx where you will find how to configure the IAS service.

In today's example we'll be using a Cisco switch running 12.2 IOS with 802.1x.
802.1x requires a device to authenticate itself with the switch before the switch will forward any packets to or from the device. This is an important first step for good network security.

aaa new-model
aaa authentication login default group radius
aaa authentication dot1x default group radius
dot1x system-auth-control
radius-server host 192.168.0.44 auth-port 1812 acct-port 1813 key IASPaSSwOrd123

Now that the global setting is in place you’ll need to apply it to the interfaces, see below how I’ve configured interface 0/1 to use the server authentication.


interface FastEthernet0/1
switchport mode access
dot1x port-control auto
spanning-tree portfast
spanning-tree bpduguard enable
speed auto

Now that we have set authentication on the port, we can look at some other things we can do to improve security still further.
With port security enabled, the switch learns the MAC address of the device plugged into the port. If the MAC address changes (e.g. someone plugs in a foreign machine) the switch shuts the port down.

interface FastEthernet0/3
switchport mode access
switchport port-security
dot1x port-control auto
spanning-tree portfast
spanning-tree bpduguard enable
speed auto

Now this is good, but it might not be the best fit: you might have decided to have more than one PC on a port, so you can allow the switch to learn more than one MAC address by changing the setting. In this case I've chosen for it to remember three MAC addresses.

interface FastEthernet0/1
switchport mode access
switchport port-security
switchport port-security maximum 3
dot1x port-control auto
spanning-tree portfast
spanning-tree bpduguard enable
speed auto
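
To check what a port has learned and whether dot1x is doing its job, the switch's own show commands should be enough (interface name as per the examples above):

show port-security interface FastEthernet0/1
show dot1x interface FastEthernet0/1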

Another alternative, if you don't want to be manually turning ports back on every time you change a bit of hardware, is to set a timeout after which the port will re-enable itself. This time is counted in minutes, so as an example 600 minutes equals 10 hours.
Here is an example of port 5/1 re-enabling after 10 hours of shutdown.

set port security 5/1 shutdown 600
Note this is set at a global level, not at interface level (this particular syntax is CatOS rather than IOS).
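
If your switch runs IOS like the 12.2 examples above, the equivalent mechanism as far as I know is errdisable recovery, where the interval is given in seconds rather than minutes (36000 seconds = 10 hours):

errdisable recovery cause psecure-violation
errdisable recovery interval 36000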

This automatic re-enabling of ports is good for preventing you from losing all the ports on the switch because of some re-cabling, or a user plugging and unplugging some unpermitted device. But it defeats the point if the timeout is too low, so try to keep it around the 8-hour mark or higher. That way a user who has done something wrong will have to report it, and a data thief gets only one chance a day to try to connect, forcing him or her to wait hours or place a high-profile call to the helpdesk to get the port re-enabled.

Saturday, 1 November 2008

An Overview of Disaster Recovery

The majority of people talk about disaster recovery as restoring systems to the same or similar hardware. Roughly half the reason for this is the operating system not being flexible about hardware changes, and the other half is the fault of the backup software.

A good number of backup programs restore to the same hardware and leave you to fix any hardware-related issues. Some backup software vendors have, however, developed add-on modules that will do this for you and can drastically improve your disaster recovery time.

However, we all know that if you spend enough money anything is possible, and I'm not writing about which software is best. As with all things, there is more than one way to skin a cat, and today I'm going to show you how to solve the above-mentioned problem by design.

The way we are going to approach this is as a design problem: we are going to admit that we have some weaknesses in our network, and instead of relying on software to save us we are going to design our way out of the issues.

Let's start at the beginning. In a live environment, the very first thing you need to restore is the domain. For a lot of disaster recoveries the domain is the make-or-break part: if you can't restore the domain then you have nothing, as it's pointless having an application or file that you can't open.

"But I can open files without a domain!" I hear you say... "Oh dear," comes my reply. If you can open files without a domain then you have a rather large security problem, but that is something I'll address in another posting.


Domain
So let's get down to business. We know that we need the most stable and flexible solution for our domain controllers; in a nutshell, virtualization. Domain controllers are generally processor-heavy in large environments but not much else; I've seen a 90,000-account domain run on a server with only 1GB of memory.

Since these servers are not so resource-hungry, they naturally lend themselves to being virtualized. This also removes the domain controller's dependency on hardware and makes it recoverable to any hardware that can run the image, making you vendor-independent for the hardware; hypervisor is the key word in all this.

Having one or more forms of virtualization technology available to you also lets you run your own DR tests and iron out the bugs without having a real disaster.
VirtualBox is a nice free one from Sun Microsystems if you're wondering where to start or just don't want to buy new software.


Network Access
Static IPs are not always a good thing, and in a disaster they can be a real problem: for example, when hardware changes, the new network adapter is detected and all the IP addressing on the old adapter is lost. You have two options. The first is to use DHCP where possible, and I recommend using this as much as you can.

The second is to use netsh to back up the IP configuration against a common interface name such as "LAN". Then all you have to do after a restore is rename the interface to LAN and execute netsh exec c:\ip.txt to bring your IP configuration back. This is extremely handy if you have six or more network cards and don't want to spend half an hour setting them up.
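
As a rough sketch (the path and interface name are just examples), the backup and restore pair looks like this:

rem save the current interface configuration to a file
netsh -c interface dump > c:\ip.txt

rem after the rebuild, rename the connection to match the old name, then replay it
netsh exec c:\ip.txt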

Network cards can often be an issue in a restore. If you have the space, keep a store of network card drivers on the system disk so you can easily reinstall the driver... it's quite common for network cards to be Intel-based these days, so the store of drivers shouldn't be large and this is easy to implement.

Access to network resources via IP and policy-based security can also be time-consuming to restore, so backing up your IAS configuration, RAS settings and DHCP scopes using netsh is a way to save some time on the restore. After all, you can make 30 or 40 clicks, or you can just type netsh exec file.txt. This might not sound like much of a time saver at first, but when you have 8 to 10 servers it can save more than an hour and cut the time spent down to less than a minute; if you also have PsTools you can execute the restore simultaneously on all servers.
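
For instance, something along these lines works for DHCP scopes, and on Windows Server 2003 the aaaa context can dump the IAS configuration as a replayable script (file names are just examples):

rem export every DHCP scope to a file, and import it on the rebuilt server
netsh dhcp server export c:\dhcp.txt all
netsh dhcp server import c:\dhcp.txt all

rem dump the IAS configuration as a netsh script that can be replayed with netsh exec
netsh aaaa show config > c:\ias.txt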


Scripting
Scripting is an important part of IT not just documenting how tasks are done but using the tools to hand to reduce the time taken to do them.

Imagine for example that I wanted to install a Windows hotfix on 50 workstations and I didn't have WSUS or internet access on that LAN segment for whatever reason. Do I spend 10 minutes logging on to each workstation and installing said hotfix, ultimately losing several precious hours, or do I sit at my desk, execute c:\pstools\psexec @c:\workstationlist.txt "\\fileserver\share\critical\patch.exe /quiet /passive /norestart" and never have to leave my chair?

This kind of scripting can be used in restores with just as much efficiency, as you may need to redeploy patches and updates that are missing from the restore media.
Remember, whichever tools you choose to use, they should be used across the entire environment, as differing restore techniques only add to the complications and the time needed to restore the environment.


Applications
Now that we have finished on the scripting side of things, let's move on to the applications. Using Active Directory based authentication is not just easier to administer centrally; it is also better for application restores, as it saves time on restoring SQL-based logins where there is a mismatch in SIDs, along with file permissions and other such lovely security-identifier-related issues. So check that all existing applications use domain authentication, and identify any that might be a problem to restore.

Another largely overlooked point is that some applications are outside the backup software's ability to restore, so look into the best practices of each software vendor you use and check that the application has a recovery path, as you might need to delete locked files and/or restore the application using its own set of backup tools.