We are starting a series of “OpsMgr by Example” blog posts. The intent is to provide both the 5,000 foot/meter perspective on what to do and show the details for a particular type of tuning performed in an example deployment.
This first one covers tuning for an Exchange MP deployment (not all of Exchange). We hope to follow up with OpsMgr by Example posts for the AD MP, the SQL MP, and potentially the Exchange Management Pack.
Let's begin with configuring baselines. While tuning the OpsMgr 2007 Exchange 2003 Management Pack, we found that most of the alerts generated came from the calculated baseline rules. The following is a step-by-step guide to configuring the sensitivity of these rules to decrease alert volume. But first, a huge thanks for the explanation that started us down the path to this tuning:
The following alerts generated most of the volume in our OpsMgr environment:
Information Store Transport Temp Table is outside the calculated baseline
Mailbox Store Send Queue is outside the calculated baseline
SMTP Local queue is outside calculated baseline
SMTP Messages in the Queue Directory is outside calculated baseline
SMTP Remote Queue is outside the calculated baseline
SMTP Remote Retry Queue is outside the calculated baseline
IS Virtual Bytes is outside the calculated baseline
Number of RPC requests is outside the calculated baseline
Perform the following steps for each alert that is generating significant volume in your environment, one alert at a time. We recommend following the order listed, as it groups the rules by type and makes them easier to find. The first six alerts above target Exchange Queue and the last two target Exchange IS Service; the steps below use Exchange Queue, so substitute Exchange IS Service where that is the target of the rule and monitor you are changing. Make each change at both the rule and the monitor level.
Also, we strongly recommend you save your changes to an unsealed MP other than the Default management pack.
Steps to resolve (perform all of these steps for each alert in your environment that needs to be tuned):
1. Find the rule that applies to the alert and disable it. To find the rule, it is easiest to change the scope to the two targets we need, Exchange Queue and Exchange IS Service; both are available when you click Scope and choose the option to view all targets. Then look for rules whose names begin with “Baseline Collection”. This narrows the list to about 17 rules instead of more than 6,000. The name of each rule is listed in the mapping below. To disable the rule, right-click it, select Overrides, disable the rule for all objects of type: Exchange Queue, and click Yes to accept.
2. Change the rule sensitivity to 2.81: right-click the rule, select Overrides, Override the Rule, For all objects of type: Exchange Queue, check the Sensitivity parameter, set it to 2.81 if it is not already set to that value, and click OK.
3. Find the monitor that applies to the alert, either by searching or by scoping to the object type identified for the rule, and disable it: right-click the monitor, select Overrides, disable the monitor for all objects of type: Exchange Queue, and click Yes to accept.
4. Change the monitor inner sensitivity to 2.81: right-click the monitor, select Overrides, Override the Monitor, For all objects of type: Exchange Queue, check the Inner Sensitivity parameter, set it to 2.81 if it is not already set to that value, and click OK.
5. Change the monitor outer sensitivity to 3.31: right-click the monitor, select Overrides, Override the Monitor, For all objects of type: Exchange Queue, check the Outer Sensitivity parameter, set it to 3.31 if it is not already set to that value, and click OK.
6. Re-enable the monitor: right-click the monitor, select Overrides Summary, and delete the override that reads Type, Exchange Queue, Enabled, False.
7. Return to the rule identified in step 1 and re-enable it: right-click the rule, select Overrides Summary, and delete the override that reads Type, Exchange Queue, Enabled, False.
Mapping for Alerts, Rules and Monitors:
ALERT=Information Store Transport Temp Table is outside the calculated baseline
RULE=Baseline Collection Rule for Information Store temp table number of entries (Rules, of type Exchange Queue)
MONITOR=IS Transport Temp Table Monitor (Exchange Queue, Entity Health, Performance)
ALERT=Mailbox Store Send Queue is outside the calculated baseline
RULE=Baseline Collection Rule for Mailbox Store Send Queue Length (Rules, of type Exchange Queue)
MONITOR=MB Store Send Queue Monitor (Exchange Queue, Entity Health, Performance)
ALERT=SMTP Local queue is outside calculated baseline
RULE=Baseline Collection Rule for SMTP Server Local Queue (Rules, of type Exchange Queue)
MONITOR=SMTP Local Queue Monitor (Exchange Queue, Entity Health, Performance)
ALERT=SMTP Messages in the Queue Directory is outside calculated baseline
RULE=Baseline Collection for SMTP Message Queue Directory (Rules, of type Exchange Queue)
ALERT=IS Virtual Bytes is outside the calculated baseline
RULE=Baseline Collection Rule for IS Virtual Bytes (Rules, of type Exchange IS Service)
MONITOR=IS Virtual Bytes Monitor (Exchange IS Service, Entity Health, Performance)
ALERT=Number of RPC requests is outside the calculated baseline
RULE=Baseline Collection Rule for IS RPC Requests (Rules, of type Exchange IS Service)
MONITOR=IS RPC Requests Monitor (Exchange IS Service, Entity Health, Performance)
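When tuning several alerts in a row, it can help to keep the rule/monitor pairs above and the seven override steps as data. The following is a minimal sketch of that idea; `BASELINE_MAP` and `override_plan` are hypothetical names (not part of OpsMgr). The sketch only prints a checklist using the initial values 2.81 and 3.31, and the overrides themselves are still made in the Operations console as described. Because the mapping lists no monitor for the Queue Directory alert, that alert gets rule-only steps.

```python
# Hypothetical helper: alert name -> (rule, monitor or None, target class),
# transcribed from the mapping above. This is not an OpsMgr API.
BASELINE_MAP = {
    "Information Store Transport Temp Table is outside the calculated baseline": (
        "Baseline Collection Rule for Information Store temp table number of entries",
        "IS Transport Temp Table Monitor",
        "Exchange Queue",
    ),
    "Mailbox Store Send Queue is outside the calculated baseline": (
        "Baseline Collection Rule for Mailbox Store Send Queue Length",
        "MB Store Send Queue Monitor",
        "Exchange Queue",
    ),
    "SMTP Local queue is outside calculated baseline": (
        "Baseline Collection Rule for SMTP Server Local Queue",
        "SMTP Local Queue Monitor",
        "Exchange Queue",
    ),
    "SMTP Messages in the Queue Directory is outside calculated baseline": (
        "Baseline Collection for SMTP Message Queue Directory",
        None,  # the mapping above lists no monitor for this alert
        "Exchange Queue",
    ),
    "IS Virtual Bytes is outside the calculated baseline": (
        "Baseline Collection Rule for IS Virtual Bytes",
        "IS Virtual Bytes Monitor",
        "Exchange IS Service",
    ),
    "Number of RPC requests is outside the calculated baseline": (
        "Baseline Collection Rule for IS RPC Requests",
        "IS RPC Requests Monitor",
        "Exchange IS Service",
    ),
}


def override_plan(alert):
    """Return the ordered override steps for one alert as (action, object, scope)."""
    rule, monitor, target = BASELINE_MAP[alert]
    scope = f"all objects of type {target}"
    steps = [
        ("disable", rule, scope),
        ("set Sensitivity = 2.81", rule, scope),
    ]
    if monitor is not None:
        steps += [
            ("disable", monitor, scope),
            ("set Inner Sensitivity = 2.81", monitor, scope),
            ("set Outer Sensitivity = 3.31", monitor, scope),
            ("delete Enabled=False override", monitor, scope),
        ]
    # The rule is re-enabled last, after the monitor.
    steps.append(("delete Enabled=False override", rule, scope))
    return steps


for action, obj, scope in override_plan("Number of RPC requests is outside the calculated baseline"):
    print(f"{action:32} {obj} ({scope})")
```

The plan keeps the order used above: disable first, change the sensitivities, then re-enable the monitor before the rule.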
UPDATE for the IS Virtual Bytes and RPC requests:
After we configured the inner and outer sensitivities (steps 4 and 5 above), several rules were still generating large volumes of alerts. These include:
IS Virtual Bytes is outside the calculated baseline
Number of RPC requests is outside the calculated baseline
These alerts were identified as "Above Inner Envelope." To reduce their frequency, we changed the sensitivity on both the rule and the monitor from 2.81 to 3.31 in the overrides.
From reading up on the sensitivity setting, it appears that increasing this value decreases the frequency of the alerts, because it makes the rule less sensitive to deviations from the calculated baseline.
In theory, if 3.31 is not sufficient, the next value to try is 3.81: the change from 2.81 to 3.31 was an increase of 0.5, so another 0.5 increase is a logical next step if a further change is needed. This is an extrapolation based on what we have seen so far, as we do not know the internal workings of the algorithm!
The feedback we have seen indicates that 3.81 works quite well.
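The escalation above is simple arithmetic, sketched here for clarity. The function name is hypothetical, and the fixed 0.5 step is the extrapolation described above, not a documented property of the baseline algorithm.

```python
def sensitivity_candidates(start=2.81, step=0.5, count=3):
    """List sensitivity values to try, in order.

    The 0.5 step is an extrapolation from observed behavior, not a documented
    rule; round() removes floating-point noise from the additions.
    """
    return [round(start + step * i, 2) for i in range(count)]


print(sensitivity_candidates())  # [2.81, 3.31, 3.81]
```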
We've talked about moving the OpsMgr database from one SQL Server to another machine, but what if you want to keep it on the same server but need to move it to another drive? You may decide you want to do this if you have a second disk drive with more space, or to move to another spindle for better performance.
In the example below, we will move the database from the D: drive to the E: drive. The database file names are the default names of OperationsManager.mdf and OperationsManager.ldf. If you changed the database name as part of your OpsMgr installation, the filenames will also be different.
To determine the location of the files for the OperationsManager database, log into SQL Server Management Studio, connect to the server running the OperationsManager database, select the OperationsManager database, click New Query, and enter: sp_helpfile. The filename column shows the location (and names) of the OperationsManager database files.
First, stop the OpsMgr services on the RMS (OpsMgr Config Service, OpsMgr Health Service, OpsMgr SDK Service). Also stop the OpsMgr Health Service on any other management servers.
Back up the OperationsManager database and the master database as well, just to be safe. You can do this in SQL Server Management Studio: select the OperationsManager database in the left pane, right-click, select Tasks, select Back Up..., and follow the instructions. Be sure to do a full backup. Do the same for the master database.
Detach the OperationsManager database: in a query window, select the master database and type: sp_detach_db 'OperationsManager'. (Don't highlight the OperationsManager database in the left pane, or it will not detach because it is in use! The database also will not detach, again because it is in use, if any management server is still running the OpsMgr Health Service, so make sure that service is stopped everywhere.)
Using Windows Explorer, copy the data and log files from the current location to the new drive/location. We are assuming that the location is E:\Sqldata.
Re-attach the database. In a SQL Query window select the master database and type: sp_attach_db 'OperationsManager', 'E:\Sqldata\OperationsManager.mdf', 'E:\Sqldata\OperationsManager.ldf'
Verify that it worked using sp_helpfile: select the OperationsManager database and, in a query window, type: sp_helpfile. The filename column returned should reflect the new locations.
Restart the services on the RMS and other management servers and validate functionality of the new database.
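If you move databases this way more than once, generating the T-SQL from the file names avoids typos in the paths. The following is a minimal sketch; `move_database_tsql` is a hypothetical helper, and it assumes the default file names OperationsManager.mdf and OperationsManager.ldf (confirm yours with sp_helpfile first). Stopping the services, taking the backups, and copying the files remain manual steps, which is why the output is split into "before copy" and "after copy" parts.

```python
import ntpath  # Windows path rules, usable on any OS


def move_database_tsql(db_name="OperationsManager", new_dir=r"E:\Sqldata"):
    """Return (before_copy, after_copy) lists of T-SQL statements.

    Assumes the default file names <db_name>.mdf and <db_name>.ldf.
    """
    data_file = ntpath.join(new_dir, f"{db_name}.mdf")
    log_file = ntpath.join(new_dir, f"{db_name}.ldf")
    before_copy = [
        "USE master;",  # never detach while connected to the database itself
        f"EXEC sp_detach_db '{db_name}';",
    ]
    after_copy = [
        "USE master;",
        f"EXEC sp_attach_db '{db_name}', '{data_file}', '{log_file}';",
        f"USE {db_name};",
        "EXEC sp_helpfile;  -- filename column should show the new location",
    ]
    return before_copy, after_copy


before, after = move_database_tsql()
print("\n".join(before))
print("-- copy the .mdf and .ldf files to the new location, then run:")
print("\n".join(after))
```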
This entry provides the specifics for moving the OperationsManager database; see http://support.microsoft.com/kb/224071 for the general process for database moves.
Kerrie and Cameron co-presented at TechEd 2007. The video is available at mms://techedmsftwm.fplive.net/techedmsft/2007/ARC311.wmv. In the presentation we discuss what’s new in OpsMgr 2007, Methodology, Server Roles/Components, Simple and Complex Operations Manager Architectures, The Six Key Questions, SCCP Demo, Case Studies, and wrap-up with a Q&A session.
Cameron went through the video and broke out the topics from a timing perspective. The presentation is about 1 hour and 20 minutes, and you can find the topics in the video file as follows:
Topic | Placement in wmv
------|-----------------
Introductions & Agenda | 0-6 minutes
What's New | 7 minutes
Methodology | 19 minutes
Database Sizing | 34 minutes
Server Roles/Components | 46 minutes
Simple and Complex Operations Manager Architectures | 51 minutes
The Six Key Questions | 54 minutes
SCCP Demo | 55 minutes
Case Studies | 59 minutes
Q&A | 61 minutes
Q&A topics included discussions on:
.NET 1.1 and the RMS
handling large numbers of distributed apps
MOM 2005 MPs to Ops2007 MPs conversion
Virtual Machines and OpsMgr
the Clustering Management pack
etc.
There are some quiet times during which people were not on the microphones.
Much of this information is also available in greater detail in our forthcoming book, System Center Operations Manager 2007 Unleashed.
In earlier versions of OpsMgr, Microsoft provided instructions for moving the operational database to another database server (see http://support.microsoft.com/kb/917894 and http://support.microsoft.com/kb/297771). This information is not yet available for Operations Manager 2007, so we are providing the following steps:
Stop all OpsMgr services on the Root Management Server. If you have multiple management servers, you will need to stop the services on those machines as well.
Using SQL Server Management Studio, connect to the source database server and backup the OperationsManager database.
Connect to the destination server and create Windows/AD SQL logins for the following OpsMgr accounts: SDK, MSAA (Management Server Action Account), and DWWA (Data Warehouse Write Action account).
Copy the OperationsManager database backup file to the destination database server, and restore the database to the destination database server.
Using SQL Server Management Studio on the destination database server, right-click the OpsMgr SDK login and select Properties.
In the login properties, go to the User Mapping page and select the OperationsManager database. Ensure the following database roles are assigned to the SDK account:
db_datareader
db_datawriter
db_ddladmin
db_owner
dbmodule_users
sdk_users
Click OK.
On the RMS and each of your management servers, open REGEDIT, browse to HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Setup, and update the string value DatabaseServerName to the name of the new database server.
Reboot the RMS and other management servers.
Many thanks to Kendra Thorpe of Vanderbilt University for helping with this information!
Modifying grooming settings for the OpsMgr databases
OpsMgr 2007 includes three databases: Operational, Data Warehouse, and the ACS database (assuming you have implemented ACS). We discuss how to change the grooming settings for each.
Operations database
The Operational database is the most straightforward of the three, as you can change the settings in the Operations console in the Administration node, under Settings -> General -> Database Grooming. Note that there are multiple data types, each having its own setting. This becomes more significant when we look at grooming the data warehouse.
The default setting for each of the data types is to remove or groom the data after seven days.
Data Warehouse database
The console does not have an interface to modify data retention settings for the data warehouse. Instead, you change data warehouse retention by modifying columns in the StandardDatasetAggregation table in the Data Warehouse database.
By default, data is groomed out at different intervals depending on its degree of aggregation. Data is stored by type, and retention periods range from 10 days to 400 days depending on the type of data and its aggregation level. You can view the grooming settings by running the following SQL query:
USE OperationsManagerDW
SELECT AggregationIntervalDurationMinutes, BuildAggregationStoredProcedureName, GroomStoredProcedureName, MaxDataAgeDays, GroomingIntervalMinutes, MaxRowsToGroom FROM StandardDatasetAggregation
The default settings returned by this query are displayed in the following table:
AggregationIntervalDurationMinutes | BuildAggregationStoredProcedureName | GroomStoredProcedureName | MaxDataAgeDays | GroomingIntervalMinutes | MaxRowsToGroom
-----------------------------------|-------------------------------------|--------------------------|----------------|-------------------------|---------------
NULL | NULL | EventGroom | 100 | 240 | 100000
NULL | NULL | AlertGroom | 400 | 240 | 50000
NULL | NULL | StateGroom | 180 | 60 | 50000
60 | StateAggregate | StateGroom | 400 | 60 | 50000
1440 | StateAggregate | StateGroom | 400 | 60 | 50000
NULL | AEMAggregate | AemGroom | 30 | 240 | 100000
1440 | AEMAggregate | AemGroom | 400 | 240 | 100000
NULL | PerformanceAggregate | PerformanceGroom | 10 | 240 | 100000
60 | PerformanceAggregate | PerformanceGroom | 400 | 240 | 100000
1440 | PerformanceAggregate | PerformanceGroom | 400 | 240 | 100000
To make some sense of this, consider the following:
The first column (AggregationIntervalDurationMinutes) is the interval in minutes that data is aggregated. NULL is raw data, 60 is hourly, and 1440 is daily. You can see that some performance data is not aggregated at all, some is on an hourly basis, and some is daily.
MaxDataAgeDays is the maximum number of days data is retained. Depending on the type of data and its degree of aggregation, defaults can range from 10 to 400 days. This is the value we will be modifying, based on the particular type of data (Events, Alerts, State, AEM, or Performance data), and level of aggregation.
GroomingIntervalMinutes is the grooming frequency, or how often the grooming stored procedure runs. Performance, Alert, Event, and AEM data are groomed every 240 minutes (4 hours); the procedure that grooms State data runs every hour.
MaxRowsToGroom is the maximum number of rows processed in a single execution of the grooming procedure.
As an example of how this works, look at non-aggregated Event data, the first row of the table. The procedure name EventGroom (GroomStoredProcedureName) tells us that this row pertains to Events. Event data is not aggregated (AggregationIntervalDurationMinutes = NULL) and is kept for 100 days (MaxDataAgeDays). The EventGroom procedure runs every 240 minutes, or 4 hours (GroomingIntervalMinutes), and each run grooms a maximum of 100,000 rows (MaxRowsToGroom).
The following SQL code changes the retention period for Event data:
USE OperationsManagerDW
UPDATE StandardDatasetAggregation
SET MaxDataAgeDays = <number of days to retain data>
WHERE GroomStoredProcedureName = 'EventGroom'
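Changing the retention period for Event data is relatively easy: because Event data is never aggregated, it has only one row in the StandardDatasetAggregation table. Data types with several aggregation levels have one row per level, so the update must also filter on the aggregation interval. The following syntax can be used as a basis for updating retention periods for other types of data: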
USE OperationsManagerDW
UPDATE StandardDatasetAggregation
SET MaxDataAgeDays = <number of days to retain data>
WHERE GroomStoredProcedureName = '<procedure name>' AND AggregationIntervalDurationMinutes = <aggregation interval duration>
Raw (non-aggregated) rows hold NULL in AggregationIntervalDurationMinutes; to target them, use AggregationIntervalDurationMinutes IS NULL instead, because a comparison with = NULL does not match any rows. As an example of changing retention for a data type that has multiple aggregation levels, the following updates the retention period for Performance data that is aggregated on an hourly basis:
USE OperationsManagerDW
UPDATE StandardDatasetAggregation
SET MaxDataAgeDays = <number of days to retain data>
WHERE GroomStoredProcedureName = 'PerformanceGroom' AND AggregationIntervalDurationMinutes = 60
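Because each data type and aggregation level is a separate row, it is easy to target the wrong row by hand. The following is a minimal sketch of a generator for the UPDATE statements above. `retention_update_sql` is a hypothetical helper; it only accepts the groom procedure names from the default table, and it uses IS NULL for raw data. On SP1, the dwdatarp.exe tool mentioned below is the supported alternative.

```python
# Groom procedure names that appear in the default StandardDatasetAggregation rows.
GROOM_PROCEDURES = {"EventGroom", "AlertGroom", "StateGroom", "AemGroom", "PerformanceGroom"}


def retention_update_sql(groom_proc, days, interval_minutes=None):
    """Build an UPDATE that sets MaxDataAgeDays for one kind of warehouse data.

    interval_minutes: None for raw data, 60 for hourly, 1440 for daily.
    Raw rows are matched with IS NULL, because "= NULL" matches nothing
    under SQL Server's default ANSI_NULLS setting.
    """
    if groom_proc not in GROOM_PROCEDURES:
        raise ValueError(f"unknown groom procedure: {groom_proc}")
    if not isinstance(days, int) or days < 1:
        raise ValueError("days must be a positive whole number")
    if interval_minutes is None:
        interval = "AggregationIntervalDurationMinutes IS NULL"
    else:
        interval = f"AggregationIntervalDurationMinutes = {int(interval_minutes)}"
    return (
        "USE OperationsManagerDW\n"
        "UPDATE StandardDatasetAggregation\n"
        f"SET MaxDataAgeDays = {days}\n"
        f"WHERE GroomStoredProcedureName = '{groom_proc}' AND {interval}"
    )


print(retention_update_sql("PerformanceGroom", 180, 60))
```

For Event and Alert data, which have a single raw row each, leaving `interval_minutes` as None produces the same effect as the Event example above.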
*** As an update, if you are running OpsMgr 2007 SP1 and OpsMgr 2007 Reporting SP1, you can use the Data Warehouse Data Retention Policy Tool (dwdatarp.exe) to view and configure data warehouse data retention policies. Instructions for this command line tool and the download are available at http://blogs.technet.com/momteam/archive/2008/05/14/data-warehouse-data-retention-policy-dwdatarp-exe.aspx. Thanks to Daniel Savage of the OpsMgr team for providing this! ***
ACS database
Data is groomed out of the ACS Database based on the data retention period specified during setup, with the default being 14 days. After looking at what it takes to groom the data warehouse, ACS is relatively simple! The ACS Collector calls a SQL procedure to remove a partition that is outside of the data retention period. This procedure can be found at %SystemRoot%\system32\security\adtserver\DbDeletePartition.sql. The data retention period itself is specified in the dtConfig table in the OperationsManagerAC database.
To update the data retention period, run the following SQL query:
USE OperationsManagerAC
UPDATE dtConfig
SET Value = <number of days to retain data + 1>
WHERE Id = 6
For example, to retain 7 days of data, set Value to 8. Data accumulates at approximately 7.6 MB per day per workstation.
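The "+1" in the dtConfig value is easy to forget, so here it is as a small sketch. The function names are hypothetical. The growth estimate simply multiplies the approximate 7.6 MB per workstation per day figure quoted above and ignores servers and domain controllers, which generate more data.

```python
def acs_dtconfig_value(retention_days):
    """Value to store in dtConfig (Id = 6) for a retention period: days + 1."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return retention_days + 1


def acs_growth_mb(workstations, days, mb_per_workstation_day=7.6):
    """Rough ACS data volume from the ~7.6 MB per workstation per day figure."""
    return workstations * mb_per_workstation_day * days


value = acs_dtconfig_value(7)
print(f"UPDATE dtConfig SET Value = {value} WHERE Id = 6")  # keeps 7 days
print(round(acs_growth_mb(100, 7)))
```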
OK, so you want to have an idea how large your Operations Manager databases are going to be. And you want it to be a little more exact than “larger than a gigabyte, smaller than a terabyte.” We have been working on this question for a while now and have developed the following methodology, which provides high-level size estimates for the three databases in OpsMgr: Operations DB, Data Warehouse DB, and ACS DB.
First, the caveats. These are estimates based on our testing results and may not take into consideration aspects specific to your environment, such as large numbers of management packs, customized management packs, extremely chatty management packs, or other factors that may affect your database size. With that said, this is how we arrived at our numbers.
OpsDB: We installed a new instance of each database and measured how much space was used after installation but before agents began reporting data. (This means it contained configuration data with minimal operational data.) We then monitored which tables grew, to confirm that the growth was directly related to both the retention period of the data being held and the number of agents providing information. From this we determined an approximate per-agent impact on database size of 5 MB per day. We added a contingency factor of approximately 40% to arrive at the estimated sizing. The following is the resulting formula:
(5 MB/day x Number of Agents x Retention Days) + 510 MB = Operations Manager Database Estimate
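Grooming Interval (days); sizes in MB | 3000 Agents | 2000 Agents | 1000 Agents | 500 Agents | 100 Agents | 50 Agents | 10 Agents
-------------------------------------|-------------|-------------|-------------|------------|------------|-----------|----------
1 | 15510 | 10510 | 5510 | 3010 | 1010 | 760 | 560
2 | 30510 | 20510 | 10510 | 5510 | 1510 | 1010 | 610
3 | 45510 | 30510 | 15510 | 8010 | 2010 | 1260 | 660
4 | 60510 | 40510 | 20510 | 10510 | 2510 | 1510 | 710
5 | 75510 | 50510 | 25510 | 13010 | 3010 | 1760 | 760
6 | 90510 | 60510 | 30510 | 15510 | 3510 | 2010 | 810
7 | 105510 | 70510 | 35510 | 18010 | 4010 | 2260 | 860
8 | 120510 | 80510 | 40510 | 20510 | 4510 | 2510 | 910
9 | 135510 | 90510 | 45510 | 23010 | 5010 | 2760 | 960
10 | 150510 | 100510 | 50510 | 25510 | 5510 | 3010 | 1010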
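The Operations database formula is easy to evaluate for agent counts or grooming intervals that are not in the table. The following is a minimal sketch that implements the formula exactly as stated (the function name is hypothetical). It carries the same caveats as the formula itself.

```python
def ops_db_estimate_mb(agents, retention_days):
    """(5 MB/day x agents x retention days) + 510 MB, per the formula above."""
    return 5 * agents * retention_days + 510


# Spot checks: the 1-day row, and the 10-agent column across 1-10 days.
print([ops_db_estimate_mb(n, 1) for n in (3000, 2000, 1000, 500, 100, 50, 10)])
print([ops_db_estimate_mb(10, d) for d in range(1, 11)])

# A case outside the table: 250 agents, 4-day grooming interval.
print(ops_db_estimate_mb(250, 4))
```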
Data Warehouse DB: We used the same approach as for the Operations database, with a similar contingency factor. The following is the resulting formula:
(3 MB/day x Number of Agents x Retention Days) + 570 MB = Data Warehouse size estimate
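Retention Period; sizes in MB | 3000 Agents | 2000 Agents | 1000 Agents | 500 Agents | 100 Agents | 50 Agents | 10 Agents
------------------------------|-------------|-------------|-------------|------------|------------|-----------|----------
1 month (30 days) | 270570 | 180570 | 90570 | 45570 | 9570 | 5070 | 1470
2 months (60 days) | 540570 | 360570 | 180570 | 90570 | 18570 | 9570 | 2370
1 qtr (90 days) | 810570 | 540570 | 270570 | 135570 | 27570 | 14070 | 3270
2 qtr (180 days) | 1620570 | 1080570 | 540570 | 270570 | 54570 | 27570 | 5970
3 qtr (270 days) | 2430570 | 1620570 | 810570 | 405570 | 81570 | 41070 | 8670
1 yr (365 days) | 3285570 | 2190570 | 1095570 | 548070 | 110070 | 55320 | 11520
5 qtr (455 days) | 4095570 | 2730570 | 1365570 | 683070 | 137070 | 68820 | 14220
6 qtr (545 days) | 4905570 | 3270570 | 1635570 | 818070 | 164070 | 82320 | 16920
7 qtr (635 days) | 5715570 | 3810570 | 1905570 | 953070 | 191070 | 95820 | 19620
2 yr (730 days) | 6570570 | 4380570 | 2190570 | 1095570 | 219570 | 110070 | 22470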
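The data warehouse formula works the same way, but the table is indexed by retention period rather than days. The day counts below are an assumption inferred from the table values (30-day months, 90-day quarters, 365-day years, with 5 to 7 quarters counted as one year plus 90-day quarters); the function name is hypothetical.

```python
# Day counts inferred from the table values, not stated in the original text.
RETENTION_DAYS = {
    "1 month": 30, "2 months": 60, "1 qtr": 90, "2 qtr": 180,
    "3 qtr": 270, "1 yr": 365, "5 qtr": 455, "6 qtr": 545,
    "7 qtr": 635, "2 yr": 730,
}


def dw_estimate_mb(agents, retention_days):
    """(3 MB/day x agents x retention days) + 570 MB, per the formula above."""
    return 3 * agents * retention_days + 570


for label, days in RETENTION_DAYS.items():
    print(f"{label:>8}: {dw_estimate_mb(3000, days)} MB for 3000 agents")
```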
ACS DB: This one was more complex due to the factors involved with different types of systems reporting to ACS (workstations, servers, domain controllers). To factor in these differences we put an approximate weight to each of these (1 to a workstation, 5 to a server, 100 to a domain controller). The following is the resulting formula:
(8 MB/day x Number of Workstations x Retention Days) +
(8 MB/day x (Number of Servers x 5) x Retention Days) +
(8 MB/day x (Number of Domain Controllers x 100) x Retention Days) + 8 MB = ACS Database Size Estimate
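Grooming Interval (days); sizes in MB | 3000 WS, 600 Server, 14 DC | 2000 WS, 400 Server, 10 DC | 1000 WS, 200 Server, 8 DC | 500 WS, 100 Server, 5 DC | 100 WS, 20 Server, 2 DC | 50 WS, 10 Server, 2 DC | 10 WS, 2 Server, 1 DC
-------------------------------------|----------------------------|----------------------------|---------------------------|--------------------------|-------------------------|------------------------|----------------------
1 | 59208 | 40008 | 22408 | 12008 | 3208 | 2408 | 968
2 | 118408 | 80008 | 44808 | 24008 | 6408 | 4808 | 1928
3 | 177608 | 120008 | 67208 | 36008 | 9608 | 7208 | 2888
4 | 236808 | 160008 | 89608 | 48008 | 12808 | 9608 | 3848
5 | 296008 | 200008 | 112008 | 60008 | 16008 | 12008 | 4808
6 | 355208 | 240008 | 134408 | 72008 | 19208 | 14408 | 5768
7 | 414408 | 280008 | 156808 | 84008 | 22408 | 16808 | 6728
8 | 473608 | 320008 | 179208 | 96008 | 25608 | 19208 | 7688
9 | 532808 | 360008 | 201608 | 108008 | 28808 | 21608 | 8648
10 | 592008 | 400008 | 224008 | 120008 | 32008 | 24008 | 9608
11 | 651208 | 440008 | 246408 | 132008 | 35208 | 26408 | 10568
12 | 710408 | 480008 | 268808 | 144008 | 38408 | 28808 | 11528
13 | 769608 | 520008 | 291208 | 156008 | 41608 | 31208 | 12488
14 | 828808 | 560008 | 313608 | 168008 | 44808 | 33608 | 13448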
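Because the ACS formula applies the same 8 MB/day to every system, it can be written as a single weighted count of systems. The following is a minimal sketch (the function name is hypothetical) that uses the weights from the text: 1 per workstation, 5 per server, and 100 per domain controller.

```python
def acs_estimate_mb(workstations, servers, domain_controllers, retention_days):
    """8 MB/day per weighted system x retention days, plus 8 MB.

    Weights from the text: workstation 1, server 5, domain controller 100.
    """
    weighted_systems = workstations + 5 * servers + 100 * domain_controllers
    return 8 * weighted_systems * retention_days + 8


# Largest column of the table at the default 14-day retention.
print(acs_estimate_mb(3000, 600, 14, 14))
```

Note how quickly domain controllers dominate: 14 DCs weigh as much as 1,400 workstations.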
From what we have seen so far, these equations match well with results we have seen published. Hopefully, this will help you provide more accurate estimates of the database storage sizes required for OpsMgr 2007.
Issue: Cannot delete Agentless Managed Systems from Operations Manager 2007
Symptom: Agentless systems cannot be removed from the Operations Console -> Administration -> Agentless Managed node. Errors are generated in the Operations Manager event log when the delete is attempted.
Error Message:
Data Access Layer rejected retry on SqlError: Request: p_ManagedEntityInsert -- (BaseManagedEntityId=e7b2cdaa-9a6f-1fba-c2ab-e9eb85bddd80), (TypeManagedEntitId=e7b2cdaa-9a6f-1fba-c2ab-e9eb85bddd80), (ManagedTypeId=02c5162c-b89f-1d61-349a-9b0667fbd60e), (FullName=Microsoft.Windows.Server.AD.ConnectionObject.Odyssey.com;ODYSSEY\ASCENSION), (Path=Odyssey.com), Name=(ODYSSEY\ASCENSION), (TopLevelHostEntityId=292f6a2e-6f02-89c0-ba17-2037b6427b04), (DiscoverySourceID=eada0326-173c-cb83-4aed-8845d7f8f650), (HealthServiceEntityID=7edb784f-b59c-e7b0-e82a-cc96cc355c38), (PerformanceHealthServiceCheck=True), (TimeGenerated=5/31/2007 11:31:53 PM), (RETURN_VALUE=1) Class:16 Number:77980008 Message: Health service ( 7EDB784F-B59C-E7B0-E82A-CC96CC355C38 ) should not generate data about this managed object ( E7B2CDAA-9A6F-1FBA-C2AB-E9EB85BDDD80 ).
Cause: Agentless monitoring was configured for a domain controller (to monitor it until its service pack could be updated), and an agent was later installed on that domain controller manually once this became possible.
Resolution: Updated the BaseManagedEntity row for the agentless system that would not delete to IsDeleted = 1, and deleted that system's rows from the DiscoverySourceToTypedManagedEntity table.
Detailed Resolution: The standard caveat applies: this may not be a supported change, but it was the only way we could get the system monitored again with an agent.
From the Operations Manager event log, find the event from the DataAccessLayer source with event ID 3333 and locate the managed object's unique identifier in the message (E7B2CDAA-9A6F-1FBA-C2AB-E9EB85BDDD80 in the error message above).
Log into SQL Server Management Studio and run the following query to confirm that it returns a single row and that the Name field matches the expected agentless system name:
SELECT * FROM dbo.[BasemanagedEntity] where BaseManagedEntityID='<# from event log here>'
Then change the value of the IsDeleted column from 0 (its value in our case, as returned by the query above) to 1.
This causes the system to disappear from the Agentless section.
UPDATE dbo.[BasemanagedEntity]
SET IsDeleted = 1
where BaseManagedEntityID='<# from event log here>'
Remove the rows from the dbo.DiscoverySourceToTypedManagedEntity table that correspond to the identifier from the event log.
To find the rows:
select * from dbo.DiscoverySourceToTypedManagedEntity where TypedManagedEntityID='<# from event log here>'
To delete the discovery source to typed managed entity rows:
delete from dbo.DiscoverySourceToTypedManagedEntity where TypedManagedEntityID='<# from event log here>'
To validate that the rows no longer exist, run the following query. It should not return any data:
select * from dbo.DiscoverySourceToTypedManagedEntity where TypedManagedEntityID='<# from event log here>'
Log back into the Operations Console -> Administration -> Device Management -> Agentless Managed node and refresh. The agentless system that we previously could not delete is no longer listed on this screen.
Approve the manual agent install under Pending Management when it appears (you will probably need to refresh to see it).
Within the Monitoring node, validate that the system is listed and has status for the agent.
Wait for the agentless configuration to reappear in the Operations Console -> Administration -> Device Management -> Agentless management node.
Delete the system from the Agentless Managed node (it works this time!). This removes both the agentless and the agent configuration for the system.
Redeploy the agent using the Discovery wizard, choosing Agent instead of Agentless.
Validate that the system now appears under the Agent managed node and does not appear under either Agentless Managed or Pending Management.
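Copying the GUID from the event message by hand is where mistakes creep in, especially since the message contains several GUIDs. The following is a minimal sketch that pulls the managed object GUID out of the message text and builds the same queries used in the steps above. Both function names are hypothetical, the regex assumes the message wording shown in the error above, and the generated statements are the same possibly unsupported changes described there. Passing the ID through `uuid.UUID` ensures that only a well-formed GUID is ever placed into the SQL.

```python
import re
import uuid

# Matches the GUID in "...should not generate data about this managed object ( GUID )."
_MANAGED_OBJECT = re.compile(
    r"should not generate data about this managed object \(\s*([0-9A-Fa-f-]{36})\s*\)"
)


def managed_object_id(event_message):
    """Extract the managed object GUID from the DataAccessLayer message text."""
    match = _MANAGED_OBJECT.search(event_message)
    if match is None:
        raise ValueError("no managed object GUID found in the message")
    return str(uuid.UUID(match.group(1)))  # validates the GUID and lowercases it


def cleanup_queries(entity_id):
    """Return the queries from the steps above, in order, as (purpose, sql) pairs."""
    entity_id = str(uuid.UUID(entity_id))  # raises ValueError for anything but a GUID
    bme = f"WHERE BaseManagedEntityId = '{entity_id}'"
    ds = f"WHERE TypedManagedEntityId = '{entity_id}'"
    return [
        ("check the row", f"SELECT * FROM dbo.BaseManagedEntity {bme}"),
        ("mark it deleted", f"UPDATE dbo.BaseManagedEntity SET IsDeleted = 1 {bme}"),
        ("find discovery rows", f"SELECT * FROM dbo.DiscoverySourceToTypedManagedEntity {ds}"),
        ("delete discovery rows", f"DELETE FROM dbo.DiscoverySourceToTypedManagedEntity {ds}"),
        ("confirm none remain", f"SELECT * FROM dbo.DiscoverySourceToTypedManagedEntity {ds}"),
    ]


message = ("Message: Health service ( 7EDB784F-B59C-E7B0-E82A-CC96CC355C38 ) should not "
           "generate data about this managed object ( E7B2CDAA-9A6F-1FBA-C2AB-E9EB85BDDD80 ).")
for purpose, sql in cleanup_queries(managed_object_id(message)):
    print(f"-- {purpose}\n{sql}")
```

Run the queries one at a time and check each result (a single matching row first, no rows at the end) before moving on.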
Lesson learned: Don't use Agentless monitoring on a domain controller and then manually install it with an agent :-)