18/10/2009

OpsMgr R2 by Example: the Print Server MP

The Windows Print Server management pack is available as a single download that contains different libraries to monitor Windows Print services on Windows Server 2000, 2003 and 2008 operating systems.

How to Install the Windows Print Server MP

  1. Download the Windows Print Server management pack from the Management Pack Catalog (http://technet.microsoft.com/en-us/opsmgr/cc539535.aspx). The Windows Print Server Management Pack Guide is included in the download and labeled “OM2007_MP_PrintSvr.doc.”
  2. Read the Management Pack guide for additional information, such as the fact that the current version supports monitoring clustered instances of the Print Server role only on Windows Server 2008.
  3. Import the Print Server Management Pack (using either the Operations console or PowerShell) after importing the Windows Server management pack.
  4. Create a PrintServer_Overrides management pack to contain any overrides required for the MP.

The Windows Print Server MP supports agentless monitoring with the exception of tasks.

Windows Print Server MP Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the various print server management packs (these are listed in alphabetical order by Alert name):

Alert: Document failed to print

Issue: Document failed to print, categorized as a critical alert.

Resolution: This should not be a critical alert. Downgraded to a warning level alert through an override in the PrintServer_Overrides management pack.

Alert: Printer: Publish Error

Issue: The print server had a single DNS server defined for its location and the DNS server was down.

Resolution: The DNS server was brought back online and a second DNS server was configured for the system that had reported this error. Manually closed the alert.

Alert: Shared Printer Availability Alert

Issue: An error was generated for each of the shared printers on the server. These were generated by the Shared Printer: Restart the print spooler fix sharing problems and check Group Policy alert.

Resolution: This is an alert, not a monitor, so it does not resolve itself automatically. There does not appear to be a corresponding event number from the same source in the event logs indicating that the print spooler is back online (which may be why this is an alert rather than a monitor).

Per the product knowledge, restarted the spooler service on the affected server when the alert was caused by something other than a reboot.

This also occurred when the print server was rebooted. Verified the ability to print to the printer specified after the reboot completed.

Print Server Management Pack Evolution

While the Windows Server 2008 version of this management pack functions well, the earlier versions of this management pack do not work well on a clustered server (either Windows 2000 or 2003). There are also issues with the approach to discovery on Windows 2000/2003 servers so that many servers are identified as being print servers that may not actually be print servers. The evolution of this set of management packs would be to bring the same functionality now available in the Windows Server 2008 management pack to the Windows 2000 and 2003 versions.

16/10/2009

OpsMgr R2 by Example: the DHCP MP

The Windows DHCP Server management pack is available as a single download that contains different libraries to monitor Windows DHCP on Windows Server 2000, 2003 and 2008 operating systems.

How to Install the DHCP MP

  1. Download the Windows DHCP Server management pack from the Management Pack Catalog (http://technet.microsoft.com/en-us/opsmgr/cc539535.aspx). The Windows DHCP Server Management Pack Guide is included in the download and labeled “OM2007_MP_DHCP_2003_2008 QFE110408.doc.” Read the Management Pack guide, as this covers items to be aware of with the DHCP such as how DHCP Clustering and multicast scopes are not supported.
  2. Import the Windows DHCP Server Management Pack (using either the Operations console or PowerShell).
  3. Create a WindowsDHCP_Overrides management pack to contain any overrides required for the MP.

DHCP MP Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the various DHCP management packs (these are listed in alphabetical order by Alert name):

Alert: DHCP IPv4 Runtime Authorization Needed Alert

Issue: DHCP scopes (both IPv4 and IPv6) were showing up as offline/not authorized. This server had been authorized and then the IP address of the server was changed. The authorization still listed the previous IP address.

Resolution: Unauthorized the current server name/wrong IP address and re-authorized it with the correct IP address. This was occurring on a domain controller in a child domain. To make this change, logged into the root domain and unauthorized/re-authorized the server, and then restarted services on the domain controller after a short period of time.

Alert: DHCP Scope Addresses Available Monitor

Issue: The alert description states that the available scope addresses have fallen below the specified threshold. This is raised by the DHCP Scope Addresses Available Monitor. This monitor goes to warning level when there are fewer than 10 available IP addresses in the pool, and to critical when there are no remaining IP addresses in the pool. This environment originally had one DHCP server that contained the entire scope of available addresses. To provide redundancy, the scope was split between two different DHCP servers. This unfortunately leads to a tendency for the original DHCP server scope to fill while the other scope retains a large number of available addresses in the range.

Resolution: For this environment, it was necessary to match up the two different scopes to determine if the lack of addresses was really a lack of addresses or just one half of the scope filled while the other half remained open. Performed the following actions to make this more readily apparent:

  • Configured the warning states on this monitor to go to yellow when there are less than 2 available IP addresses in the scope.
  • Monitored within the Microsoft Windows DHCP Server -> Scope Health view and ordered by display name to validate that the address range was not critical on both halves of the scope.
  • Used the Microsoft Windows DHCP Server -> DHCP Performance Views -> Scopes & Superscopes -> Scope Free Addresses view sorted by Instance and color coded to match the colors for each half of the scope (so that, as an example, both halves of the Data Network on floor three show up as blue). Added this to the My Workspace view with the Y axis limited to a maximum of 10 (to more easily identify scopes with fewer than 10 available addresses). This is usable but pretty unwieldy with a large number of DHCP scopes.
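The matching of split-scope halves described above can also be sketched in a few lines of Python. This is a minimal sketch and not part of the management pack: the network names, server names, and free-address counts are hypothetical, and the thresholds simply mirror the monitor defaults quoted above (warning below 10 free addresses, critical at 0).

```python
from collections import defaultdict

# Hypothetical data: (logical network, DHCP server, free addresses in that half).
SCOPE_HALVES = [
    ("Floor3-Data", "dhcp01", 0),
    ("Floor3-Data", "dhcp02", 112),
    ("Floor2-Voice", "dhcp01", 3),
    ("Floor2-Voice", "dhcp02", 4),
]


def combined_scope_state(halves, warning_below=10):
    """Sum free addresses per logical network and classify the combined total."""
    totals = defaultdict(int)
    for network, _server, free in halves:
        totals[network] += free

    states = {}
    for network, free in totals.items():
        if free == 0:
            states[network] = "critical"
        elif free < warning_below:
            states[network] = "warning"
        else:
            states[network] = "healthy"
    return states


if __name__ == "__main__":
    for network, state in sorted(combined_scope_state(SCOPE_HALVES).items()):
        print(network, state)
```

In this sample, one half of Floor3-Data is empty but the combined scope still counts as healthy, which is the situation the monitor cannot distinguish on its own.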

Alert: DHCP Service Bound to Static IP Address

Issue: The alert description on the Alert Context tab shows that “The DHCP service is not servicing any clients because none of the active network interfaces have statically configured IP addresses, or there are no active interfaces.”

Resolution: The product knowledge provided an effective resolution for this issue. The DHCP service was not bound to any IP addresses on the system. In this case this DHCP scope was not required and as it was the only DHCP scope on the system, removing the DHCP service from the system was an acceptable solution after deactivating the DHCP scope for a period of time.

Alert: Performance Threshold: Process\Working Set threshold exceeded.

Issue: DHCP management pack error. Occurring sporadically on the DHCP server in the environment, but not seeing any errors or issues as a result of the condition. The rule (Performance Threshold: Process\Working Set threshold exceeded.) is configured to work over a 5 minute interval, and to measure over three samples by default (per the overrides). The management pack says that the utilization is measured over 5 samples. In a 24-hour period, there were approximately 22 of these alerts occurring and self-closing. Increased the number of samples to measure over from 3 to 5 and tracked the result (as these were usually closing automatically within 5-10 minutes). This did not help the issue.

Resolution: In this environment, this is NOT affecting the ability of the DHCP server to function, but could affect it at higher-level values. This value is in place because exceeding this threshold can be an issue, so do not disable or override this rule unless you are sure that it is NOT impacting your environment.

Determined the trend of this value based upon the alerts in the environment (tried tracking it by monitoring the performance counter for this variable, with no luck, as there does not appear to be one). For each alert, went to the alert context tab and recorded the values that appeared (17806131, 17780028, 17809408, 17823061, 17793024, 17788928, 17791658, 17800330, 17786197, 17787562, 17828522, 17772544, 17783466, 17824426, 17829888, 17821696, 17788928, 17870028). Calculated the average, maximum (17870028) and minimum (17772544) values to determine where this threshold should be for the environment. Found that closed alerts were not relevant as they showed values less than the threshold.

Created an override to change this value for this server (on the monitor) from the default of 17830000 to 18070000.
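The arithmetic behind this override can be reproduced with a short Python sketch. The values and the two thresholds are the ones quoted above; the function name and the shape of the summary are only illustrative assumptions.

```python
from statistics import mean

# Working-set values recorded from the alert context tabs, as listed above.
VALUES = [
    17806131, 17780028, 17809408, 17823061, 17793024, 17788928,
    17791658, 17800330, 17786197, 17787562, 17828522, 17772544,
    17783466, 17824426, 17829888, 17821696, 17788928, 17870028,
]

DEFAULT_THRESHOLD = 17830000  # management pack default quoted above
NEW_THRESHOLD = 18070000      # override value quoted above


def summarize(values, threshold):
    """Return count, min, max, mean, and how many values exceed the threshold."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "over_threshold": sum(1 for v in values if v > threshold),
    }


if __name__ == "__main__":
    for label, threshold in (("default", DEFAULT_THRESHOLD), ("override", NEW_THRESHOLD)):
        print(label, summarize(VALUES, threshold))
```

Comparing the maximum recorded value against a candidate threshold is a quick way to confirm that an override leaves headroom without disabling the rule.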

Alert: Script or Executable Failed to run

Issue: Script failure for Nslookuptest.js. The tests against Microsoft.com, the localhost IP address, and the fully qualified name of the server all failed at the same date and time.

Resolution: Noted the alert and the date/time to see if a root cause could be tracked back. Reviewed the event logs on the system to track back potential issues/none found. Reviewed the performance counters gathered by OpsMgr, but no bottlenecks identified during that timeframe. Closed the alerts.
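When all three lookups fail at once, it can help to re-run an equivalent check by hand. The following is a minimal Python sketch of such a check, not the logic of Nslookuptest.js itself (which is not shown in the management pack guide excerpted here); the function names are hypothetical, and the resolver is injectable so the logic can be exercised without network access.

```python
import socket


def check_resolution(names, resolver=socket.gethostbyname):
    """Return {name: address or None}; None marks a failed lookup."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def all_failed(results):
    """True when every lookup failed, which points away from a single bad record."""
    return bool(results) and all(addr is None for addr in results.values())


if __name__ == "__main__":
    # The three targets named in the alert: an external name, localhost, and the server FQDN.
    targets = ["microsoft.com", "127.0.0.1", socket.getfqdn()]
    outcome = check_resolution(targets)
    for name, addr in outcome.items():
        print(name, "->", addr if addr else "FAILED")
    print("all failed:", all_failed(outcome))
```

If every target fails together, the cause is more likely the resolver or the script host than any one DNS record.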

Alert: The DHCP service has determined that it is not authorized on this domain.

Issue: Description says the DHCP/Binl service on the local machine belonging to the windows administrative domain (domain name) has determined that it is authorized to start. It is servicing clients now. This appears as a critical alert, but is actually stating that the DHCP server is working.

Resolution: The only override option for this alert is to disable it, so the criticality of the alert cannot be changed. This should actually be an informational level alert. Currently the only options are to close the alerts or to disable the alert.

Alert: The DHCP Service is not servicing any clients because none of the active network interfaces have statically configured IP addresses, or there are no active interfaces

Issue: DHCP server cannot be a DHCP client.

Resolution: Hard-coded an IP address for the DHCP server.

DHCP Management Pack Evolution

It would be extremely useful if future revisions of this management pack could effectively match scopes (based upon name, or potentially a matching subnet) and gather the information to provide a critical alert when each of the matched scopes was nearing empty.

Another useful enhancement would be to provide the number of available addresses in the range within the alert description text.

15/10/2009

OpsMgr R2 by Example: the Group Policy MP

The Windows Server Group Policy management pack is available as a single download that contains different libraries to monitor Windows Server Group Policy on Server 2003 and 2008 operating systems.

How to Install

  1. Download the Windows Server Group Policy management pack from the Management Pack Catalog (http://technet.microsoft.com/en-us/opsmgr/cc539535.aspx). The Windows Server Group Policy Management Pack Guide is included in the download and labeled “OM2007_MP_GP2008.doc.”
  2. Read the Management Pack guide, which points out solid tips like the installation order (Windows Server management packs, Group Policy 2008 management packs, and then Group Policy 2003 management packs).
  3. Import the Group Policy 2008 Management Pack (using either the Operations console or PowerShell), and then the Group Policy 2003 Management Pack.
  4. Create a GroupPolicy_Overrides management pack to contain any overrides required for the MP.

Agentless monitoring is not supported by the Windows Server Group Policy management pack.

Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the various Group Policy management packs (these are listed in alphabetical order by Alert name):

Alert: Application of Group Policy Alert

Issue: The monitor that raised the alert was the Time Skew Monitor. The computer in question was in the wrong time zone.

Resolution: Changed the time zone on the server reporting the alert.

Alert: Application of Group Policy Alert

Issue: Alert that a user in a different forest than the computer account is logging on and that Group Policy from the other forest is not currently allowed.

Resolution: This environment has two different forests. One of them is a new replacement forest, and group policies are being built for it to replace the group policies used in the original forest. While users will log into the new forest with credentials for the old forest, the old forest group policies should not apply. This will eventually be resolved when the old forest is decommissioned. In the meantime, the monitor was overridden to not be enabled (override, parameter enabled = false) for All objects of type: Group Policy 2008 Runtime.

Alert: Folder Redirection CSE ProcessedWithErrors

Issue: The Group Policy client logged failure event 1085, preceded by event 107 (which showed the user that had the issue). This was occurring on a terminal server (Citrix).

Resolution: User did not have their home folder mapped correctly.

Alert: GPO Data Retrieval Error

Issue: Event log (application) userenv 1058 error on group policy.

Resolution: Found article #828760, which implies ACL issues on SYSVOL with the domain controllers and Service Pack 1. Ran gpupdate /force on the system to see if the event could be recreated. Found that it creates an informational 1704 message in the event log stating that it succeeded. Tested access to this path using the domain name, and using each of the domain controllers that the system should be using to authenticate. There were differences in the dates of the folders indicated within the error message itself. The actual content was consistent, however. This was occurring on the all domain computers policy. There was no indication that this occurred because of a WAN outage.

Alert: GPO Data Retrieval Error

Issue: Every 5 minutes errors were occurring in the application log for Userenv for 1058 and then 1030.

Resolution: Determined that the domain controller had not been patched or rebooted in over six months (checked the system log for the event source of eventlog). Patched and rebooted the DC and the group policy errors stopped occurring.

Alert: Group Policy Preprocessing (Active Directory) Alert

Issue: DNS issues in the environment were preventing name resolution.

Resolution: Fixed the DNS resolution issue so the environment could resolve names.

Alert: Group Policy Preprocessing (Networking) Alert

Issue: This alert occurs when event 1058 is created in the System log for the source of GroupPolicy. This occurs when the system is unable to connect to \\abc.com\SysVol\abc.com\Policies\{guid}\gpt.ini (where abc.com is the domain name and guid is the GUID provided in the alert). Issues like this are caused by network connectivity or network resolution, or FRS latency, or if the DFS client is not running (per the knowledge in the alert). Information on this event is available at http://technet.microsoft.com/en-us/library/cc727259(WS.10).aspx.

Resolution: In this case the error code was 5, which is access denied. From the details on the event, copied the file path and verified that the system could open the file with Notepad. Logged into a server that had the last event in the System log from the source of GroupPolicy, opened a command prompt (run as administrator) and ran gpupdate /force. Verified successful creation of events 1502 and 1503. Verified that the majority of these alerts occurred at the same time. Closed this alert.

Also verified that DNS was providing this information correctly. Opened nslookup and did a resolution for abc.com. Copied the path of the file shown (the gpt.ini file), replaced the abc.com server name with the actual IP address of each domain controller (\\1.1.1.1\SysVol\abc.com\Policies\{guid}\gpt.ini) and verified that each of the domain controllers not only had the gpt.ini file but that it was readable from the path specified.
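Swapping the server name in the gpt.ini path for each domain controller address is easy to get wrong by hand, so here is a minimal Python sketch of that substitution. The function name, the sample path, and the IP addresses are hypothetical; in practice the path would be copied from the event details.

```python
def replace_unc_host(unc_path, new_host):
    r"""Return unc_path with its server component (\\server\...) replaced by new_host."""
    if not unc_path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + unc_path)
    _old_host, sep, rest = unc_path[2:].partition("\\")
    if not sep:
        raise ValueError("UNC path has no share component: " + unc_path)
    return "\\\\" + new_host + "\\" + rest


if __name__ == "__main__":
    # Hypothetical path and addresses for illustration only.
    event_path = r"\\abc.com\SysVol\abc.com\Policies\{guid}\gpt.ini"
    for dc_ip in ["1.1.1.1", "1.1.1.2"]:
        print(replace_unc_host(event_path, dc_ip))
```

Each printed path can then be opened from the affected system to confirm that the file is readable on that specific domain controller.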

Alert: Group Policy Preprocessing (Security) Alert

Issue: This alert appears to occur when there is an inability to resolve DNS from the system identified or group policy fails to apply. It is stating that the specified domain either does not exist or could not be contacted.

Resolution: Researching this alert from the system log event number 1054 found this article from Microsoft: http://technet.microsoft.com/en-us/library/cc727331(WS.10).aspx. After researching this, it appears that an event 1500 that has occurred since the 1054 occurred indicates that group policy is now functional.

Copied the name of the server from the alert detail pane, changed the view to Monitoring -> Computers and pasted the name of the server into the filter. Used the Computer Management Action to connect and remotely review the event logs for the server. Closed the alert after verifying that the 1500 has occurred since the 1054 occurred in the system log where the alert occurred.

Logged into a server which had the last event in the System log from the source of GroupPolicy and opened a command prompt (run-as administrator) and did a gpupdate /force. Verified successful creation of a 1502, and 1503. Closed this alert.
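The manual check described above (has a 1500 been logged since the last 1054?) can be expressed as a small Python sketch. It assumes the events have already been exported as a list of (timestamp, event ID) pairs; the function name and the sample data are hypothetical.

```python
from datetime import datetime

FAILURE_EVENT = 1054  # Group Policy could not contact the domain
SUCCESS_EVENT = 1500  # Group Policy processed successfully


def recovered_since_failure(events):
    """events: list of (timestamp, event_id) pairs.

    Returns True if there is no failure event at all, or if a success
    event was logged later than the most recent failure event.
    """
    last_failure = max(
        (ts for ts, event_id in events if event_id == FAILURE_EVENT),
        default=None,
    )
    if last_failure is None:
        return True
    return any(
        event_id == SUCCESS_EVENT and ts > last_failure
        for ts, event_id in events
    )


if __name__ == "__main__":
    # Hypothetical System log extract for illustration only.
    sample = [
        (datetime(2009, 10, 14, 8, 0), 1500),
        (datetime(2009, 10, 14, 9, 30), 1054),
        (datetime(2009, 10, 14, 11, 15), 1500),
    ]
    print("safe to close alert:", recovered_since_failure(sample))
```

The same comparison is what a two-state monitor would do automatically if the management pack defined a healthy event for this condition.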

Group Policy Management Pack Evolution

The Group Policy File Access Monitor in the Group Policy 2008 management pack version 6.0.6648.0 should be a two-state monitor with a healthy condition on the 1500 event (or 1051 or 1052 or 1053) and a warning or critical condition on the 1058 event. This could be accomplished by creating a new custom monitor and disabling the original monitor included in the management pack.

The Machine Account Determination Monitor in the Group Policy 2008 management pack version 6.0.6648.0 should be a two-state monitor with a healthy condition on the 1500 event (or 1051 or 1052 or 1053) and a warning or critical condition on the 1054 event. This could be accomplished by creating a new custom monitor and disabling the original monitor included in the management pack.

14/10/2009

OpsMgr R2 by Example: The DNS MP

The Windows DNS Server management pack is available as a single download that contains different libraries to monitor Windows DNS on Server 2000, 2003 and 2008 operating systems.

How to Install the DNS MP

  1. Download the Windows Server DNS management pack from the Management Pack Catalog. The Windows Server DNS Management Pack Guide is included in the download and labeled “OM2007_MP_DNS2008_2003.doc.”
  2. Read the Management Pack guide for topics such as configuring the URL for external DNS monitoring, configuring the global zone resolution monitor, and configuring the forwarder availability monitor.
  3. Import the Windows Server DNS management pack (using either the Operations console or PowerShell).
  4. Enable Agent Proxy configuration on all Domain Controllers identified from the groups. This is in the Administration space, under Administration -> Device Management -> Agent Managed. Right-click each domain controller, select Properties, click the Security tab, and then check the box labeled “Allow this agent to act as a proxy and discover managed objects on other computers.” Perform this action for every DNS server, even if the DNS server is added after your initial configuration of OpsMgr.
  5. Create a DNSServer_Overrides management pack to contain any overrides required for the MP.

Agentless monitoring is not supported by the Windows Server DNS management pack.

DNS MP Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the Windows Server DNS management pack (listed in alphabetical order by Alert name):

Alert: Core Service File Writing

Issue: Alert created when adding a new reverse zone.

Resolution: This error will occur once after adding the new reverse zone. Logged into the servers reporting the error and verified that the new zone was created and populated correctly. It can be closed out, and is not an issue unless it recurs.

Alert: Core Service Zone Transfer Error

Issue: Alert created when adding a new reverse zone.

Resolution: This error will occur once after adding the new reverse zone. Logged into the servers reporting the error and verified that the new zone was created and populated correctly. It can be closed out, and is not an issue unless it recurs.

Alert: An exception was thrown while processing GetRelationshipTypesByCriteria for session id

Issue: Check if you have installed DNS MP in RTM version.

Resolution: Upgrade to DNS for 2000/2003/2008 (6.0.6278.27)

Submitted By: ziembor

Alert: DNS 2003 AD DS Load Alert

Issue: Error caused by the conversion of a zone from secondary to AD integrated. This occurred only once on the server as the conversion occurred.

Resolution: Closed the alert.

Alert: DNS 2003 Configure Authoritative Servers Alert

Issue: A secondary zone defined on a server had two different systems that it was configured to request zone transfers from. One of these two systems did not allow zone transfers and was failing and causing this error.

Resolution: Allow zone transfers on the primary DNS server that was not configured to allow zone transfers.

Alert: DNS 2003 Configure Authoritative Servers Alert

Issue: Alert generated by a system that had a secondary copy of the DNS zone. DNS had just been restarted on the server indicated in the alert as having refused the zone transfer.

Resolution: Closed the alert, as this is an expected condition when the DNS zone is down on the server that is configured to allow zone replication.

Alert: DNS 2003 Correct Master Server Problem Alert

Issue: Event 6527 occurred in the DNS event log, indicating that the zone had expired before it could obtain a successful zone transfer or update and that the zone was shut down.

Resolution: Logged into the server and reviewed the event logs, and found an event 3150 showing that a new version of the same zone had since been written. Used nslookup to verify that the server was able to provide resolution for the zone that was listed as shut down. Closed the alert because the monitor did not have an event defined that would move it to a healthy state.

Alert: DNS 2003 delete zone copy alert

Issue: abc.xyz.com zone was previously loaded from a directory partition MicrosoftDNS but another copy was found in the DomainDnsZones. The server will ignore the new copy of the zone. In this case, there was an inconsistency for this zone on the General tab for the DNS zone. Some were configured with the second option (To all DNS servers in the Active Directory domain abc.com) and some were configured for the third option (To all domain controllers in the Active Directory domain abc.com). These are caused by DNS events of 4515 in the DNS event log. Details on this issue are available at http://support.microsoft.com/kb/867464.

Resolution: Convert the current Active Directory integrated zone to a standard primary zone and back up the file. Delete the AD integrated zone and allow the deletion to replicate. After the change has replicated, convert the standard primary zone into an Active Directory integrated zone.

Another option is to use ADSIEdit to remove the partition stored in the MicrosoftDNS section.

Did the first option listed above and then closed the alerts, restarted DNS services on the server that was reporting the warnings to verify that they did not reappear.

Alert: DNS 2003 Resolution Time Alert

Issue: Large numbers of alerts are generated across the environment indicating issues with a test query against 127.0.0.1. Based on the performance counters for these items (highlight the alert, right-click and choose Performance View, select the DNS Server object, Counter Response Time), these are alerting very frequently at values over 5 (the threshold had been overridden from 1, the default in this version of the management pack, 6.0.6480.0). The implication is that this value is 5 seconds, but during testing not a single nslookup query took more than a second. From the All DNS Performance View (Monitoring -> Microsoft Windows DNS Server -> Performance -> All DNS Performance Data) it is apparent that for this environment 90% of the resolutions occur at a value of less than 20.

Resolution: Changed alert threshold to 20 seconds. See Kevin Holman’s blog for additional details at http://blogs.technet.com/kevinholman/archive/2009/02/24/dns-mp-noisy-resolution-time-alerts-and-how-to-deal-with-them.aspx
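A threshold chosen from "90% of resolutions fall below X" is a percentile calculation, which can be sketched in Python as follows. The sample values are hypothetical and are expressed in whatever units the Response Time counter reports; the nearest-rank method used here is one common definition of a percentile, not necessarily the one the console views use.

```python
def nearest_rank_percentile(samples, pct):
    """Return the smallest sample such that at least pct percent of samples are <= it.

    pct is an integer from 1 to 100 (nearest-rank method).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical Response Time counter samples for illustration only.
    response_times = [1, 2, 2, 3, 4, 5, 6, 9, 18, 42]
    print("90th percentile:", nearest_rank_percentile(response_times, 90))
```

Setting the alert threshold just above the chosen percentile keeps the monitor useful for real outliers while suppressing routine noise.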

Alert: DNS 2003 Server External Addresses Resolution Alert

Issue: The rule performs a DNS query of type “NS” (as provided in the Query Type parameter), which means the query searches for the name servers of the domain provided in the Host parameter. The problem here is that the domain name provided is “www.microsoft.com”. Since this is a host name rather than a domain, the query returns a referral rather than a list of DNS servers. This results in the error message referenced above.

Resolution: You can fix the error in one of two ways (pick one, not both):

  • Set the Host parameter to “microsoft.com” (without the quotes). Then the query returns a list of DNS servers for the microsoft.com domain OR
  • Set the Query Type parameter to “A”. Then the query returns the IP address(es) for www.microsoft.com

Alert: DNS 2008 Correct the Configuration File Alert

Issue: Removal of the cache.dns file was taking place as part of the process to remove the root hints for this server.

Resolution: Closed the alerts, since this was an expected situation as part of the process to remove the root hints for the DNS server.

Alert: DNS 2008 Correct Master Server Problem Alert

Issue: The alert context screen provided additional information in the description field that specified that the: Zone (zonename) expired before it could obtain a successful zone transfer or update from a master server acting as its source for the zone. The zone has been shut down. This came from an eventid of 6527. This occurred on an active directory integrated stub zone.

Resolution: Logged into the server reporting the problem and verified that the zone did exist and was populated with what appears to be valid information. This was caused by the removal of a DNS zone from the master server that was defined for the zone. While investigating this, found there was a single master server defined for the zone. Added a second master server to provide additional redundancy and avoid issues with communicating with a single master server.

Alert: DNS 2008 Forwarder Availability Alert

Issue: DNS forwarders for the systems existed in another physical location and network connectivity was lost between the locations. This is identified by the DNS 2008 Forwarder Availability Monitor, which executes every 900 seconds (15 minutes).

Resolution: Specified a DNS forwarder in the same site as the system that was reporting the forwarder availability alert. For another time that this occurred, saw the alert and re-tested the forwarder configuration but it was no longer erroring out. After 15 minutes, OpsMgr automatically closed the alert.

Alert: DNS 2008 Forwarder Availability Alert

Issue: The DNS server was configured to conditionally forward resolutions to other DNS servers in other forests. However, the remote server could not be reached over UDP port 53, so this alert was occurring.

Resolution: Worked with the firewall team to open UDP port 53 from the DNS server to the DNS server receiving the forward zone lookups.

Alert: DNS 2008 Monitor Zone Resolution Alert

Issue: The specific reverse lookup zone that was creating the alert had been deleted.

Resolution: Manually closed the alerts.

Alert: DNS 2008 Free Memory or other System Resources Alert

Issue: Removal of the cache.dns file taking place as part of the process to remove the root hints for this server.

Resolution: Closed the alerts, as this was an expected situation as part of the process to remove the root hints for the DNS server.

Alert: DNS 2008 Free Memory Or Other System Resources Alert

Issue: This error occurred along with a large number of other Active Directory and DNS related alerts. This one, however, was the key to identifying the core issue that was occurring. After logging into the system, verified that the server was unable to see its own file shares, including \\localhost and \\{ip address}. The alert description field said, “The DNS server could not bind a Transmission Control Protocol (TCP) socket to address 0.0.0.0. The event data is the error code. An IP address of 0.0.0.0 can indicate a valid “any address” configuration in which all configured IP addresses on the computer are available for use.

Restart the DNS server or reboot the computer.”

Rebooting the server with this issue would temporarily resolve the issue.

Resolution: Tracked this down eventually to a Microsoft hotfix #961775, which is required for multiple processor systems running Windows Server 2008 (or Vista) with Anti-Virus software installed.

Alert: DNS 2008 Monitor Zone Resolution Alert

Issue: Occurs for some Active Directory-integrated stub zones hosted on the server whenever the server is rebooted. This does not appear to occur for either regular stub zones or Active Directory-integrated primary zones.

Resolution: Alerts automatically closed when the server was fully back online.

Alert: DNS 2008 Resolution Time Alert

Issue: The DNS 2008 response time monitor checks for the speed of DNS resolutions every 15 minutes. If the response time is greater than 1 second, it generates an alert. The server responded to the DNS query in 1.061 seconds.

Resolution: Tracked the performance of this counter (object = DNS Server, Counter = Response Time), available by right-clicking on the alert and opening the performance view then setting the Look For to select Items by Text Search and typing in Response. This counter tracked between 0-10 seconds over a seven-day timeframe. The environment being tested is a brand new environment with no user load currently. Created an override for all DNS Servers to increase the ThresholdsSeconds counter from 1 second to 20 seconds and stored it in the management pack created to store the DNS overrides (MicrosoftWindowsDNS2008Server_Overrides). This now matches the override created for the same alert in the DNS 2003 management pack (DNS 2003 Resolution Time Alert). Kevin Holman discusses this in more detail at http://blogs.technet.com/kevinholman/archive/2009/02/24/dns-mp-noisy-resolution-time-alerts-and-how-to-deal-with-them.aspx.

Alert: DNS 2008 Server External Addresses Resolution Alert

Issue: The firewall product was blocking external connectivity to the forwarders that were defined for the DNS server.

Resolution: Removed the firewall restriction that was blocking the IPs defined as forwarders for the DNS server.

Alert: DNS 2008 Troubleshoot AD DS And Restart DNS Server Alert

Issue: DNS was not functional until the domain controller was back online. This domain controller running DNS had been rebooted and this warning was reported.

Resolution: Closed the error after verifying via nslookup that DNS was working. This monitor (DNS 2008 Troubleshoot AD DS and restart the DNS service Server Service monitor) does not appear to return to green state automatically.

Alert: DNS 2008 Check Zone File Alert

Issue: Removal of the cache.dns file was taking place as part of the process to remove the root hints for this server.

Resolution: Closed the alerts as this was an expected situation as part of the process to remove the root hints for the DNS server.

Alert: DNS 2008 Zone Not Running Alert

Issue: Occurs whenever the server is rebooted, once for each Active Directory stub zone hosted on the server. This does not appear to occur for either regular stub zones or Active Directory-integrated primary zones.

Resolution: Alerts automatically closed when the server was fully back online.

Alert: Resolution Time Alert

Issue: The DNS 2008 response time monitor checks for the speed of DNS resolutions every 15 minutes. If the response time is greater than 1 second, it generates an alert. The server responded to the DNS query in 1.061 seconds.

Resolution: Tracked the performance of this counter (object = DNS Server, counter = Response Time). The counter is available by right-clicking the alert, opening the performance view, setting Look For to Items by Text Search, and typing in Response. This counter tracked between 0 and 3.5 seconds over a seven-day timeframe. The environment being tested is a brand new environment with no user load currently. Created an override for all DNS Servers to increase the ThresholdsSeconds counter from 1 second to 5 seconds and stored it in the management pack created to store the DNS overrides (MicrosoftWindowsDNS2008Server_Overrides).

Alert: Script or Executable Failed to run

Issue: For the script DNS2008ComponentDiscovery.vbs.

Resolution: Requires the DNS server(s) to have agent proxy configured (set in the OpsMgr Console -> Device Management -> Agent Managed -> Properties of the system, check the box on the Security tab).

DNS Management Pack Evolution

The default settings for DNS response time should most likely be increased from 1 second to more like 20 seconds due to average DNS response times seen in various environments.

Additionally, the ability to compare which zones exist on each DNS server, and to report inconsistencies between servers, would be very useful when attempting to debug why name resolution is inconsistent in an environment.
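To make the zone-comparison idea concrete, here is a minimal sketch of the logic, assuming the list of zones per server has already been collected by some other means. The function name, server names, and zone names are hypothetical; nothing like this exists in the management pack today.

```python
def zone_inconsistencies(zones_by_server):
    """Map each server to the zones it lacks that some other server hosts."""
    all_zones = set()
    for zones in zones_by_server.values():
        all_zones |= set(zones)
    missing = {}
    for server, zones in zones_by_server.items():
        gap = all_zones - set(zones)
        if gap:
            missing[server] = sorted(gap)
    return missing


# Hypothetical inventory; server and zone names are made up.
inventory = {
    "dns1": {"abcco.com", "_msdcs.abcco.com"},
    "dns2": {"abcco.com"},
}
print(zone_inconsistencies(inventory))  # {'dns2': ['_msdcs.abcco.com']}
```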

12/10/2009

OpsMgr R2 by Example: The Operations Manager Management Pack

The Operations Manager R2 management pack is automatically installed when you install System Center Operations Manager.

How to Install the Operations Manager MP

  1. Download the Operations Manager management pack guide from the Management Pack Catalog (http://technet.microsoft.com/en-us/opsmgr/cc539535.aspx). The guide is named “Operations Manager 2007 R2 Management Pack Guide.doc.”
  2. Read the Management Pack guide as this covers items such as how to enable recovery for Health Service Heartbeats and creating Run As accounts.
  3. Create an OperationsManager_Overrides management pack to contain any overrides required for the MP.

Operations Manager MP Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the Operations Manager management pack (these are listed in alphabetical order by Alert name):

Alert: AD Agent Assignment: Admins User Role needs at least one domain account

Issue: AD integrated agent deployment was not functional. The OpsMgr service was installed and running on the agent but would not show up in the pending management folder.

Resolution: Per product knowledge, “Add the security group, which was provided as parameter to MOMADAdmin.exe to the Operations Manager Administrators User Role.” In the OpsMgr console -> Administration -> Security -> User Roles -> Operations Manager Administrators, added the group specified when using the MomADAdmin.exe program to configure AD Integration. Closed the alert and restarted services on the server where AD agent assignment was failing (no change). Restarted the three services (in SP1 these were OpsMgr Config Service, OpsMgr Health Service, and OpsMgr SDK Service; in R2 they are now System Center Data Access, System Center Management, and System Center Management Configuration) and the agent appeared in the OpsMgr console -> Administration -> Device Management -> Pending Management as expected.

Alert: Agent proxying needs to be enabled for a health service to submit discovery data about other computers.

Issue: The agent specified in the alert description does not have agent proxy enabled.

Resolution: Found the name of the system within the alert description field (dc.abcco.com), copied the server name, opened the Administration node -> Device Management -> Agent Managed, and filtered on the name of the server (pasted in). Right-click the server, open Properties, and go to the Security tab. Check the Allow this agent to act as a proxy and discover managed objects on other computers checkbox. This is an alert rule, so it will not auto-close; manually closed the alert in the Monitoring section of the OpsMgr console.

Alert: Backward Compatibility Script Error

Issue: MOM Backward Compatibility Service State Monitoring Script on line # 71.

Resolution: This is a bug in WMI. The BlackBerry MDS Connection Service has a very long ImagePath registry entry. When the health service script runs Select DisplayName, State, Name, StartMode, StartName FROM Win32_Service, a null is returned for the StartName because the buffer allocated for the results is too small, and the call fails. This can be verified using wbemtest. Connect to the root\cimv2 namespace and run the following query:

Select DisplayName, State, Name, StartMode, StartName FROM Win32_Service

In the results, scroll down to BlackBerry MDS Connection Service and double-click the row to view the details. As you can see in Properties, the StartName is null.
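If there are many services to look through, the same check can be expressed as a small filter. The sketch below assumes the query results have already been pulled into a list of dictionaries (how they are fetched is left out), and the sample rows are made up apart from the BlackBerry service name.

```python
def services_missing_start_name(rows):
    """Return display names of services whose StartName came back empty.

    `rows` is a list of dicts shaped like the Win32_Service query
    results; fetching them from WMI is outside this sketch.
    """
    return [row["DisplayName"] for row in rows if not row.get("StartName")]


# Hypothetical sample rows; only the BlackBerry service name comes from the post.
sample = [
    {"DisplayName": "BlackBerry MDS Connection Service", "StartName": None},
    {"DisplayName": "Print Spooler", "StartName": "LocalSystem"},
]
print(services_missing_start_name(sample))
```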

The problem is described at http://groups.google.co.uk/group/microsoft.public.win32.programmer.wmi/browse_thread/thread/4cef045b79c1b5cb/1ee2b09a1fa130ab?lnk=st&q=win32_service.startname+is+null&rnum=1&hl=en#1ee2b09a1fa130ab. Tried to obtain the fix mentioned but MS support said that the Bug ID did not exist.

The workaround is to change the path so that it uses the short (8.3) folder names, e.g.

Original Key:

"C:\Program Files\Research In Motion\BlackBerry Enterprise Server\MDS\bin\bmds.exe" -s jvmpath="C:\Program Files\Java\jre1.5.0_11\bin\client\jvm.dll" -XX:+DisableExplicitGC -Xss64K -Xmx768M -Xms128M classpathdir="C:\Program Files\Research In Motion\BlackBerry Enterprise Server\MDS\classpath" wrkdir="C:\Program Files\Research In Motion\BlackBerry Enterprise Server\MDS\Servers\BES1" webserverdir="C:\Program Files\Research In Motion\BlackBerry Enterprise Server\MDS\webserver" -rbes "BES1_MDS-CS_1"

New Key:

"C:\PROGRA~1\RESEAR~1\BLACKB~1\MDS\bin\bmds.exe" -s jvmpath="C:\Program Files\Java\jre1.5.0_11\bin\client\jvm.dll" -XX:+DisableExplicitGC -Xss64K -Xmx768M -Xms128M classpathdir="C:\PROGRA~1\RESEAR~1\BLACKB~1\MDS\CLASSP~1" wrkdir="C:\PROGRA~1\RESEAR~1\BLACKB~1\MDS\Servers\BES1" webserverdir="C:\PROGRA~1\RESEAR~1\BLACKB~1\MDS\WEBSER~1" -rbes "BES1_MDS-CS_1"

Restart the service and rerun the query in WBEMTest. With the shorter path, the server now returns the correct username.
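The edit itself is just a prefix substitution, which the sketch below shows on an abbreviated version of the key. The function name is made up for this example, and the short folder names are the ones from this system; on another server they may differ (dir /x shows the actual 8.3 names), so treat them as examples only.

```python
def shorten_prefix(command_line, long_prefix, short_prefix):
    """Replace every occurrence of a long folder prefix with its 8.3 form."""
    return command_line.replace(long_prefix, short_prefix)


# Prefixes taken from the keys above; short names on another
# system may differ, so verify them before editing the registry.
LONG = r"C:\Program Files\Research In Motion\BlackBerry Enterprise Server"
SHORT = r"C:\PROGRA~1\RESEAR~1\BLACKB~1"

# Abbreviated command line, not the full ImagePath value.
original = '"' + LONG + r'\MDS\bin\bmds.exe" -s wrkdir="' + LONG + r'\MDS\Servers\BES1"'
shortened = shorten_prefix(original, LONG, SHORT)
print(len(original), "->", len(shortened))
print(shortened)
```

Note that this only shortens the shared prefix. The new key above also shortens the classpath and webserver folder names, which would need their own entries.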

It would be preferable if the problem was fixed properly, but the workaround does not seem to cause any adverse effects.

UPDATE: Found this on another system with a different type of service. The start name was null, and the service would not start when a start was attempted. Used sc delete to remove the service, rebooted the system, and it worked like a champ.

Alert: Check the application's security policy

Issue: Two management servers were added into an environment where AD Integration was configured. This alert occurred on both systems when RMS’s OpsMgr Health Service was restarted.

Resolution: Gave the same access rights to the new management servers as had been given to the RMS by adding the computer accounts into the MOMADSecurityGroup created as part of the process to configure AD Integration in OpsMgr. Once this was done, it was verified by checking in Active Directory Users and Computers (View, Advanced Features) and validating that in the OperationsManager container under the name of the management group that the additional management servers had records defined for them.

Alert: Connection Timeout

Issue: On a TCP Port monitor, two alerts are generated when the system cannot be communicated with. The first is a Connection Timeout, and the second is a <Servername> Group Roll-Up Monitor. The server in question was being monitored via a TCP Port monitor to provide rudimentary monitoring through monitoring the RDP port (3389).

Resolution: The system in question was offline and needed to be brought back online, so the monitor functioned as expected.
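For a quick manual version of the same check, a few lines of Python will do. This is a minimal sketch of a TCP connect test, not how the OpsMgr TCP Port monitor is implemented, and the function name and default timeout are assumptions.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Calling port_is_open("server01", 3389), where server01 is a placeholder name, returns False for a system that is offline, which is the condition that raised both alerts here.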

Alert: Failed Agent Push/Repair - Remote Agent Management operation failed

Issue: Failed attempting to push the agent to the system.

Resolution: Logged into the system and manually installed the agent.

Alert: Data Warehouse configuration synchronization process failed to write data

Issue: After importing a large number of management pack files, the data warehouse started reporting issues. The Health Explorer listed event number 31552, stating that the data failed to store in the data warehouse due to a SQLException (Timeout expired).

Resolution: On the data warehouse server, used sp_updatestats to update the OperationsManagerDW database per notes in the newsgroups from Vitaly. The alerts were automatically closed after this action was performed.

Alert: Data Warehouse failed to deploy reports for a management pack to SQL Reporting Services Server

Issue: The DNS management pack can cause issues in the environment resulting in event ID 26319 from the OpsMgr SDK Service (System Center Data Access is the new service name in R2).

Resolution: Add the account designated as the Data Reader account to the group designated as Operations Manager Administrators during setup (this group is added to the Operations Manager Administrators role). This issue only exists with the DNS Management Pack (version 6.0.5000.0) and no other management packs.

Alert: Data Warehouse failed to request a list of management packs from SQL RS server

Issue: The data warehouse reporting server was being rebooted.

Resolution: Once the reporting server was back online, this alert auto-resolved itself.

Alert: Data Warehouse managed object type synchronization process failed to write data

Issue: After importing a large number of management pack files, the data warehouse started reporting issues. The health explorer listed an event number 31554 on the workflow Microsoft.SystemCenter.DataWarehouse.Synchronization.TypedManagedEntity.

Resolution: On the data warehouse server, used sp_updatestats to update the OperationsManagerDW database per notes in the newsgroups from Vitaly. The alerts were automatically closed after this action was performed.

Alert: Failed to Check for Password Expiration on RunAs Account

Issue: Operations Manager is unable to monitor Run As accounts for account and password expiration for the server specified.

Resolution: There was an error on the account (Administration -> Security -> Run As Profiles). In this case, the domain name had a typo on it.

Alert: Failed to send notification using server/device

Issue: Issues providing notification via Instant Messaging.

Resolution: The Instant Messaging configuration defaulted to port 5060, but the IM server itself was configured to use port 5061. Tested connectivity from the OpsMgr server to the LCS server with telnet <ServerName> and the server did answer. Configured a Run As Account for Notification Account for the OpsMgr server using the same account specified in the Notification settings. Tried logging in to LCS using the account configured for Instant Messaging and sent a test IM message. Does Communicator need to be installed on the OpsMgr box? (Installed to test it, logged into the account that SIP was going to use).

Alert: Failed to send notification

Issue: Notification in OpsMgr was configured for a single SMTP server. When this server was offline, these alerts occurred (logically).

Resolution: Defined additional SMTP servers to provide failover in case of loss of the primary SMTP server system. Used the Alert Forwarding MP to validate connectivity to each SMTP server (discussed at http://cameronfuller.spaces.live.com/blog/cns!A231E4EB0417CB76!1737.entry).
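The failover idea is simply "try each server in order until one accepts the message." The sketch below shows that logic in isolation; the function name and the caller-supplied send function are assumptions for illustration, and this is not how OpsMgr implements its notification channel.

```python
def send_with_failover(servers, send):
    """Try each SMTP server in order; return the first that accepts the message.

    `send` is a caller-supplied function that raises OSError on failure,
    for example a thin wrapper around smtplib (an assumption here).
    Returns the server that worked and the errors seen along the way.
    """
    errors = {}
    for server in servers:
        try:
            send(server)
            return server, errors
        except OSError as exc:
            errors[server] = str(exc)
    raise ConnectionError(f"all SMTP servers failed: {errors}")
```

With a single server in the list, any outage ends in the exception, which is the situation that produced this alert.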

Alert: Failed to send notification using server/device

Issue: Email was being sent to a remote email environment and communication was lost between the environments.

Resolution: When communication between the environments was restored, notification began to function again. Closed the alert, as it did not recur after communication was re-established.

Alert: Failed to send notification using server/device

Issue: Notification in OpsMgr was configured for a single SMTP server. When this server was offline, these alerts occurred (logically).

Resolution: Defined additional SMTP servers to provide failover in case of loss of the primary SMTP server system. Used the Alert Forwarding MP to validate connectivity to each SMTP server (discussed at http://cameronfuller.spaces.live.com/blog/cns!A231E4EB0417CB76!1737.entry).

Alert: Failed to send notification using server/device.

Issue: Blocked on Exchange 2007 (see http://msexchangeteam.com/archive/2006/12/28/432013.aspx). The server that was being pointed to did not respond on port 25, as the system was a mailbox server, not a client access server. Notification failed later due to security issues from an anonymous connection (the default configuration).

Resolution: Re-configured OpsMgr to use the client access server that did respond on port 25. Configured the notification to use Windows Integrated authentication. Configured a Run As Account and configured the Run As Profile for the Notification Account for the management server to use the account which was created.

Alert: Failed to send notification using server/device

Issue: The RMS lost communication with the various SMTP servers that were defined. Once the network communication was back online, notifications were able to be sent.

Resolution: Lowered the priority of this alert to warning, as there is already a critical alert (“Failed to send notification”) which appears to occur when not all SMTP servers can be communicated with.

Alert: Health service heartbeat failure

Issue: The OpsMgr health service on the agent was stopped. Another potential cause is if the OpsMgr health service on the agent was running but unable to communicate with the OpsMgr management server.

Resolution: Restarted the OpsMgr agent with Computer Management through the Actions pane. For the unable to communicate issue, the server was running a security application that restricted network traffic and blocked the network traffic from the server to the OpsMgr management server via port 5723.

Alert: OleDB: Results Error

Issue: Network communication between the RMS and the Operations Manager database was interrupted. The alert rule which generates this critical alert is “OleDbProbe: Results Error.”

A good discussion on these types of alerts is available at http://blogs.technet.com/jonathanalmquist/archive/2008/07/29/oledb-results-error.aspx.

Resolution: In this case, once network connectivity was re-established between the RMS and OpsMgr database the alert was no longer relevant and was manually closed. Created an override to disable this alert for the RMS that was reporting these occasionally per the link listed in the issue section of this alert.

Alert: Ops DB Free Space Low

Issue: The Operations Database for OpsMgr 2007 has less than 40% free space available.

Resolution: OperationsManager database was not large enough to provide seven day (default) retention for the number of agents being monitored. Increased the size of the database using the SQL Server Management Studio (install it on the system running the OpsMgr console for ease of use). Connect to the server running the OpsMgr Operational database (shown in the alert), open the server/databases, right-click on the OperationsManager database (default name) and click Properties. Click on the Files tab, change the MOM_DATA size to the new size and click OK. You can validate the change in size occurred by going back to the properties of the database. The alert will resolve itself in Operations Manager in approximately 15 minutes if enough free space is available, as this monitor is defined to a 900-second frequency.
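To decide how large to make the data file, the arithmetic is simple: the new size must leave at least 40% free after the space already in use. The sketch below works that through with made-up figures; the function names and the sample sizes are assumptions, not values from this environment.

```python
def free_percent(size_mb, used_mb):
    """Percentage of the data file that is still free."""
    return 100.0 * (size_mb - used_mb) / size_mb


def size_for_free_percent(used_mb, target_free_percent):
    """Smallest size (MB) that leaves target_free_percent free."""
    return used_mb / (1 - target_free_percent / 100.0)


# Hypothetical figures: a 10,000 MB data file with 7,000 MB in use.
print(free_percent(10000, 7000))        # 30.0, below the 40% threshold
print(size_for_free_percent(7000, 40))  # about 11667 MB
```

In practice it makes sense to size well above the minimum so that the alert does not return as soon as the database grows again.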

Alert: Recipient address is not valid.

Issue: Recipient address is not valid for notification. Email was sent to remote email environment and communication was lost between the environments.

Resolution: When communication between the environments was restored, notification began to function again. Closed the alert as it did not recur once communication was re-established.

Alert: Root Management Server Unavailable.

Issue: The alert was occurring, but the OpsMgr Health Service was running on the RMS server. The alert description said ‘The root management server (HealthService) is running but has reported limited functionality soon after (date/time). The specific reason code is 49 and description is “The health manager has detected that entity state collection has stalled.”’ This happened immediately after installing the reporting server into the OpsMgr environment.

Resolution: Restarted the OpsMgr Health Service on RMS system and the alert closed.

Alert: Root Management Server Unavailable

Issue: The following alert randomly recurs on an RMS with no related alerts and with no apparent cause:

The root management server (Healthservice) has stopped heartbeating soon after (date and time). This adversely affects all availability calculation for the entire management group.

Resolution: If the alert truly has no discernible root cause, then the Root Health Service Watcher should be tweaked to allow for a greater variance in the heartbeat interval by adding a DWORD value named MinutesToWaitBeforeAlerting to the following registry key and setting it to 5:

HKEY_LOCAL_MACHINE\Software\Microsoft\Microsoft Operations Manager\3.0\SDK Service\RHS Watcher

Restart the Health, SDK, and Config services on the RMS after this change.

Submitted By: Jason Sandys
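If the change needs to go onto more than one RMS, it can help to generate a .reg file instead of editing by hand. The sketch below only builds the file text from the key, value name, and value given above; the helper name is made up, and the output should be reviewed before importing it anywhere.

```python
def render_reg_dword(key_path, value_name, value):
    """Return .reg file text that sets one DWORD value."""
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{key_path}]\n"
        f'"{value_name}"=dword:{value:08x}\n'
    )


# Key and value name as given in the resolution above.
KEY = (r"HKEY_LOCAL_MACHINE\Software\Microsoft"
       r"\Microsoft Operations Manager\3.0\SDK Service\RHS Watcher")
print(render_reg_dword(KEY, "MinutesToWaitBeforeAlerting", 5))
```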

Alert: RunAs Logon Type Check Failed

Issue: The RunAs account failed to log on interactively. The RunAs account needs to have the Log on interactively right.

Resolution: Gave log on interactively rights to the user created for the RunAs account, in this case through providing administrator access to the system in question.

Alert: RunAs Successful Logon Check Failed

Issue: Domain controllers for the domain where the SQL RS server existed were offline.

Resolution: Brought back online the domain controllers for the domain, and this alert auto-resolved itself.

Alert: RunAs Successful Logon Check Failed

Issue: One or more RunAs accounts failed to log on. The account may be disabled or has an expired password.

Resolution: Gave log on interactively rights to the user created for the RunAs account, in this case through providing administrator access to the system in question.

Alert: Script or Executable Failed to run

Issue: Scripts not running on agentless managed system that is a NAS not an actual server. This occurs on both CPU Utilization and Memory Utilization.

Resolution: If the NAS was going to be monitored agentless, the only option was to disable the alert for the RMS.

Alert: Script or Executable Failed to run

Issue: Lots of Script or Executable Failed to run errors on the same system, all failing at the same time (in this case about a half-dozen or more would all fail with a 21402 timeout).

Resolution: WMI was non-functional on the system (stuck at 100% utilization on one processor). Stopped the WMI service; when that failed, killed the process and restarted the WMI service.

Alert: Script or Executable Failed to run

Issue: Script failure for Nslookuptest.js. Reporting for tests to Microsoft.com, localhost IP address, and the fully qualified name of the server all three failed at the same date and time.

Resolution: Noted the alert and the date/time to see if a root cause could be tracked back. Reviewed the event logs on the system for potential issues (none found). Reviewed the performance counters gathered by OpsMgr, but no bottlenecks were identified during that timeframe. Closed the alerts.

Alert: Script or Executable Failed to run

Issue: The process exited with 0 Command executed: C:\Windows\system32\cscript.exe /nologo IsHostingMSCS.vbs.

Resolution: Occurred immediately after deploying the agent onto a new server. From the newsgroups this can occur when discovery has not yet finished (written by Rob Kuehfus). Closed the alert to see if it would recur on this system.

Alert: Script or Executable failed to start

Issue: Paging file is too small.

Resolution: Needed to add memory to the system.

09/10/2009

OpsMgr R2 by Example: The Windows Server Management Pack

The Windows Server management pack is available as a single download that contains different libraries to monitor Windows Server 2000, 2003 and 2008 Operating Systems.

How to Install the Windows Server MP

  1. Download the Windows Server Operating System management pack from the Management Pack Catalog. The Windows Server Operating System Management Pack Guide is included in the download and labeled “OM2007_MP_WinSerBas.doc.”
  2. Read the Management Pack guide – for things like how to activate monitoring for physical disks and disk partitions.
  3. Import the Windows Server Operating System Management Pack (using either the Operations console or PowerShell).
  4. Create a WindowsServer_Overrides management pack to contain any overrides required for the MP.

Windows Server MP Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the various Windows Server management packs (these are listed in alphabetical order by Alert name):

Alert: Disk transfer (reads and writes) latency is too high

Issue: This monitor checks for high values on the performance counter every 60 seconds over a 5-minute timeframe.

Resolution: Determined that spikes were occurring on a specific drive on the system. The drive needs to either be replaced with a higher speed drive, or some of the uses of this drive should be moved to another physical drive.

Alert: Event log is full

Issue: Alert generated by the Windows Server 2003 management pack from the Event Log File is Full alert rule. The alert description contains information about which event log is full (in this case it was the PowerShell log file).

Resolution: Logged into the server and verified that the log size was set to a maximum of 512KB and to overwrite events older than 7 days. Re-configured to increase the size to 2048KB and to overwrite events as needed. Closed the alert.

Alert: Logical Disk Free Space is Low

Issue: Low disk space on the drive identified in the alert.

Resolution: Can either free up disk space on the drive or configure an override for the drive to change the monitoring configurations for the drive. You can configure the overrides for the system drives or non-system drives. For this configuration, there is a C drive, D drive, and Q drive. The Q drive was critical, and free space could not be made available on the drive. The only options available without modifications to the script (which are not viable in sealed management packs) are to set an override for non-system drives and set it to a level where the Q drive is no longer critical. This means that the levels for the D drive on the same system will not fire until it hits the new critical levels. The other option is to acknowledge the alert and not to resolve it at this point. The script that does this check is called FreeSpace.vbs and automatically distributed into a temporary directory located under %ProgramFiles%\System Center Operations Manager 2007\Health Service State.
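The trade-off with a single non-system override is easier to see with numbers. The sketch below is a simplified model that uses percentage thresholds only; it is not the logic in FreeSpace.vbs, and the readings and threshold values are hypothetical.

```python
def drive_state(free_percent, warning_percent, critical_percent):
    """Classify a drive by comparing free space to two thresholds."""
    if free_percent <= critical_percent:
        return "critical"
    if free_percent <= warning_percent:
        return "warning"
    return "healthy"


# Hypothetical free-space readings for the two non-system drives, in percent.
drives = {"D": 6.0, "Q": 3.0}

# Each pair is (warning, critical); both values are made up for illustration.
for label, (warning, critical) in {"default": (10, 5), "override": (4, 2)}.items():
    states = {d: drive_state(free, warning, critical) for d, free in drives.items()}
    print(label, states)
```

With the default pair, D is in warning and Q is critical. With the override, Q drops to warning, but D now reports healthy, because the same non-system thresholds apply to both drives.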

Alert: Network Interface failed.

Issue: Network interface on a system was no longer online.

Resolution: The system in question had been accidentally unplugged from the network. Closed the alert after the network interface was online.

Alert: The device has a bad block.

Issue: Bad block on the drive on the system.

Resolution: Ran chkdsk /F to scan for bad blocks, which required a reboot because the bad block was identified on the boot partition.

Alert: The event log file is full. New event instances will be discarded.

Issue: The event log was set to overwrite events older than 7 days.

Resolution: Increased the event log size from 512KB to 2048KB, and set to overwrite events as needed.

Alert: The service terminated unexpectedly.

Issue: The service identified in the alert failed.

Resolution: Verified that the server can be pinged using the tasks on the right, and using the Computer Management task verified the service was in a started state. Closed the alert after placing information in the company knowledge to track this for a pattern to see what is causing the service to fail. In one case the service was actually down, used the Computer Management task to restart the service.

Alert: The share configuration was invalid. The share is unavailable.

Issue: The share within the alert was a user share on a system.

Resolution: Determined the user did still exist in Active Directory (AD Users and computers, validated that the user name was the same). Recreated the user folder per the product knowledge. If the user no longer existed, the share would have been removed using the net share /delete option presented in the product knowledge.

Alert: Too many requests for performance counter data have timed out

Issue: In this environment, this only seems to occur with Windows 2000 systems running Diskeeper. Diskeeper started at just after 9 pm, and then there is an alert just after 10:15 pm (perflib event id 1015 in the application log for the PerfDisk performance data counter), and Diskeeper completes its running just after this event is logged.

Resolution: Disabled this alert for the specific servers that are Windows 2000 systems running Diskeeper. Stored the override in the MicrosoftWindowsServer_Overrides management pack kept for overrides on the various Operating System related management packs. If there were a large enough number of systems, it would be recommended instead to upgrade the version of Diskeeper (or the operating systems).

Alert: Total CPU Utilization Percentage is too high

Issue: Most likely, the processor on the system is currently over-utilized and indicating a bottleneck condition. Common potential causes for this include:

  • Misconfigured anti-virus can cause high processor utilization if files which should be excluded from scanning are not (such as for Exchange databases, logs, and the bin directory).
  • Hardware failure is another possibility that should be considered and researched through the hardware vendor.
  • A hung process may be consuming resources to the exclusion of all others.
  • A large portion of the time the system actually is bottlenecked. This can be verified by checking the processor performance counters gathered by OpsMgr to determine if there is a consistent bottleneck. You can also check this by logging into the system and using Task Manager to determine what is using up CPU cycles. Most likely, it is a process running on the system that is using too much processing.
  • A great Microsoft discussion on Processor Bottlenecks is available at http://technet.microsoft.com/en-us/library/aa995907.aspx.

Resolution: Add more processing resources (faster processors, additional processors), replace the system with stronger processor(s), split the load through network load balancing, or move off programs/services creating load to the system. Until the processing bottleneck can be addressed, determine from the trending of the performance counters what an acceptable level is for this particular system in your organization and set an override so that alerts will be generated only if the system goes beyond the levels identified for the server.
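One way to turn "trending of the performance counters" into an override value is to take a high percentile of the collected samples and add some headroom. The sketch below does that with a nearest-rank percentile; the function name, the 95th percentile, and the 5-point margin are all assumptions, so adjust them to whatever is acceptable for the system in question.

```python
import math


def suggested_threshold(samples, percentile=95, margin=5.0, ceiling=100.0):
    """Nearest-rank percentile of the samples plus a margin, capped at ceiling."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return min(ceiling, ordered[rank - 1] + margin)


# Hypothetical CPU utilization samples, in percent.
cpu_samples = [40, 55, 60, 62, 70, 75, 80, 85, 88, 90]
print(suggested_threshold(cpu_samples))
```

The result is only a starting point for the override. A system that sits near the ceiling all the time still needs the capacity fixes described above.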

Alert: Total Percentage Interrupt Time is too high

Issue: Most likely, the processor on the system is currently over-utilized and indicating a bottleneck condition. Common potential causes for this include:

  • Misconfigured anti-virus can cause high processor utilization if files which should be excluded from scanning are not (such as for Exchange databases, logs, and the bin directory).
  • Hardware failure is another possibility that should be considered and researched through the hardware vendor.
  • A hung process may be consuming resources to the exclusion of all others.
  • A large portion of the time the system actually is bottlenecked. This can be verified by checking the processor performance counters gathered by OpsMgr to determine if there is a consistent bottleneck. You can also check this by logging into the system and using Task Manager to determine what is using up CPU cycles. Most likely, it is a process running on the system that is using too much processing.
  • A great Microsoft discussion on Processor Bottlenecks is available at http://technet.microsoft.com/en-us/library/aa995907.aspx.

Resolution: Add more processing resources (faster processors, additional processors), replace the system with stronger processor(s), split the load through network load balancing, or move off programs/services creating load to the system. Until the processing bottleneck can be addressed, determine from the trending of the performance counters what an acceptable level is for this particular system in your organization and set an override so that alerts will be generated only if the system goes beyond the levels identified for the server.

Alert: Windows Event 2008 - Unable to read an event log

Issue: The application log file had become corrupted in one instance, and the server application log in another instance.

Resolution: Verified that the server was not in some way restricting access to the log file. Used the Computer Management task to fix the corrupt event log through right-clicking on the event log and choosing the option to Clear all Events and then re-opening the event log that had been corrupt.

Windows Server Management Pack Evolution

Overall, the Windows Server management pack provides a very strong set of functionality for Windows Operating Systems. One useful addition would be more diagnostics and recoveries, such as one to run the disk cleanup utility in low disk space situations, and one to report on where drive space is used on a disk that is running low on space.

01/10/2009

OpsMgr R2 by Example: the SQL Server Management Pack

The SQL Server management pack is available as a single download which contains different libraries to monitor SQL 2000, 2005 and 2008 database servers.

How to Install the SQL Server MP

  1. Download the SQL Server Management Pack from the Management Pack Catalog (http://technet.microsoft.com/en-us/opsmgr/cc539535.aspx). The SQL Server Management Pack Guide is included in the download and labeled “OM2007_MP_SQLSrvr.doc.”
  2. Read the Management Pack guide – cover to cover. This document spells out in detail some important pieces of information you will need to know.
  3. Import the SQL Server Management Pack (using either the Operations console or PowerShell). It is recommended that you also import the appropriate version of the Windows Server management pack (Windows 2000, 2003, or 2008). The Windows Server management packs monitor various aspects of the OS that can influence the performance of those computers running SQL Server! This includes disk capacity, disk performance, memory utilization, network adapter utilization, and processor performance.
  4. Running the SQL Server Management Studio and SQL Profiler tasks from the OpsMgr console requires that the software be installed on every OpsMgr computer where these tasks will execute; otherwise you will receive the error message “the system cannot find the file specified.” Installing Management Studio and Profiler is not required unless you want to run those tasks.
  5. If your environment includes clustered SQL Servers, enable Agent Proxy configuration on all members of the SQL cluster. This is in the Administration space, under Administration -> Device Management -> Agent Managed. Right-click each SQL Server in the cluster, select Properties, click the Security tab, and then check the box labeled Allow this agent to act as a proxy and discover managed objects on other computers.
  6. Create a SQLServer_Overrides management pack to contain any overrides required for the MP.

The SQL Server MP supports agentless monitoring with the exception of tasks that start and stop SQL Server services and SQL Server mail. The management pack installs two Run As Profiles: the SQL Server Discovery account and the SQL Server Monitoring account. By default, the management pack uses the Default Action account.

SQL MP Optional Configuration

The SQL Server MP does not automatically discover all object types. Go to the Authoring Pane of the Operations console to enable discovering additional components. Components not discovered include:

  • SQL Server 2008 Distributor
  • SQL Server 2005 Distributor
  • SQL Server 2008 Publisher
  • SQL Server 2005 Publisher
  • SQL Server 2008 Subscriber
  • SQL Server 2005 Subscriber
  • SQL Server 2008 Subscription
  • SQL Server 2005 Subscription
  • SQL Server 2008 Agent Job
  • SQL Server 2005 Agent Job
  • SQL Server 2000 Agent Job
  • SQL Server 2008 DB File Group
  • SQL Server 2005 DB File Group
  • SQL Server 2008 DB File
  • SQL Server 2005 DB File

This means you will not receive alerts for these objects failing since they are not even discovered objects! For example, if you have scheduled SQL backups using the SQL Agent and the job fails, OpsMgr won't tell you about it. If an agent job failed in MOM 2005, the SQL MP generated an alert. So these behaviors are not necessarily the same between MOM 2005 and OpsMgr 2007.

You can use overrides to change the settings for automatic discovery to enable these object types. Be sure to change your settings in an unsealed MP other than the Default management pack.

SQL MP Tuning / Alerts to look for

The following alerts were encountered and resolved while tuning the various SQL Server management packs (these are listed in alphabetical order by Alert name):

Alert: A SQL job failed to complete successfully.

Issue: A variety of scripts were failing on the system but the scripts were on a development server.

Resolution: Created an override to disable this alert on the development server experiencing the issues as there was no action required to address these on the development environment. Closed the alerts.

Alert: A SQL job failed to complete successfully.

Issue: A variety of scripts were failing on the system but the scripts were on a server which had a database that had been decommissioned.

Resolution: These jobs were not required. Accessed the SQL server and disabled each of the failing jobs. Closed the alerts.

Alert: A SQL job failed to complete successfully.

Issue: Of close to 100 systems in the environment, only about 5 servers were creating these alerts.

Resolution: Created an override to set these to low priority informational instead of warnings as no action was being taken when they occurred.

Alert: Auto Close Flag

Issue: The auto close flag for database MSCUPTDB in SQL instance MSSQL SERVER on computer 123.abc.com is not set according to best practice.

Resolution: As this is a standard Microsoft application (patch Management for SMS and Configuration Manager) and a default configuration, created an override to exclude this database.

Alert: Auto Close Flag

Issue: The auto close flag was set on a database that was used for anti-virus. This was an MSDE database that had been upgraded to a full SQL server, so it had most likely kept the MSDE default of auto close set to true. The auto close setting is discussed at http://msdn.microsoft.com/en-us/library/ms190249.aspx. Per this article:

“True for all databases when using SQL Server 2000 Desktop Engine or SQL Server Express, and False for all other editions, regardless of operating system.”

Resolution: Changed the auto close setting on the database to false.

Alert: Auto Shrink Flag

Issue: The auto shrink flag for database (DBNAME) in SQL instance MSSQL SERVER on computer 123.abc.com is not set according to best practice.

Resolution: This was found on a series of standard Microsoft applications including SUSDB, WSUS and MSCUPTDB. Additional databases found with the Auto Shrink Flag: Backup Exec, ItAssist, SOE, DSPre, XRXDBDiscovery (Xerox), XRXDBCWW (Xerox). The options are to contact the vendor to determine whether this flag can be changed (and change it if so), or to create an override that excludes the database from monitoring of this configuration.
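Where the vendor confirms the flag can be changed, the change itself is a single ALTER DATABASE statement per database. The Python sketch below only generates those statements so they can be reviewed before anyone runs them; it does not connect to SQL Server. The function name and the database name in the usage are made up for illustration and are not part of the management pack.

```python
def flag_off_statements(databases, flags=("AUTO_CLOSE", "AUTO_SHRINK")):
    """Return T-SQL statements that turn the given flags off for each database."""
    allowed = {"AUTO_CLOSE", "AUTO_SHRINK"}
    for flag in flags:
        if flag not in allowed:
            raise ValueError(f"unsupported flag: {flag}")

    statements = []
    for name in databases:
        # Double any closing bracket so the name is safe inside [ ] delimiters.
        quoted = "[" + name.replace("]", "]]") + "]"
        for flag in flags:
            statements.append(f"ALTER DATABASE {quoted} SET {flag} OFF;")
    return statements


if __name__ == "__main__":
    # "AntiVirusDB" is a hypothetical database name.
    for line in flag_off_statements(["AntiVirusDB"], flags=("AUTO_CLOSE",)):
        print(line)
```

Databases that must keep the vendor's setting are left out of the list and handled with an override instead.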

Alert: Cannot start SQL Server Service Broker on Database

Issue: This occurred when a large number of management packs and reports were being imported into the management environment. A total of three events (event ID 9697) were created in the application log, each occurring about two minutes after the prior one.

Resolution: Validated that the error was no longer occurring and traced it back to a period of time when the OpsMgr environment was under significant strain.

Alert: Could not allocate space for object in database because the filegroup is full

Issue: The database could not extend because the filegroup was full. Logged into the server and verified that there was free disk space and that the filegroup was set to auto-grow. Attempted to manually extend the database but it failed because it was running in MSDE and was restricted in size to 4096 MB (see http://databases.aspfaq.com/database/what-are-the-limitations-of-msde.html for limits on MSDE).

Resolution: Moved the database from MSDE to a full version of SQL server and then expanded the database.

Alert: Percentage Change in DB % Used Space

Issue: This was on the ReportServerTempDB database.

Resolution: The ReportServerTempDB database is very small to begin with (6MB), and major percentage changes occur because it is a temporary database. Threshold ranges are between 25% and 45% (low value of threshold = 25, high value of threshold = 45). Created an override for this specific database due to its size: Threshold1 = 45 (this is the growth size, increased from 25), Threshold2 = 55 (this is the shrink size, ended up leaving it this way). With a database this small, the percentage figures become out of whack compared to the actual change in size.
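A quick bit of arithmetic shows why a percentage threshold misbehaves on a database this small. This is a Python sketch under an assumption: it treats the monitor as comparing two samples of used space, which may not be exactly how the MP calculates it. The sample sizes are made-up numbers, and the thresholds are the growth values mentioned above.

```python
def percent_change(old_mb, new_mb):
    """Percentage change in used space between two samples, in percent."""
    if old_mb <= 0:
        raise ValueError("old_mb must be positive")
    return (new_mb - old_mb) / old_mb * 100.0


def exceeds(old_mb, new_mb, threshold_pct):
    """True when the size of the change (growth or shrink) passes the threshold."""
    return abs(percent_change(old_mb, new_mb)) > threshold_pct


if __name__ == "__main__":
    # Hypothetical samples: the same 2 MB of growth on two databases.
    print(exceeds(6.0, 8.0, 25))        # 6 MB database, default threshold
    print(exceeds(6.0, 8.0, 45))        # 6 MB database, overridden threshold
    print(exceeds(6000.0, 6002.0, 25))  # 6 GB database, default threshold
```

With these numbers, 2 MB of growth on the 6 MB database is a change of about 33%, which passes the default threshold of 25 but not the overridden value of 45. The same 2 MB on a 6 GB database is a fraction of a percent.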

Alert: Service Check Data Source Module Failed Execution

Issue: Error getting state of service, error 0x8007007b. Documented in the SQL management pack guide:

“If the SQL Full Text Search service is not installed on computers running SQL Server 2005 that are being monitored, disable the monitor”

Resolution:

For two systems that were running MSDE, configured an override to disable the alert.

For another system, the service had been set to manual. Configured the service to run automatically and started the service.

For another system, the primary instance had this service but the additional instance installed on it did not. Configured an override to disable the alert.

Finally one system had the service running for a second instance but not for the first instance. Configured an override to disable the alert.

Alert: Service Check Probe Module Failed Execution

Issue: Error getting state of service, error 0x8007007b for workflow name Microsoft.SQLServer.2008.DBEngine.FullTextServiceMonitor. Documented in the SQL management pack guide:

“If the SQL Full Text Search service is not installed on computers running SQL Server 2005 that are being monitored, disable the monitor”

Resolution:

For two systems that were running MSDE, configured an override to disable the alert.

For another system, the service had been set to manual. Configured the service to run automatically and started the service.

For another system, the primary instance had this service but the additional instance installed on it did not. Configured an override to disable the alert.

Finally one system had the service running for a second instance but not for the first instance. Configured an override to disable the alert.

Alert: Service Pack Compliance - MSSQLSERVER (SQL 2005 DB Engine) Warning

Issue: The database server was running SQL 2005 service pack (SP) 2, which is acceptable for the ACS database server (SQL 2005 SP 2 has been approved for all OpsMgr database components per threads on the newsgroups).

Resolution: Created an override (for a specific object of type SQL Engine DB) that sets the Enabled parameter to False for this server, which allows this configuration. Reset the health for this monitor on the server and refreshed; the state updated from yellow to green.

Alert: Service Pack Compliance

Issue: The SQL server in question was a SQL Server 2005 system that was believed to be running SP 2. OpsMgr identified this as non-compliant because the rule was checking for service pack 1. Verified the version of SQL through the query SELECT @@Version, which returned:

Microsoft SQL Server 2005 - 9.00.1399.06 (Intel X86)

Oct 14 2005 00:33:37

Copyright (c) 1988-2005 Microsoft Corporation

Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

This issue occurred using version 6.0.6278.8 of the SQL management pack. There was a new version of the management pack available (6.0.6460.0). Downloaded and installed the new version of the management pack and closed the alert. This did not resolve the issue.

Created an override to set the “Good Value” from 1 to 2 within the Service Pack Compliance monitor and stored it in the MicrosoftSQLServer_Overrides management pack created for the SQL MP. Closed the alert, but this did not resolve the issue either: the MP was correctly identifying the error condition, and the OpsMgr administrator was misinterpreting it.

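http://support.microsoft.com/kb/321185 provided clarification as to what was seen here. The SQL install itself was RTM (build 9.00.1399.06), which did NOT have SP 1 installed. The “Service Pack 2” in the @@Version output refers to the Windows operating system, not to SQL Server.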

Resolution: Installed SQL 2005 SP 2 on the system and closed the alert.
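The misreading above is easy to make, because the service pack shown at the end of the @@Version output belongs to Windows, and the SQL Server level has to be read from the build number. As a rough Python sketch of that lookup: the function name is made up for illustration, and the table lists only the base build of each SQL Server 2005 service pack, so hotfix builds are treated as the service pack they sit on top of.

```python
import re

# Base build numbers for SQL Server 2005 (9.00.x), highest first.
SQL2005_BUILDS = [
    (5000, "SP4"),
    (4035, "SP3"),
    (3042, "SP2"),
    (2047, "SP1"),
    (1399, "RTM"),
]


def sql2005_level(version_text):
    """Return the SQL Server 2005 service pack level implied by @@Version output."""
    # Match the SQL build (9.00.nnnn.nn), not the Windows build on the last line.
    match = re.search(r"\b9\.00\.(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError("no SQL Server 2005 build number found")
    build = int(match.group(1))
    for floor, level in SQL2005_BUILDS:
        if build >= floor:
            return level
    return "pre-release"


if __name__ == "__main__":
    output = (
        "Microsoft SQL Server 2005 - 9.00.1399.06 (Intel X86)\n"
        "Oct 14 2005 00:33:37\n"
        "Copyright (c) 1988-2005 Microsoft Corporation\n"
        "Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)\n"
    )
    print(sql2005_level(output))
```

Run against the output captured from this server, the lookup reports RTM, which matches what the Service Pack Compliance monitor was flagging.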

Alert: The SQL Server Service Broker or Database Mirroring transport is disabled or not configured

Issue: On the alert context, you will see that the description for event ID 9666 says “The Database Mirroring protocol transport is disabled or not configured.” What is interesting here is that the same event ID (9666) can mean two different things: “The Service Broker protocol transport is disabled or not configured” or “The Database Mirroring protocol transport is disabled or not configured.” The only way to tell which of the two situations is occurring is to look at the alert context tab for the alert.

Resolution: The database mirroring protocol error reported is only relevant if database mirroring will be used on the server. If mirroring is not going to be used on the server, the alert should be disabled. To do so create an override to disable the alert for the specific server reporting the alert (again assuming that it doesn’t require database mirroring) and store it in a custom management pack (MicrosoftSQLServer_Overrides). Manually closed the alert.
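Since event ID 9666 covers both transports, any scripted triage of exported alerts has to look at the description text rather than the event ID. A minimal Python sketch of that check follows. The function name is made up, and it assumes the description contains one of the two phrases quoted above.

```python
def transport_for_9666(description):
    """Say which transport an event 9666 description refers to."""
    text = description.lower()
    if "database mirroring" in text:
        return "Database Mirroring"
    if "service broker" in text:
        return "Service Broker"
    return "unknown"


if __name__ == "__main__":
    sample = "The Database Mirroring protocol transport is disabled or not configured."
    print(transport_for_9666(sample))
```

Only alerts that come back as Database Mirroring on servers that do not use mirroring are candidates for the override described above.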

SQL Server Management Pack Evolution

Microsoft wrote the SQL Server 2008 management pack to have functional parity with the SQL Server 2005 management pack. Other than SQL Job monitoring, the management pack focuses primarily on the condition of SQL Server and its services, rather than what is built on top of those installed objects and how they are configured and running.

It would be nice if future versions could incorporate monitoring of SQL Server 2008 policy management and the database tuning advisor. SQL 2008 also incorporates a resource governor, which puts constraints on the various SQL components. An updated management pack should be aware of the constraints the governor puts on the components, so that thresholds become relative to what has been set rather than to what the overall system is otherwise capable of doing.

SQL Server 2008’s Performance Studio and Performance Data Collection components have functionality at a depth beyond the current SQL MP. Incorporating this requires a transition from the high-level monitoring provided by the MP to the lower level tracing and reporting provided by SQL Server itself. Optimally, a new SQL MP would defer to the performance data collector, with tasks, diagnostics, and such to make that happen. There also would probably need to be some discovery and diagramming work added to the MP to show relationships between those components running collections and the warehouse where the results are stored.