Performance Tuning Series – Monitoring and Alerts: Staying Ahead of Issues

A proactive approach to monitoring and alerting is key to maintaining SQL Server performance. Monitoring provides real-time insight into the database’s health, resource consumption, and potential bottlenecks, while alerts enable prompt responses to issues before they impact users. SQL Server’s built-in tools, along with third-party solutions, offer effective ways to track key performance metrics, identify anomalies, and set up automated alerts to help stay ahead of performance problems.

Why Monitoring and Alerts Matter for Performance Optimization

Constantly changing workloads, hardware constraints, and system settings make SQL Server performance unpredictable. Monitoring helps identify trends and sudden deviations in performance, so you can take preventive measures or apply optimizations before problems become critical. Alerts provide early warnings for issues like high CPU usage, long-running queries, or approaching storage limits, enabling quick intervention.

Key Monitoring and Alerting Best Practices for SQL Server

1. Define Key Performance Indicators (KPIs)

Before setting up a monitoring system, it’s essential to establish which metrics or KPIs best represent your SQL Server’s health and performance. These indicators should cover areas such as CPU, memory, disk usage, and specific SQL Server metrics.

  • Best Practice:
    • Track essential KPIs, including CPU utilization, memory usage, disk I/O, query wait times, and buffer cache hit ratio.
    • For databases with heavy I/O demands, monitor page life expectancy (PLE), which estimates how many seconds a data page stays in the buffer cache without being referenced.
    • Keep an eye on log growth and transaction log space usage for signs of excessive logging, which may indicate poorly optimized queries or large transactions.
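
As a minimal sketch, two of these KPIs can be read from the sys.dm_os_performance_counters DMV. The LIKE filter on object_name is used because the counter object prefix differs between default and named instances. What counts as a healthy value depends on your workload, so no threshold is implied here.

```sql
-- Page life expectancy, in seconds, from the Buffer Manager counters.
SELECT [object_name], counter_name, cntr_value AS page_life_expectancy_s
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND [object_name] LIKE N'%Buffer Manager%';

-- Buffer cache hit ratio is stored as a raw value plus a "base" value;
-- the percentage is the ratio divided by its base.
SELECT 100.0 * r.cntr_value / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio_pct
FROM sys.dm_os_performance_counters AS r
JOIN sys.dm_os_performance_counters AS b
    ON b.[object_name] = r.[object_name]
WHERE r.counter_name = N'Buffer cache hit ratio'
  AND b.counter_name = N'Buffer cache hit ratio base'
  AND r.[object_name] LIKE N'%Buffer Manager%';
```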

2. Use Dynamic Management Views (DMVs) for Real-Time Monitoring

SQL Server’s Dynamic Management Views (DMVs) provide valuable insights into system performance by capturing real-time data on query execution, memory usage, and index efficiency. DMVs are essential for identifying specific issues that require immediate attention.

  • Best Practice:
    • Use DMVs like sys.dm_exec_query_stats to identify long-running or resource-intensive queries and sys.dm_os_wait_stats to understand wait types and locate bottlenecks.
    • Regularly analyze index usage using sys.dm_db_index_usage_stats to determine which indexes are frequently used and which are candidates for removal.
    • Automate DMV queries to gather data at regular intervals and retain historical performance data for trend analysis.
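
The following sketch shows one common way to list the most CPU-intensive statements from sys.dm_exec_query_stats. The TOP (10) cutoff is an arbitrary choice, and the figures only cover plans that are still in the plan cache.

```sql
-- Top 10 cached statements by total CPU time.
-- total_worker_time and total_elapsed_time are reported in microseconds.
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_elapsed_time / 1000 AS total_elapsed_ms,
    qs.execution_count,
    qs.total_logical_reads,
    -- Offsets are byte positions in the Unicode batch text, hence the / 2.
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

Saving this output to a table on a schedule is one way to build the historical data mentioned above.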

3. Set Up SQL Server Performance Alerts

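SQL Server Agent allows you to configure alerts that fire on specific error numbers or severities, on SQL Server performance counter thresholds, or on WMI events, which covers conditions such as memory pressure, blocked processes, or connection errors. An alert can notify an operator by email or run a job in response; job failures are reported through the job’s own notification settings.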

  • Best Practice:
    • Configure alerts for CPU usage exceeding 80%, memory pressure warnings, blocked processes, disk space thresholds, and failed SQL Agent jobs.
    • Set up alerts for critical wait types, like PAGEIOLATCH_* for disk-related bottlenecks or LCK_M_* for blocking and locking issues.
    • Ensure that alerts provide actionable information and avoid alert fatigue by fine-tuning alert thresholds and focusing on the most critical metrics.
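
As a rough illustration, an Agent alert and its email notification can be created with the msdb stored procedures sp_add_alert and sp_add_notification. The alert name and the operator name 'DBA Team' are assumptions for this example, and the operator must already exist. The sketch uses a severity-based alert; counter-based alerts use the @performance_condition parameter instead.

```sql
USE msdb;
GO
-- Fire on any severity 17 error (insufficient resources).
EXEC dbo.sp_add_alert
    @name = N'Severity 17 - insufficient resources', -- hypothetical alert name
    @severity = 17,
    @delay_between_responses = 300,      -- seconds to wait before responding again
    @include_event_description_in = 1;   -- include the error text in the email

-- Send the alert to an existing operator by email.
EXEC dbo.sp_add_notification
    @alert_name = N'Severity 17 - insufficient resources',
    @operator_name = N'DBA Team',        -- assumption: operator already defined
    @notification_method = 1;            -- 1 = email
```

The delay between responses is one simple lever against alert fatigue, since it stops a recurring error from sending an email every time it occurs.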

4. Utilize SQL Server Extended Events for Detailed Diagnostics

SQL Server Extended Events offers a lightweight, customizable framework for tracking detailed performance data. Extended Events can capture detailed information on query execution, deadlocks, wait times, and system-level events without a significant performance impact.

  • Best Practice:
    • Create custom Extended Events sessions to capture specific performance issues, like long-running queries or high wait times.
    • Use Extended Events to monitor deadlocks and capture the SQL text, session ID, and resources involved for troubleshooting.
    • Archive Extended Events data for historical analysis, as this information is valuable for identifying recurring issues or performance trends.
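
A minimal sketch of such a session is shown below. The session name, the 5-second threshold, and the file name are assumptions chosen for illustration; duration for this event is measured in microseconds.

```sql
-- Capture statements that run longer than 5 seconds.
CREATE EVENT SESSION [LongRunningQueries] ON SERVER  -- hypothetical session name
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.session_id, sqlserver.database_name)
    WHERE duration > 5000000   -- microseconds
)
ADD TARGET package0.event_file
(
    SET filename = N'LongRunningQueries.xel',  -- written to the default log directory
        max_file_size = 50,                    -- megabytes per file
        max_rollover_files = 4
)
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS, STARTUP_STATE = OFF);
GO
ALTER EVENT SESSION [LongRunningQueries] ON SERVER STATE = START;
```

Writing to an event_file target rather than the ring buffer keeps the data after a restart, which suits the archiving practice described above.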

5. Implement Database Monitoring Tools

In addition to built-in tools, various third-party solutions (e.g., SolarWinds, Redgate SQL Monitor, and Idera SQL Diagnostic Manager) provide advanced monitoring and alerting capabilities for SQL Server. These tools often offer features like dashboards, reporting, automated analysis, and recommendations.

  • Best Practice:
    • Choose a tool that provides real-time monitoring, historical trend analysis, and robust alerting based on your organization’s needs and budget.
    • Look for monitoring solutions that include visualizations for easy diagnosis of issues like query bottlenecks or resource contention.
    • Use third-party tools that support customizable alerts and integrate with incident management systems to streamline response workflows.

6. Monitor Query Performance

Query performance is one of the most significant factors affecting SQL Server’s efficiency. Using monitoring to detect slow or inefficient queries helps with timely tuning and optimization.

  • Best Practice:
    • Regularly monitor execution times, CPU usage, and I/O usage for the top resource-consuming queries, and identify queries that are consistently resource-intensive.
    • Use Query Store in SQL Server to track query plans and execution statistics, making it easier to identify performance regressions after schema changes or updates.
    • Set alerts for queries running above a specific threshold or those with high execution counts that may indicate inefficiencies.
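
The sketch below enables Query Store and lists the queries with the highest average duration. [YourDatabase] is a placeholder name, and the query must be run in the context of that database. Query Store reports avg_duration in microseconds.

```sql
ALTER DATABASE [YourDatabase] SET QUERY_STORE = ON;
ALTER DATABASE [YourDatabase] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);
GO
-- Top 10 queries by execution-weighted average duration.
SELECT TOP (10)
    q.query_id,
    qt.query_sql_text,
    COUNT(DISTINCT p.plan_id) AS plan_count,   -- more than 1 can hint at plan changes
    SUM(rs.count_executions) AS executions,
    SUM(rs.avg_duration * rs.count_executions)
        / NULLIF(SUM(rs.count_executions), 0) / 1000.0 AS avg_duration_ms
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY avg_duration_ms DESC;
```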

7. Monitor Tempdb Usage

Since tempdb is frequently used for temporary storage, sorting, and intermediate query results, high tempdb usage or contention can degrade performance. Monitoring tempdb ensures that it has adequate space and can handle temporary object creation demands.

  • Best Practice:
    • Monitor tempdb space usage using DMVs like sys.dm_db_task_space_usage and sys.dm_db_session_space_usage.
    • Configure alerts for tempdb growth or high utilization, which could indicate inefficient query processing or excessive use of temporary tables.
    • Use monitoring to detect contention on tempdb, especially on allocation pages like PFS, GAM, and SGAM, and address it by adding more equally sized data files.
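
As an illustrative sketch, the queries below summarize how tempdb space is being used and which sessions hold the most of it. Page counts are converted to megabytes using the 8 KB page size.

```sql
-- Overall tempdb space by consumer type.
SELECT
    SUM(user_object_reserved_page_count) * 8 / 1024.0 AS user_objects_mb,
    SUM(internal_object_reserved_page_count) * 8 / 1024.0 AS internal_objects_mb,
    SUM(version_store_reserved_page_count) * 8 / 1024.0 AS version_store_mb,
    SUM(unallocated_extent_page_count) * 8 / 1024.0 AS free_mb
FROM tempdb.sys.dm_db_file_space_usage;

-- Sessions currently holding the most tempdb space.
SELECT TOP (10)
    session_id,
    (user_objects_alloc_page_count - user_objects_dealloc_page_count) * 8 / 1024.0
        AS user_objects_mb,
    (internal_objects_alloc_page_count - internal_objects_dealloc_page_count) * 8 / 1024.0
        AS internal_objects_mb
FROM tempdb.sys.dm_db_session_space_usage
ORDER BY (user_objects_alloc_page_count - user_objects_dealloc_page_count)
       + (internal_objects_alloc_page_count - internal_objects_dealloc_page_count) DESC;
```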

8. Create Custom Alerts for Resource-Specific Metrics

Different applications and workloads may have unique performance demands, so SQL Server supports custom alerts tailored to your specific environment. Custom alerts allow more granular monitoring of resource-specific metrics, such as long lock times or unusually high I/O operations.

  • Best Practice:
    • Set custom alerts for lock escalation, deadlocks, and blocking sessions that exceed specified thresholds.
    • Create alerts for excessive logins, failed login attempts, or unusual access patterns for security monitoring.
    • Use custom alerts for sudden changes in query execution plans, which may indicate suboptimal plan choices or regressions due to outdated statistics.
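
A custom blocking check could be built around a query like the one below, run on a schedule by a job that raises an error or sends mail when rows are returned. The 30-second threshold is an assumption and should be tuned to your environment.

```sql
-- Requests that have been blocked for more than 30 seconds.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time AS wait_time_ms,
    r.wait_resource,
    DB_NAME(r.database_id) AS database_name,
    t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
  AND r.wait_time > 30000   -- milliseconds; assumed threshold
ORDER BY r.wait_time DESC;
```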

9. Enable Automated Responses for Critical Alerts

Responding to alerts manually can be time-consuming, so SQL Server Agent allows you to run a job in response to a specific alert. Automated responses can include running diagnostic scripts, restarting services, or adjusting resources temporarily to prevent further degradation.

  • Best Practice:
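    • For critical alerts, configure automated responses such as scaling up cloud resources if available, clearing cache, or restarting SQL Server services. Treat cache clearing and service restarts as last resorts, since both discard cached data and plans and can make performance worse in the short term.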
    • Set automated scripts to collect additional diagnostics when an alert is triggered, helping capture valuable data for post-incident analysis.
    • Use escalation protocols to ensure that critical alerts that require manual intervention are immediately directed to the right team members.
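
As a sketch of a diagnostic response, the example below attaches a job to an alert with sp_update_alert and shows the kind of statement that job might run. The alert name, job name, and snapshot table are all hypothetical and must already exist in your environment.

```sql
-- Run a diagnostics job whenever the alert fires.
EXEC msdb.dbo.sp_update_alert
    @name = N'Severity 17 - insufficient resources',  -- hypothetical existing alert
    @job_name = N'Collect diagnostics snapshot';      -- hypothetical existing job

-- Example job step body: keep a timestamped copy of the wait statistics.
-- dbo.WaitStatsSnapshot is a hypothetical table with matching columns.
INSERT INTO dbo.WaitStatsSnapshot
    (captured_at, wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT SYSUTCDATETIME(), wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;
```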

10. Log Monitoring for Long-Term Insights

Long-term log monitoring is useful for understanding performance trends, identifying recurring issues, and tracking the impact of changes over time. SQL Server’s Error Log, Windows Event Log, and system_health Extended Events session are valuable sources of diagnostic information.

  • Best Practice:
    • Regularly review SQL Server Error Logs for messages related to I/O warnings, login failures, and deadlocks (deadlock details appear in the error log only when trace flag 1204 or 1222 is enabled), and establish alerts for critical log entries.
    • Monitor Windows Event Logs for system-level alerts related to hardware failures, memory issues, or networking problems that could affect SQL Server performance.
    • Use log aggregation tools like Elasticsearch, Splunk, or Azure Monitor to centralize and analyze logs, allowing for faster identification of trends and potential issues.
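
For example, the deadlock reports recorded by the system_health session can be read back from its event files with a query along these lines. This sketch assumes the files are in the instance’s default log directory; the files roll over, so only recent history is available.

```sql
-- Deadlock graphs captured by the built-in system_health session.
SELECT
    CAST(f.event_data AS XML).value('(event/@timestamp)[1]', 'datetime2') AS event_time_utc,
    CAST(f.event_data AS XML) AS deadlock_report
FROM sys.fn_xe_file_target_read_file(N'system_health*.xel', NULL, NULL, NULL) AS f
WHERE f.[object_name] = N'xml_deadlock_report'
ORDER BY event_time_utc DESC;
```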

Conclusion

Proactive monitoring and alerting ensure that SQL Server remains resilient, responsive, and optimized for performance over time. By defining KPIs, setting up automated alerts, monitoring query and resource usage, and utilizing both built-in and third-party tools, you can identify performance issues early and take corrective actions before they impact end-users. Implementing a structured monitoring and alerting strategy is essential for long-term SQL Server performance optimization and helps your team stay one step ahead of potential bottlenecks.

Performance Tuning Series – Regular Maintenance

SQL Server performance isn’t just about the initial setup or database design—it requires continuous maintenance to ensure it runs smoothly over time. Neglecting regular maintenance can lead to fragmentation, slow queries, data integrity issues, and ultimately, downtime. By implementing a comprehensive maintenance strategy, you can ensure that your SQL Server databases remain optimized and healthy, allowing them to perform efficiently even as workloads and data volumes grow.

Why Regular Maintenance is Critical

Over time, SQL Server databases accumulate various inefficiencies that can degrade performance. These include fragmented indexes, outdated statistics, growing transaction logs, and unused or bloated data. Regular maintenance tasks help to mitigate these issues, ensuring that SQL Server can continue to execute queries quickly, handle transactions efficiently, and maintain data integrity.

Key Regular Maintenance Tasks for SQL Server

1. Index Rebuilding and Reorganization

Indexes can become fragmented over time as data is inserted, updated, or deleted. Fragmentation occurs when the logical order of pages in an index no longer matches the physical order on disk. This leads to slower reads, as SQL Server must perform additional I/O to retrieve scattered data.

  • Rebuild Indexes: Index rebuilds recreate the index from scratch, removing fragmentation and improving query performance. An offline rebuild locks the table for the duration, so it should be scheduled during periods of low activity; editions that support online index operations can rebuild with ONLINE = ON to keep the table available.
  • Reorganize Indexes: Reorganizing indexes is a less intrusive process that defragments the leaf level in place. It always runs online and holds only short-lived locks, so it can be done during regular operations, but it is less effective than a full rebuild.

Best Practice:

  • Schedule regular index maintenance based on the level of fragmentation. Use SQL Server’s sys.dm_db_index_physical_stats DMV to check fragmentation levels:
    • Below 10% fragmentation: No action needed.
    • 10-30% fragmentation: Use index reorganization.
    • Above 30% fragmentation: Perform a full index rebuild.
  • Automate index maintenance using SQL Server Agent jobs or a dedicated maintenance tool to ensure this task is performed regularly without manual intervention.
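
The thresholds above can be applied with a query like the following sketch, run in the database being checked. The 1,000-page minimum is an assumption added to skip small indexes, where fragmentation figures mean little, and the index and table names in the ALTER INDEX examples are hypothetical.

```sql
-- Suggest an action per index based on the thresholds listed above.
SELECT
    OBJECT_SCHEMA_NAME(ips.object_id) AS schema_name,
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count,
    CASE
        WHEN ips.avg_fragmentation_in_percent > 30 THEN 'REBUILD'
        WHEN ips.avg_fragmentation_in_percent >= 10 THEN 'REORGANIZE'
        ELSE 'NONE'
    END AS suggested_action
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.index_id > 0          -- skip heaps
  AND ips.page_count > 1000     -- assumed minimum size worth maintaining
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- Hypothetical index and table names:
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REORGANIZE;
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD;
```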

2. Update Statistics

SQL Server uses statistics to estimate the distribution of data values in a table, which helps the query optimizer choose the most efficient execution plan. As data is modified, these statistics can become outdated, leading to suboptimal query plans and slower performance.

  • Best Practice:
    • Regularly update statistics on your tables and indexes to ensure that the query optimizer has the most accurate information. Use the UPDATE STATISTICS command or enable SQL Server’s auto-update statistics feature.
    • For large tables, use sampled statistics to balance performance with accuracy. Full scans of very large tables can be resource-intensive.
    • If queries slow down unexpectedly, manually update statistics to resolve potential performance issues caused by outdated statistics.
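
As a sketch, sys.dm_db_stats_properties shows when each statistic was last updated and how many modifications have accumulated since then, which helps decide where a manual update is worthwhile. The table and index names in the UPDATE STATISTICS examples are hypothetical.

```sql
-- Statistics on user tables, most-modified first.
SELECT
    OBJECT_SCHEMA_NAME(s.object_id) AS schema_name,
    OBJECT_NAME(s.object_id) AS table_name,
    s.name AS stats_name,
    sp.last_updated,
    sp.rows,
    sp.rows_sampled,
    sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.modification_counter DESC;

-- Sampled update for a large table, full scan for one specific statistic.
UPDATE STATISTICS dbo.Orders WITH SAMPLE 25 PERCENT;
UPDATE STATISTICS dbo.Orders IX_Orders_CustomerId WITH FULLSCAN;
```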

3. Backup and Recovery Management

Regular backups are essential for data protection and business continuity. However, improper backup strategies can lead to bloated transaction logs, excessive disk usage, and even performance degradation during peak times.

  • Best Practice:
    • Implement a backup strategy based on your business’s recovery point objectives (RPO) and recovery time objectives (RTO). Schedule full backups regularly (e.g., daily) and transaction log backups more frequently (e.g., every 15-30 minutes) for critical databases.
    • Use differential backups between full backups to reduce the load on storage and improve recovery times.
    • Regularly test your backups by restoring them to a separate environment to ensure that they can be successfully recovered when needed.
    • Ensure that backup schedules avoid peak activity times to prevent any impact on performance.
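
The three backup types and a basic verification step might look like the sketch below. The database name and the D:\Backups paths are placeholders; in practice each backup would normally get a unique, timestamped file name rather than overwriting with INIT.

```sql
-- Full backup (e.g., daily).
BACKUP DATABASE [YourDatabase]
    TO DISK = N'D:\Backups\YourDatabase_full.bak'
    WITH CHECKSUM, INIT;

-- Differential backup between full backups.
BACKUP DATABASE [YourDatabase]
    TO DISK = N'D:\Backups\YourDatabase_diff.bak'
    WITH DIFFERENTIAL, CHECKSUM, INIT;

-- Transaction log backup (requires the full or bulk-logged recovery model).
BACKUP LOG [YourDatabase]
    TO DISK = N'D:\Backups\YourDatabase_log.trn'
    WITH CHECKSUM, INIT;

-- Quick readability check; this does not replace a real test restore.
RESTORE VERIFYONLY
    FROM DISK = N'D:\Backups\YourDatabase_full.bak'
    WITH CHECKSUM;
```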

4. Transaction Log Management

SQL Server’s transaction log records every modification made to the database. If not properly managed, the transaction log can grow excessively large, consuming valuable disk space and degrading performance.

  • Best Practice:
    • Use the Full Recovery Model for critical databases to ensure point-in-time recovery, but regularly back up the transaction logs to prevent them from growing too large.
    • For less critical databases or databases that don’t require point-in-time recovery, consider using the Simple Recovery Model, which automatically truncates the transaction log after each checkpoint.
    • Monitor transaction log size and schedule log backups frequently to avoid excessive growth. Use the sys.dm_db_log_space_usage DMV to monitor log space consumption.
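
A quick way to check these points is sketched below: the first query shows each database’s recovery model and what, if anything, is preventing log truncation, and the second reports log usage for the current database. [YourDatabase] is a placeholder.

```sql
-- Recovery model and the reason the log cannot currently be truncated.
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases;
GO
USE [YourDatabase];
GO
-- Log size and usage for the current database.
SELECT
    total_log_size_in_bytes / 1048576.0 AS total_log_mb,
    used_log_space_in_bytes / 1048576.0 AS used_log_mb,
    used_log_space_in_percent
FROM sys.dm_db_log_space_usage;
```

A log_reuse_wait_desc of LOG_BACKUP indicates that the log is waiting on a transaction log backup before space can be reused.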

5. Integrity Checks (DBCC CHECKDB)

Database corruption can occur for various reasons, such as hardware failures or improper shutdowns. SQL Server provides the DBCC CHECKDB command to detect and repair corruption in your databases.

  • Best Practice:
    • Run DBCC CHECKDB regularly to ensure data integrity. This process checks for physical and logical corruption in database files.
    • Schedule DBCC CHECKDB during off-peak hours to avoid performance impacts, as this operation can be resource-intensive.
    • If DBCC CHECKDB identifies corruption, address the issue immediately. Use repair options like REPAIR_ALLOW_DATA_LOSS as a last resort, and restore from backups when possible.
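
A typical invocation is sketched below, with [YourDatabase] as a placeholder. The PHYSICAL_ONLY variant is a lighter check that some teams run more often on very large databases, with the full check on a less frequent schedule.

```sql
-- Full logical and physical check; report all errors, suppress informational output.
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Lighter check limited to physical structure.
DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;
```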

6. Tempdb Maintenance

Tempdb is a shared system database that is heavily used by SQL Server for temporary objects, intermediate query results, and sorting. Over time, tempdb can become a performance bottleneck if it is not properly managed.

  • Best Practice:
    • Ensure that tempdb has multiple data files, especially in high-concurrency environments. A common starting point is one equally sized data file per logical CPU core, up to 8 files; add more only if allocation contention persists. This helps reduce contention on allocation pages (like PFS, GAM, and SGAM).
    • Place tempdb on fast storage (preferably SSD or NVMe) to handle its high I/O workload.
    • Regularly monitor tempdb space usage to avoid running out of space, which causes queries that need tempdb to fail. Use sys.dm_db_task_space_usage and sys.dm_db_session_space_usage to track space consumption.
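
To compare the current file layout with the core count and add a file if needed, something like the following sketch can be used. The logical file name, path, and sizes are assumptions; new files should match the size and growth settings of the existing ones.

```sql
-- Current tempdb data files (size is stored in 8 KB pages).
SELECT name, physical_name, size * 8 / 1024 AS size_mb, growth, is_percent_growth
FROM tempdb.sys.database_files
WHERE type_desc = N'ROWS';

-- Logical CPU count visible to SQL Server.
SELECT cpu_count FROM sys.dm_os_sys_info;

-- Add another data file (hypothetical name, path, and sizes).
ALTER DATABASE tempdb
ADD FILE
(
    NAME = N'tempdev2',
    FILENAME = N'T:\TempDB\tempdev2.ndf',
    SIZE = 1024MB,
    FILEGROWTH = 256MB
);
```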

7. Cleanup of Unused or Outdated Data

Over time, databases may accumulate unused data, which can increase table sizes and slow down queries. Regularly cleaning up obsolete data ensures your database remains efficient.

  • Best Practice:
    • Implement a data retention policy that defines how long data should be kept before being archived or deleted. This policy should reflect business requirements while keeping database sizes manageable.
    • Periodically archive old data that is not frequently accessed into separate databases or storage systems.
    • Use automated scripts to clean up old or unused records, freeing up space and reducing index bloat.
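
A cleanup script is usually safer when it deletes in small batches, so that each transaction stays short and the transaction log does not balloon. The sketch below assumes a hypothetical dbo.AuditLog table with a CreatedAt column and a 365-day retention period.

```sql
DECLARE @batch_size INT = 5000;  -- assumed batch size
DECLARE @rows INT = 1;

-- Delete expired rows in batches until none remain.
WHILE @rows > 0
BEGIN
    DELETE TOP (@batch_size)
    FROM dbo.AuditLog                                    -- hypothetical table
    WHERE CreatedAt < DATEADD(DAY, -365, SYSUTCDATETIME());

    SET @rows = @@ROWCOUNT;
END;
```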

8. Monitor and Tune Performance

SQL Server provides several tools for monitoring performance, including Dynamic Management Views (DMVs) and Extended Events. Regular monitoring can help identify performance bottlenecks before they affect end-users.

  • Best Practice:
    • Regularly monitor key performance metrics like CPU usage, memory usage, disk I/O, and query execution times to ensure the system is operating within optimal thresholds.
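    • Use Extended Events (or SQL Server Profiler, which is deprecated, on older systems) to capture detailed information about query performance and diagnose slow-running queries.
    • Leverage DMVs, such as the missing index views (sys.dm_db_missing_index_details and related views), to analyze query patterns and suggest optimizations like new indexes.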
    • Continuously review and tune your queries, indexes, and database schema based on real-time performance data.
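
As one example of using DMVs for index suggestions, the sketch below ranks the optimizer’s missing index requests. The estimated_benefit expression is a commonly used heuristic, not an official metric, and the suggestions should be reviewed rather than applied blindly, since they can overlap with each other or with existing indexes.

```sql
-- Missing index requests recorded since the last service restart.
SELECT TOP (20)
    DB_NAME(d.database_id) AS database_name,
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_user_impact,
    s.avg_total_user_cost * s.avg_user_impact * (s.user_seeks + s.user_scans)
        AS estimated_benefit   -- heuristic ranking only
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY estimated_benefit DESC;
```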

9. Automating Maintenance Tasks

Manually managing routine maintenance tasks can be time-consuming and prone to error. SQL Server provides built-in automation tools, such as SQL Server Agent, to schedule and manage maintenance operations.

  • Best Practice:
    • Set up automated maintenance jobs for tasks like index rebuilding, statistics updates, transaction log backups, and integrity checks. This ensures that these critical operations are performed consistently and without manual intervention.
    • Regularly review and adjust job schedules to avoid conflicts during peak business hours.
    • Use maintenance plans in SQL Server Management Studio (SSMS) or third-party tools for more advanced scheduling and management of maintenance tasks.
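
A scheduled job can also be scripted directly with the msdb stored procedures, as in the sketch below. The job name, schedule name, start time, and database name are assumptions for illustration.

```sql
USE msdb;
GO
-- Create the job and a single T-SQL step.
EXEC dbo.sp_add_job
    @job_name = N'Nightly integrity check';            -- hypothetical job name

EXEC dbo.sp_add_jobstep
    @job_name = N'Nightly integrity check',
    @step_name = N'Run DBCC CHECKDB',
    @subsystem = N'TSQL',
    @database_name = N'master',
    @command = N'DBCC CHECKDB (N''YourDatabase'') WITH NO_INFOMSGS;';

-- Run every day at 02:00.
EXEC dbo.sp_add_schedule
    @schedule_name = N'Daily at 02:00',
    @freq_type = 4,                -- daily
    @freq_interval = 1,            -- every 1 day
    @active_start_time = 020000;   -- HHMMSS

EXEC dbo.sp_attach_schedule
    @job_name = N'Nightly integrity check',
    @schedule_name = N'Daily at 02:00';

-- Target the local server so the job actually runs.
EXEC dbo.sp_add_jobserver
    @job_name = N'Nightly integrity check',
    @server_name = N'(local)';
```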

Conclusion

Regular maintenance is essential for keeping your SQL Server database healthy and optimized for performance. Tasks like index rebuilding, updating statistics, managing transaction logs, performing integrity checks, and cleaning up obsolete data all contribute to the overall efficiency and reliability of your SQL Server environment. By automating maintenance operations, monitoring key performance metrics, and regularly tuning the system, you can ensure that your SQL Server databases continue to deliver optimal performance as your workload and data volumes grow.

Detecting CPU Pressure

There are a few different ways to detect CPU pressure, for example, Performance Monitor or SQL Profiler, but in this tip we’ll discuss the wait type SOS_SCHEDULER_YIELD and the DMV sys.dm_os_schedulers.

Some of you might have seen the wait type SOS_SCHEDULER_YIELD and wondered what in the world it is. Books Online (BOL) describes this wait type as:

Occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed.

SOS_SCHEDULER_YIELD is a fairly common wait type and often appears when there is CPU pressure. SQL Server schedules threads cooperatively: a thread that has used up its quantum (4 milliseconds) voluntarily yields the scheduler so that other threads can run, and then waits in the runnable queue for its next turn. The time spent waiting to get back onto the CPU is recorded as the SOS_SCHEDULER_YIELD wait type, so when every scheduler is busy these waits accumulate.

You can see the top wait types and their share of total wait time, accumulated since the wait statistics were last cleared or the service was restarted, by running the following query by Tim Ford and Glenn Berry.

-- Rank wait types by accumulated wait time, ignoring benign background waits
WITH Waits AS
(
    SELECT
        wait_type,
        wait_time_ms / 1000. AS wait_time_s,
        100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN -- filter out additional irrelevant waits here
        ('CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE',
         'SLEEP_TASK', 'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR',
         'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT')
)
-- Return the top waits that together account for 95% of total wait time
SELECT W1.wait_type,
    CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
    CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
    CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.rn <= W1.rn
GROUP BY W1.rn,
    W1.wait_type,
    W1.wait_time_s,
    W1.pct
HAVING SUM(W2.pct) - W1.pct < 95 -- percentage threshold
ORDER BY W1.rn;

After running this query on one of my servers, I can see that only 2.22% of my waits are SOS_SCHEDULER_YIELD, so I can rule out CPU pressure as a cause of poor performance. (These figures are cumulative since the last restart of the SQL Server service.)

If you see a high percentage of this wait type then one way of troubleshooting this is to use the DMV sys.dm_os_schedulers. BOL describes sys.dm_os_schedulers as:

Returns one row per scheduler in SQL Server where each scheduler is mapped to an individual processor. Use this view to monitor the condition of a scheduler or to identify runaway tasks.

This DMV has several columns, including the number of workers, active tasks, and the status of each scheduler, which can help troubleshoot certain problems. The most important column for this purpose is runnable_tasks_count, which measures queued tasks: it counts the workers with assigned tasks that are waiting in the runnable queue for time on the scheduler. If this column is frequently greater than 0, then CPU pressure may be present and queries may be delayed while they wait for the CPU.

To see an average of current tasks and current waiting tasks you can use the following query:

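-- Multiply by 1.0 so AVG returns a decimal rather than a truncated integer
SELECT AVG(1.0 * current_tasks_count) AS [Avg Current Task],
    AVG(1.0 * runnable_tasks_count) AS [Avg Wait Task]
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255 -- skip schedulers used internally by SQL Server
AND status = 'VISIBLE ONLINE';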
