Mongo CPU Overload Investigation and Logs#

Created: June 22, 2021 5:52 PM

Phenomenon#

In the production environment, the CPU utilization of the primary database periodically exceeds 80% (on the 5-minute average) at specific times (04:05, 04:15, 07:05, 07:15, 11:05), and also non-periodically (04:15-04:30, 11:05-11:15).

  • Because the CPU overload is periodic and short-lived rather than continuous, it is reasonable to suspect a problem with scheduled tasks.
  • Filter the relevant scheduled tasks on AWS EC2 based on the alert times.

Objective Reasons for CPU Overload#

Full Collection Scan (COLLSCAN)#

  • What: COLLSCAN is a CPU-intensive operation that scans every record in a collection and filters it against the query conditions. The COLLSCAN keyword shows up in the profile collection and in the logs; if this operation is frequent, optimize it by creating indexes on the queried fields.
  • How to locate: Check the planSummary field of the entry in the profile collection to determine whether a slow query performed a full collection scan.
  • How to address: Create appropriate indexes, as sketched below.
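
A minimal mongo shell sketch of locating and fixing this (the bill.transaction namespace and the transaction_time field are taken from the profile entry [1] at the end of this post; adapt them to your own collections):

// Recent profile entries that performed a full collection scan
db.system.profile.find(
   { ns: "bill.transaction", planSummary: "COLLSCAN" },
   { op: 1, query: 1, millis: 1, docsExamined: 1 }
).sort({ ts: -1 }).limit(5)

// Index the field used in the filter so the query can switch to an IXSCAN
db.transaction.createIndex({ transaction_time: 1 })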

Improper Indexing#

  • What: An ill-suited index can force a query to scan a large number of index entries, resulting in high CPU overhead.
  • How to locate: Check the planSummary value in system.profile or in the query plans to see which index a query used, and judge whether the index is reasonable from keysExamined.
  • How to address: Create appropriate indexes.
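
A rough sketch of spotting this in system.profile (the 10000-key threshold is arbitrary): entries that use an index but examine far more keys than they return usually indicate that the index does not fit the query.

// IXSCAN entries that scanned many index keys; compare keysExamined with nreturned
db.system.profile.find(
   { planSummary: /IXSCAN/, keysExamined: { $gt: 10000 } },
   { query: 1, planSummary: 1, keysExamined: 1, nreturned: 1, millis: 1 }
).sort({ ts: -1 })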

Sorting Large Data Sets#

  • What: When a sorting request cannot be completed through an index, MongoDB will sort the data in memory. If the data set is large, it can cause high CPU usage.
  • How to locate: When you find the SORT keyword in the system.profile collection, consider optimizing the sort with an index. The hasSortStage field in system.profile indicates whether the request included an in-memory sort stage.
  • How to address: Create indexes for sorting purposes.
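
A sketch of both steps, assuming the sort is on transaction_time (an assumption borrowed from the slow-query log [1] at the end of this post):

// Requests that needed an in-memory sort stage
db.system.profile.find(
   { hasSortStage: true },
   { query: 1, planSummary: 1, millis: 1 }
).sort({ ts: -1 })

// An index whose direction matches the sort lets MongoDB return documents already ordered
db.transaction.createIndex({ transaction_time: -1 })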

Exceeding Service Capacity#

  • If all of the above issues have been investigated and all queries and indexes are already optimized, then the service capacity simply cannot keep up with the business volume.
  • In this case, consider upgrading the configuration.

How to Scientifically Create Indexes in MongoDB#

  • Read the official documentation first: https://docs.mongodb.com/manual/indexes/
  • High selectivity: selectivity = distinct value count / total value count; the higher it is, the better the field is as an index candidate.
  • Avoid null-heavy fields: do not create indexes on fields whose values are mostly null.
  • Index fields that are frequently used in query filters and update conditions.
  • Ascending and descending order:
    • In a single-field index, ascending and descending order make no difference, because the index is sorted and MongoDB can traverse it in either direction.
    • In a compound index, the sort direction of the second field does affect query performance. For example, if the second field is the order time, a descending index lets an order update locate the latest order more quickly (see the sketch after this list).
  • Do not create indexes blindly: every index has a build cost and adds overhead to each insert.
  • This topic is too large to cover here; we will expand on it later.
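
A sketch of the compound-index point above; the orders collection and the user_id/order_time field names are made up for illustration:

// Equality match on user_id first, newest orders first via descending order_time
db.orders.createIndex({ user_id: 1, order_time: -1 })

// This query walks the index directly and needs no in-memory sort
db.orders.find({ user_id: 42 }).sort({ order_time: -1 }).limit(1)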

Monitoring and Identifying the Above Problems#

CurrentOp - Current Operation Information#

During peak periods, use db.currentOp() to check the status of the overloaded database. The meaning of the returned fields is documented at https://docs.mongodb.com/manual/reference/command/currentOp/#currentop-output-fields.
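
For example, the output can be narrowed down to long-running operations (the 3-second threshold here is only an illustration):

// Show only operations that have been running for more than 3 seconds
db.currentOp({ secs_running: { $gt: 3 } })

// If one of them is clearly the culprit, it can be terminated by its opid
// db.killOp(<opid>)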

Profiler - Performance Analysis Logs#

  • MongoDB's query performance logging (the profiler) is disabled by default. It is recommended to enable it only in the pre-production environment, because it has a slight impact on database performance.
  • Its main function is to store database commands and their execution information in the system.profile collection of each database. This collection has a fixed size; when it fills up, the oldest entries are overwritten. Currently it holds roughly the 950 most recent records (thanks to the large volume of slow reconciliation queries).
  • The profiler is essentially the log version of currentOp; the fields it records are the same as currentOp's output.
  • The profiler is configured with db.setProfilingLevel(level), which has three levels: 0 (disabled), 1 (record only operations slower than the slowms threshold, in milliseconds), 2 (record all operations).
  • You can use db.getProfilingStatus() to get the current profiler settings for the db.
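
For reference, the settings used later in the Practice section look like this in the mongo shell:

// Level 1 with a 200ms slowms threshold
db.setProfilingLevel(1, 200)

// Verify: should return something like { "was": 1, "slowms": 200, ... }
db.getProfilingStatus()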

Pay special attention to the following parameters#

secs_running

  • The running time of the operation in seconds.

microsecs_running

  • The running time of the operation in microseconds.
  • How long is considered slow?
    • By default, MongoDB treats queries that take more than 100ms as slow. In practice we have a 30s full-collection-scan query, so I set the threshold to 200ms for now; we can adjust it later based on the complexity of the business.

planSummary

  • COLLSCAN: Full collection scan
  • IXSCAN { field_name: 1 }: Index scan - shows which index is used internally.
  • IDHACK: Primary-key index query - skips the other query plans and uses the _id index directly; appears when the query filters on _id.

keysExamined

  • Used to judge whether the existing index is reasonable for this query.
  • Indicates how many index entries were examined to complete the query; the larger the value, the higher the CPU overhead.
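
Besides reading system.profile, explain() reports the same counters for a specific query. This sketch mirrors the time-range filter of the slow query in [1]; executionStats.totalKeysExamined close to nReturned means the index fits the query, while a much larger value means too many index entries are being scanned:

db.transaction.find({
   transaction_time: {
      $gte: ISODate("2018-05-20T23:00:00Z"),
      $lte: ISODate("2018-05-21T22:59:59Z")
   }
}).explain("executionStats").executionStats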

Locating Problematic Code#

Periodic#

04:05 (the alert is a 5-minute average, so the load actually starts around 04:00; 04:05 UTC = 12:05 in UTC+8)

  • sync_transaction_data
  • Task A (runs every 4 hours) - mainly reads MySQL Membership and writes to the Redis api db

04:15 (12:15)

  • Task B

07:05 (15:05)

  • sync_transaction_data (second time at 3 PM)
  • Task C (mainly operating on MySQL read replica)

11:05 (19:05)

  • sync_transaction_data (third time at 7 PM)
  • Task C

Non-periodic#

  • (04:15-04:30, 11:05-11:15) - Judging by the time, these are likely follow-on effects of sync_transaction_data.
  • But still, let's take a look at the profile.

Other Possible Causes#

If the periodic and non-periodic issues have been ruled out and the problem is still not resolved, continue researching.

Practice#

  • Enable Profile performance monitoring in the Freeze environment: set the level to 1 and the slowms threshold to 200ms.
  • Monitor the running status of asynchronous tasks using Flower.
  • Monitor CPU usage using AWS CloudWatch.
  • Execute sync_transaction_data.delay() in the Freeze Shell.
  • Open MongoDB and check the slow-query logs in the system.profile collection of the bill database.
  • Found over 900 log entries like [1] below, and it is clear that this is a full collection scan.
  • The filter uses three fields: transaction_time, channel, settlement_sign. The last two are an enumeration field with only 5 categories and a Boolean field with only 0 and 1, so their selectivity is too low. Only the first field, transaction_time, is suitable for indexing (see the sketch after this list).
  • After adding an index, CPU usage dropped from 99.9% to 3%, and the asynchronous task time decreased from 900s to 19s.
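
The fix itself boils down to a single index on the time field; a sketch of the command (exact options and naming per your own conventions):

// transaction_time is the only selective field in the filter, so index it;
// the plan then changes from COLLSCAN to IXSCAN { transaction_time: 1 }
db.getSiblingDB("bill").transaction.createIndex({ transaction_time: 1 })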

Maintenance and Prevention#

  • Enable Profile performance monitoring in the Freeze environment.
  • Regularly check and handle slow queries.

[1] Example slow-query log entry from the profile collection:

{
  "op":"query",
  "ns":"bill.transaction",
  "keysExamined":0,
  "query":{  
     "find":"transaction",
     "filter":{  
        "transaction_time":{  
           "$gte":{"$date":"2018-05-20T23:00:00.000+0000"},
           "$lte":{"$date":"2018-05-21T22:59:59.999+0000"}},
        "channel":"非再惠渠道",
        "settlement_sign":{  
           "$in":[1]}}
  },
  "docsExamined":2882636,
  "cursorExhausted":true,
  "numYield":22635,
  "nreturned":0,
  "responseLength":104,
  "protocol":"op_query",
  "millis":15131,
  "planSummary":"COLLSCAN",
  "execStats":{
     "stage":"COLLSCAN",
     "nReturned":0,
     "executionTimeMillisEstimate":14727,
     "works":2882638,
     "advanced":0,
     "needTime":2882637,
     "needYield":0,
     "saveState":22635,
     "restoreState":22635,
     "isEOF":1,
     "invalidates":0,
     "direction":"forward",
     "docsExamined":2882636
  },
  "ts":{
     "$date":"2018-05-23T12:30:42.539+0000"
  }
}