MongoDB monitoring
MongoDB performance monitor plugin for Open-Falcon
Function Supported
Tested version: Version MongoDB 2.4, 2.6 3.0, 3.2, and Percona MongoDB3.0 are supported.
Storage engine supported: MMAPv1, wiredTiger, RocksDB, PerconaFT. (indexes of some storage engines are not collected, and can be added in the code).
Chondrophone: standlone, replica sets, zone clusters
Node supported: mongo data nodes, configuration nodes, Primary/Secondary, mongos. Arbite nodes are not supported.
Monitoring data acquisition principle: 1.survival monitoring: auth included.
2.serverStatus
3.replSetGetStatus
4.oplog.rs
5.mongos
Collect and report per minute by cron, and collecting makes no difference to MongoDB theory.
Environmental requirements:
Operating system: Linux
Python 2.6
PyYAML > 3.10
python-requests > 0.11
Mongomo deploy:
1.Decompress to directory /path/to/mongomon.
2.Deploy the MongoDB multi-case (mongod, configuration node, mongos) information of present server, /path/to/mongomon/conf/mongomon.conf. Record an instance per line: port, user name, password {port: 27017, user: "",password: ""}
3.Deploy crontab, and modify the mongomon installation path of file mongomon/conf/mongomon_cron to cp mongomon_cron /etc/cron.d/.
4.Check the MongoDB metric in the dashboard of open-falcon.
5.The default endpoint is hostname.
MongoDB falcon screen:
picture
MongoDB index collected:
| Counters | Type | Notes | 
|---|---|---|
| mongo_local_alive | GAUGE | Mongodb survival local monitoring, if unlocked Auth,Successful connection authentication is required. | 
| asserts_msg | COUNTER | Message asserts quantity /second | 
| asserts_regular | COUNTER | Regular asserts quantity/second | 
| asserts_rollovers | COUNTER | The times of counter roll over /second, and the counter will be reset every 2^30 asserts | 
| asserts_user | COUNTER | User asserts quantity/second | 
| asserts_warning | COUNTER | Warning asserts quantity/second | 
| page_faults | COUNTER | Page faults times/second | 
| connections_available | GAUGE | Unused available connection quantity | 
| connections_current | GAUGE | Connected connection quantity of all current clients | 
| connections_used_percent | GAUGE | Used connection quantity percent | 
| connections_totalCreated | COUNTER | New created connection quantity/second | 
| globalLock_currentQueue_total | GAUGE | Operating quantity waits to be locked in current queue | 
| globalLock_currentQueue_readers | GAUGE | Operating quantity waits to be read lock in current queue | 
| globalLock_currentQueue_writers | GAUGE | Operating quantity waits to write lock in current queue | 
| locks_Global_acquireCount_ISlock | COUNTER | Instance level intent shared lock acquisition times | 
| locks_Global_acquireCount_IXlock | COUNTER | Instance level intent exclusive shared lock acquisition times | 
| locks_Global_acquireCount_Slock | COUNTER | Instance level shared lock acquisition times | 
| locks_Global_acquireCount_Xlock | COUNTER | Instance level exclusive lock acquisition times | 
| locks_Global_acquireWaitCount_ISlock | COUNTER | Instance level intent shared lock waiting times | 
| locks_Global_acquireWaitCount_IXlock | COUNTER | Instance level intent exclusive lock waiting times | 
| locks_Global_timeAcquiringMicros_ISlock | COUNTER | Instance level shared lock acquisition time-consuming Unit: us | 
| locks_Global_timeAcquiringMicros_IXlock | COUNTER | Instance level exclusive acquisition time-consuming Unit: us | 
| locks_Database_acquireCount_ISlock | COUNTER | Database level intent shared lock acquisition times | 
| locks_Database_acquireCount_IXlock | COUNTER | Database level intent exclusive lock acquisition times | 
| locks_Database_acquireCount_Slock | COUNTER | Database level shared lock acquisition times | 
| locks_Database_acquireCount_Xlock | COUNTER | Database level exclusive lock acquisition times | 
| locks_Collection_acquireCount_ISlock | COUNTER | Set level intent shared lock acquisition times | 
| locks_Collection_acquireCount_IXlock | COUNTER | Set level intent exclusive lock acquisition times | 
| locks_Collection_acquireCount_Xlock | COUNTER | Set level exclusive lock acquisition times | 
| opcounters_command | COUNTER | All commands executed by database/second | 
| opcounters_insert | COUNTER | Insert operation times executed by database/second | 
| opcounters_delete | COUNTER | Delete operation times executed by database/second | 
| opcounters_update | COUNTER | Update operation times executed by database/second | 
| opcounters_query | COUNTER | Query operation times executed by database/second | 
| opcounters_getmore | COUNTER | Getmore operation times executed by database/second | 
| opcountersRepl_command | COUNTER | All command times copied and executed by database/second | 
| opcountersRepl_insert | COUNTER | Insert command times copied and executed by database/second | 
| opcountersRepl_delete | COUNTER | Delete command times copied and executed by database/second | 
| opcountersRepl_update | COUNTER | Update command times copied and executed by database/second | 
| opcountersRepl_query | COUNTER | Query command times copied and executed by database/second | 
| opcountersRepl_getmore | COUNTER | Getmore command times copied and executed by database/second | 
| network_bytesIn | COUNTER | Network transmission bytes received by database/second | 
| network_bytesOut | COUNTER | Network transmission bytes sent by database/second | 
| network_numRequests | COUNTER | Request times received by database/second | 
| mem_virtual | GAUGE | Virtual memory used by database process | 
| mem_resident | GAUGE | Physical memory used by database process | 
| mem_mapped | GAUGE | Mapped memory, only used formmapv1 storage engine | 
| mem_bits | GAUGE | 64 or 32bit | 
| mem_mappedWithJournal | GAUGE | Map memory consumed by journal, only used formmapv1 storage engine | 
| backgroundFlushing_flushes | COUNTER | Refresh writes times to disk by database/second | 
| backgroundFlushing_average_ms | GAUGE | Average time-consuming of refresh writes to disk by database,Unit ms | 
| backgroundFlushing_last_ms | COUNTER | Current latest time-consuming of refresh writes to disk by database,Unit ms | 
| backgroundFlushing_total_ms | GAUGE | Total time-consuming of refresh writes to disk by database,Unit ms | 
| cursor_open_total | GAUGE | Total cursor quantity maintained for clients by current database | 
| cursor_timedOut | COUNTER | Cursor quantity of database timout/second | 
| cursor_open_noTimeout | GAUGE | Cursor quantity of setting dbquery.Option.notimeout | 
| cursor_open_pinned | GAUGE | Cursor quantity of opened pinned | 
| repl_health | GAUGE | Copied healthy status | 
| repl_myState | GAUGE | Duplicate sets status of current node | 
| repl_oplog_window | GAUGE | Size of oplog window | 
| repl_optime | GAUGE | Time of last execution | 
| replication_lag_percent | GAUGE | Delayed percent(lag/oplog_window) | 
| repl_lag | GAUGE | Secondary copy delay,Unit s | 
| shards_size | GAUGE | Zone number of database cluster; config.shards.count | 
| shards_mongosSize | GAUGE | Mongos node number of database cluster; config.mongos.count | 
| shards_chunkSize | GAUGE | Chunksize size setting of database cluster,acquired in config.settings | 
| shards_activeWindow | GAUGE | Whether the time window is setup to the data balancer of database cluster,1/0 | 
| shards_activeWindow_start | GAUGE | Start time of time window of the data balancer of database cluster ,format 23.30 indicates 23:30 | 
| shards_activeWindow_stop | GAUGE | End time of time window of the data balancer of database cluster ,format 23.30 indicates 23:30 | 
| shards_BalancerState | GAUGE | The status of data balancer of database cluster,Whether it is open. | 
| shards_isBalancerRunning | GAUGE | Whether the data balancer of database cluster is taking block migration | 
| wt_cache_used_total_bytes | GAUGE | The total bytes of wiredtiger cache | 
| wt_cache_dirty_bytes | GAUGE | The bytes of "dirty" data in wiredtiger cache | 
| wt_cache_readinto_bytes | COUNTER | The bytes of database writing Into wiredtiger cache/second | 
| wt_cache_writtenfrom_bytes | COUNTER | The bytes of database writing into the disk from wiredtiger cache/second | 
| wt_concurrentTransactions_write | GAUGE | Write tickets available to the wiredtiger storage engine | 
| wt_concurrentTransactions_read | GAUGE | Read tickets available to the wiredtiger storage engine | 
| wt_bm_bytes_read | COUNTER | The bytes of block-manager read/second | 
| wt_bm_bytes_written | COUNTER | The bytes of block-manager write/second | 
| wt_bm_blocks_read | COUNTER | The number of block-manager read/second | 
| wt_bm_blocks_written | COUNTER | The number of block-manager write/second | 
| rocksdb_num_immutable_mem_table | ||
| rocksdb_mem_table_flush_pending | ||
| rocksdb_compaction_pending | ||
| rocksdb_background_errors | ||
| rocksdb_num_entries_active_mem_table | ||
| rocksdb_num_entries_imm_mem_tables | ||
| rocksdb_num_snapshots | ||
| rocksdb_oldest_snapshot_time | ||
| rocksdb_num_live_versions | ||
| rocksdb_total_live_recovery_units | ||
| PerconaFT_cachetable_size_current | ||
| PerconaFT_cachetable_size_limit | ||
| PerconaFT_cachetable_size_writing | ||
| PerconaFT_checkpoint_count | ||
| PerconaFT_checkpoint_time | ||
| PerconaFT_checkpoint_write_leaf_bytes_compressed | ||
| PerconaFT_checkpoint_write_leaf_bytes_uncompressed | ||
| PerconaFT_checkpoint_write_leaf_count | ||
| PerconaFT_checkpoint_write_leaf_time | ||
| PerconaFT_checkpoint_write_nonleaf_bytes_compressed | ||
| PerconaFT_checkpoint_write_nonleaf_bytes_uncompressed | ||
| PerconaFT_checkpoint_write_nonleaf_count | ||
| PerconaFT_checkpoint_write_nonleaf_time | ||
| PerconaFT_compressionRatio_leaf | ||
| PerconaFT_compressionRatio_nonleaf | ||
| PerconaFT_compressionRatio_overall | ||
| PerconaFT_fsync_count | ||
| PerconaFT_fsync_time | ||
| PerconaFT_log_bytes | ||
| PerconaFT_log_count | ||
| PerconaFT_log_time | ||
| PerconaFT_serializeTime_leaf_compress | ||
| PerconaFT_serializeTime_leaf_decompress | ||
| PerconaFT_serializeTime_leaf_deserialize | ||
| PerconaFT_serializeTime_leaf_serialize | ||
| PerconaFT_serializeTime_nonleaf_compress | ||
| PerconaFT_serializeTime_nonleaf_decompress | ||
| PerconaFT_serializeTime_nonleaf_deserialize | ||
| PerconaFT_serializeTime_nonleaf_serialize | 
Monitoring alarm items recommended to set:
Instruction: system level monitoring items are provided by falcon agent; monitoring triggering conditions are self-adjusted by the scenes.
| alarm command | 
|---|
| load.1min>10 | 
| cpu.idle<10 | 
| df.bytes.free.percent<30 | 
| df.bytes.free.percent<10 | 
| mem.memfree.percent<20 | 
| mem.memfree.percent<10 | 
| mem.memfree.percent<5 | 
| mem.swapfree.percent<50 | 
| mem.memused.percent>=50 | 
| mem.memused.percent>=10 | 
| net.if.out.bytes>94371840 | 
| net.if.in.bytes>94371840 | 
| disk.io.util>90 | 
| mongo_local_alive=0 | 
| page_faults>100 | 
| connections_current>5000 | 
| connections_used_percent>60 | 
| connections_used_percent>80 | 
| connections_totalCreated>1000 | 
| globalLock_currentQueue_total>10 | 
| globalLock_currentQueue_readers>10 | 
| globalLock_currentQueue_writers>10 | 
| opcounters_command | 
| opcounters_insert | 
| opcounters_delete | 
| opcounters_update | 
| opcounters_query | 
| opcounters_getmore | 
| opcountersRepl_command | 
| opcountersRepl_insert | 
| opcountersRepl_delete | 
| opcountersRepl_update | 
| opcountersRepl_query | 
| opcountersRepl_getmore | 
| network_bytesIn | 
| network_bytesOut | 
| network_numRequests | 
| repl_health=0 | 
| repl_myState not 1/2/7 | 
| repl_oplog_window<168 | 
| repl_oplog_window<48 | 
| replication_lag_percent>50 | 
| repl_lag>60 | 
| repl_lag>300 | 
| shards_mongosSize | 
Contributors
- Zhuo Rulin: mail:[email protected]; weibo: http://weibo.com/u/2540962412