# Operations Manual

## Version History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2024-01-20 | Operations team | Initial version |

## 1. Operations Overview

### 1.1 Objectives

Keep the livestock farming management platform running reliably around the clock (7×24), delivering a highly available, high-performance, and secure service.

### 1.2 Responsibilities

- **System monitoring**: monitor system health in real time
- **Incident handling**: respond to and resolve system failures quickly
- **Performance optimization**: continuously tune system performance
- **Security management**: maintain the system's security defenses
- **Backup and recovery**: keep data safe and recoverable
- **Capacity planning**: forecast and plan capacity needs

### 1.3 Service Level Agreement (SLA)

| Metric | Target | Notes |
|--------|--------|-------|
| System availability | 99.9% | At most 8.76 hours of downtime per year |
| Response time | < 500ms | Average API response time |
| Recovery time | < 30 minutes | From failure onset to service restoration |
| Data backup | Daily | Backups retained for 30 days |
| Security incident response | < 15 minutes | Time to respond to a security incident |

## 2. System Architecture Monitoring

### 2.1 Monitoring Architecture

```mermaid
graph TB
    subgraph "Metrics collection"
        A[Node Exporter] --> P[Prometheus]
        B[MySQL Exporter] --> P
        C[Redis Exporter] --> P
        D[Nginx Exporter] --> P
        E[Application Metrics] --> P
    end

    subgraph "Alerting"
        P --> AM[AlertManager]
        AM --> DT[DingTalk notification]
        AM --> WX[WeCom]
        AM --> SMS[SMS alert]
        AM --> EMAIL[Email alert]
    end

    subgraph "Visualization"
        P --> G[Grafana]
        G --> DB[Dashboard]
    end

    subgraph "Logging"
        F[Filebeat] --> L[Logstash]
        L --> ES[Elasticsearch]
        ES --> K[Kibana]
    end
```

### 2.2 Monitoring Metrics

#### 2.2.1 Infrastructure Monitoring

```yaml
# prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%"

      # High disk usage
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage"
          description: "Instance {{ $labels.instance }} disk usage is {{ $value }}%"

      # High disk I/O
      - alert: HighDiskIO
        expr: irate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O utilization"
          description: "Instance {{ $labels.instance }} disk I/O utilization is {{ $value }}%"
```

#### 2.2.2 Application Service Monitoring

```yaml
# prometheus/rules/application.yml
groups:
  - name: application
    rules:
      # Slow API responses
      - alert: HighAPIResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API response time too high"
          description: "API 95th percentile response time is {{ $value }} seconds"

      # High API error rate
      # Both sides are aggregated with sum(); without it the division only
      # matches series that share the same status label and always yields 1.
      - alert: HighAPIErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate too high"
          description: "API error rate is {{ $value | humanizePercentage }}"

      # Service instance down
      - alert: ServiceInstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service instance down"
          description: "Instance {{ $labels.instance }} is down"

      # High database connection usage
      - alert: HighDatabaseConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection count too high"
          description: "Database connection usage is {{ $value | humanizePercentage }}"
```

#### 2.2.3 Business Metrics Monitoring

```yaml
# prometheus/rules/business.yml
groups:
  - name: business
    rules:
      # Abnormally low user registrations
      # rate() is a per-second average, so 0.1 corresponds to 360 registrations per hour.
      - alert: LowUserRegistration
        expr: rate(user_registrations_total[1h]) < 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal user registration volume"
          description: "User registration rate over the past hour is {{ $value }} per second"

      # High transaction failure rate
      # Aggregated with sum() so that failed transactions are divided by all transactions.
      - alert: HighTransactionFailureRate
        expr: sum(rate(transactions_total{status="failed"}[5m])) / sum(rate(transactions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Transaction failure rate too high"
          description: "Transaction failure rate is {{ $value | humanizePercentage }}"

      # Payment failures
      # This expression is an absolute rate (failures per second), not a ratio.
      - alert: PaymentAbnormal
        expr: rate(payments_total{status="failed"}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Payment failures detected"
          description: "Failed payments per second: {{ $value }}"
```

### 2.3 Grafana Dashboard Configuration

#### 2.3.1 System Overview Dashboard

```json
{
  "dashboard": {
    "title": "System overview",
    "panels": [
      {
        "title": "System load",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(node_load1)",
            "legendFormat": "1-minute load"
          }
        ]
      },
      {
        "title": "CPU usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Network traffic",
        "type": "graph",
        "targets": [
          {
            "expr": "irate(node_network_receive_bytes_total[5m])",
            "legendFormat": "Receive - {{ instance }}"
          },
          {
            "expr": "irate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "Transmit - {{ instance }}"
          }
        ]
      }
    ]
  }
}
```

#### 2.3.2 Application Performance Dashboard

```json
{
  "dashboard": {
    "title": "Application performance",
    "panels": [
      {
        "title": "API request rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{ method }} {{ path }}"
          }
        ]
      },
      {
        "title": "API response time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "99th percentile"
          }
        ]
      },
      {
        "title": "Error rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"4..\"}[5m])",
            "legendFormat": "4xx errors"
          },
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
```

## 3. Routine Operations

### 3.1 Daily Checklist

```bash
#!/bin/bash
# daily-check.sh - daily health check script

LOG_FILE="/var/log/daily-check.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')

echo "=== Daily check started $DATE ===" | tee -a "$LOG_FILE"

# 1. Check system resources
echo "1. System resources" | tee -a "$LOG_FILE"
echo "CPU load: $(uptime | awk -F'load average:' '{print $2}')" | tee -a "$LOG_FILE"
echo "Memory usage: $(free -h | grep Mem | awk '{print $3"/"$2}')" | tee -a "$LOG_FILE"
echo "Disk usage: $(df -h / | tail -1 | awk '{print $5}')" | tee -a "$LOG_FILE"
# 2. Check service status
echo "2. Service status" | tee -a "$LOG_FILE"
services=("mysql-master" "redis-master" "mongodb" "backend-api-1" "backend-api-2" "nginx")
for service in "${services[@]}"; do
    if docker ps --format "{{.Names}}" | grep -q "^${service}$"; then
        echo "✅ $service is running" | tee -a "$LOG_FILE"
    else
        echo "❌ $service is not running" | tee -a "$LOG_FILE"
    fi
done

# 3. Check network connections
# The trailing space keeps ":80" from also matching ports such as 8080.
echo "3. Network connections" | tee -a "$LOG_FILE"
echo "HTTP connections: $(netstat -an | grep ':80 ' | grep -c ESTABLISHED)" | tee -a "$LOG_FILE"
echo "HTTPS connections: $(netstat -an | grep ':443 ' | grep -c ESTABLISHED)" | tee -a "$LOG_FILE"

# 4. Check database status (MYSQL_ROOT_PASSWORD must be set in the environment)
echo "4. Database status" | tee -a "$LOG_FILE"
mysql_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';" | tail -1 | awk '{print $2}')
echo "MySQL connections: $mysql_connections" | tee -a "$LOG_FILE"

# redis-cli output ends lines with \r, which is stripped here
redis_connections=$(docker exec redis-master redis-cli info clients | grep connected_clients | cut -d: -f2 | tr -d '\r')
echo "Redis connections: $redis_connections" | tee -a "$LOG_FILE"

# 5. Check logs for errors
echo "5. Log errors" | tee -a "$LOG_FILE"
error_count=$(docker logs backend-api-1 --since="24h" 2>&1 | grep -ic error)
echo "Backend error log lines: $error_count" | tee -a "$LOG_FILE"

# 6. Check backup status
echo "6. Backup status" | tee -a "$LOG_FILE"
backup_today=$(ls /backup/ | grep -c "$(date +%Y%m%d)")
echo "Backup files created today: $backup_today" | tee -a "$LOG_FILE"

echo "=== Daily check finished ===" | tee -a "$LOG_FILE"
```

### 3.2 Performance Optimization

#### 3.2.1 Database Performance Optimization

```sql
-- MySQL performance queries

-- 1. Review slow queries
--    (the table is only populated when slow_query_log is ON and log_output includes TABLE)
SELECT * FROM mysql.slow_log
WHERE start_time > DATE_SUB(NOW(), INTERVAL 1 DAY);

-- 2. Look for sessions waiting on table locks
SHOW PROCESSLIST;

-- 3. Review index usage
SELECT
    table_schema,
    table_name,
    index_name,
    cardinality,
    sub_part,
    packed,
    nullable,
    index_type
FROM information_schema.statistics
WHERE table_schema = 'xlxumu_db';

-- 4. Review table sizes
SELECT
    table_name,
    ROUND(((data_length + index_length) / 1024 / 1024), 2) AS 'Size (MB)'
FROM information_schema.tables
WHERE table_schema = 'xlxumu_db'
ORDER BY (data_length + index_length) DESC;
```

```bash
#!/bin/bash
# mysql-optimize.sh - MySQL optimization script
# 1. Analyze tables
echo "Analyzing tables..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
ANALYZE TABLE users, farms, animals, transactions;
"

# 2. Optimize tables
echo "Optimizing tables..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
OPTIMIZE TABLE users, farms, animals, transactions;
"

# 3. Check tables
echo "Checking table integrity..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
CHECK TABLE users, farms, animals, transactions;
"

echo "MySQL optimization finished"
```

#### 3.2.2 Redis Performance Optimization

```bash
#!/bin/bash
# redis-optimize.sh - Redis optimization script

# 1. Check Redis memory usage
echo "Redis memory usage:"
docker exec redis-master redis-cli info memory

# 2. Check the slow log
echo "Redis slow log:"
docker exec redis-master redis-cli slowlog get 10

# 3. Inspect key TTLs
# This only reports the remaining TTL of each key; Redis removes expired
# keys on its own. One docker exec per key is slow on large keyspaces.
echo "Inspecting key TTLs..."
docker exec redis-master redis-cli --scan --pattern "*" | xargs -I {} docker exec redis-master redis-cli ttl {}

# 4. Find big keys
echo "Looking for big keys..."
docker exec redis-master redis-cli --bigkeys

echo "Redis optimization finished"
```

### 3.3 Log Management

#### 3.3.1 Log Rotation Configuration

```bash
# /etc/logrotate.d/xlxumu
/var/log/xlxumu/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 root root
    postrotate
        docker kill -s USR1 $(docker ps -q --filter name=backend-api)
    endscript
}

/var/log/nginx/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 644 nginx nginx
    postrotate
        docker exec nginx-lb nginx -s reopen
    endscript
}
```

#### 3.3.2 Log Analysis Script

```bash
#!/bin/bash
# log-analysis.sh - log analysis script

LOG_DIR="/var/log/xlxumu"
ACCESS_LOG="/var/log/nginx/access.log"
REPORT_FILE="/tmp/log-report-$(date +%Y%m%d).txt"
TODAY=$(date +%d/%b/%Y)

echo "=== Log analysis report $(date) ===" > "$REPORT_FILE"

# 1. Error log count
echo "1. Error log count" >> "$REPORT_FILE"
grep -i error "$LOG_DIR"/*.log | wc -l >> "$REPORT_FILE"

# 2. Request count
echo "2. Requests today" >> "$REPORT_FILE"
grep -c "$TODAY" "$ACCESS_LOG" >> "$REPORT_FILE"

# 3. Status code breakdown
# Filter by date before extracting a field; once awk has printed only one
# column, the date is no longer available to grep.
echo "3. HTTP status codes" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '{print $9}' | sort | uniq -c | sort -nr >> "$REPORT_FILE"

# 4. Slow requests (assumes the request time is the last field of the log format)
echo "4. Slow requests (>1s)" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '$NF > 1.0' | wc -l >> "$REPORT_FILE"

# 5. Most requested APIs
echo "5. Top APIs" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '{print $7}' | grep "/api/" | sort | uniq -c | sort -nr | head -10 >> "$REPORT_FILE"

echo "Log analysis finished, report saved to: $REPORT_FILE"
```

## 4. Backup and Recovery

### 4.1 Automated Backup Strategy

```bash
#!/bin/bash
# backup-system.sh - system backup script

BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/xlxumu_$DATE"
RETENTION_DAYS=30

# Create the backup directory
mkdir -p "$BACKUP_PATH"

echo "Starting system backup: $DATE"

# 1. Back up the MySQL databases
echo "Backing up MySQL..."
docker exec mysql-master mysqldump -u root -p"${MYSQL_ROOT_PASSWORD}" \
    --single-transaction \
    --routines \
    --triggers \
    --all-databases > "$BACKUP_PATH/mysql_backup.sql"

if [ $? -eq 0 ]; then
    echo "✅ MySQL backup succeeded"
else
    echo "❌ MySQL backup failed"
    exit 1
fi

# 2. Back up Redis data
# redis-cli runs inside the container, so dump to a container path first
# and then copy the file out to the host.
echo "Backing up Redis..."
docker exec redis-master redis-cli --rdb /tmp/redis_backup.rdb
docker cp redis-master:/tmp/redis_backup.rdb "$BACKUP_PATH/redis_backup.rdb"

if [ $? -eq 0 ]; then
    echo "✅ Redis backup succeeded"
else
    echo "❌ Redis backup failed"
fi

# 3. Back up MongoDB data
# mongodump also writes inside the container; copy the dump directory out afterwards.
echo "Backing up MongoDB..."
docker exec mongodb mongodump --out /tmp/mongodb_backup
docker cp mongodb:/tmp/mongodb_backup "$BACKUP_PATH/mongodb_backup"

if [ $? -eq 0 ]; then
    echo "✅ MongoDB backup succeeded"
else
    echo "❌ MongoDB backup failed"
fi

# 4. Back up application configuration (run from the project directory)
echo "Backing up application configuration..."
cp -r ./config "$BACKUP_PATH/"
cp -r ./nginx "$BACKUP_PATH/"
cp .env.production "$BACKUP_PATH/"

# 5. Back up uploaded files
echo "Backing up uploaded files..."
if [ -d "./uploads" ]; then
    tar -czf "$BACKUP_PATH/uploads.tar.gz" ./uploads
fi

# 6. Compress the backup
echo "Compressing backup..."
cd "$BACKUP_DIR" || exit 1
tar -czf "xlxumu_$DATE.tar.gz" "xlxumu_$DATE/"
rm -rf "xlxumu_$DATE/"

# 7. Remove expired backups
echo "Removing expired backups..."
find "$BACKUP_DIR" -name "xlxumu_*.tar.gz" -mtime +$RETENTION_DAYS -delete

# 8. Upload to cloud storage (optional)
echo "Uploading backup to cloud storage..."
# aws s3 cp xlxumu_$DATE.tar.gz s3://your-backup-bucket/

echo "System backup finished: xlxumu_$DATE.tar.gz"

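# 9. Send backup notification
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"msgtype\": \"text\",\"text\": {\"content\": \"System backup finished: xlxumu_$DATE.tar.gz\"}}"
```

### 4.2 Data Recovery Procedure

```bash
#!/bin/bash
# restore-system.sh - system restore script
# Run from the project directory and pass the full path of the archive.

BACKUP_FILE=$1
BACKUP_DIR="/backup"

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file>"
    echo "Available backups:"
    ls -la "$BACKUP_DIR"/xlxumu_*.tar.gz
    exit 1
fi

echo "Starting system restore: $BACKUP_FILE"

# 1. Extract the backup archive
# tar -C is used instead of cd so that docker-compose and the relative
# paths below still resolve against the project directory.
echo "Extracting backup archive..."
tar -xzf "$BACKUP_FILE" -C "$BACKUP_DIR"
BACKUP_NAME=$(basename "$BACKUP_FILE" .tar.gz)
RESTORE_PATH="$BACKUP_DIR/$BACKUP_NAME"

# 2. Stop services
echo "Stopping services..."
docker-compose down

# 3. Restore the MySQL databases
echo "Restoring MySQL..."
docker-compose -f docker-compose.mysql.yml up -d mysql-master
sleep 30
docker exec -i mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" < "$RESTORE_PATH/mysql_backup.sql"

if [ $? -eq 0 ]; then
    echo "✅ MySQL restore succeeded"
else
    echo "❌ MySQL restore failed"
    exit 1
fi

# Note: docker-compose down removes containers, so the redis-master and
# mongodb containers must exist again before steps 4 and 5. How they are
# started depends on the compose files, which this document does not show.

# 4. Restore Redis data
# Stop Redis before replacing dump.rdb; a running instance may overwrite
# the file with its own data when it shuts down.
echo "Restoring Redis..."
docker stop redis-master
docker cp "$RESTORE_PATH/redis_backup.rdb" redis-master:/data/dump.rdb
docker start redis-master

# 5. Restore MongoDB data
# Copy the dump into the container first; mongorestore cannot see host paths.
echo "Restoring MongoDB..."
docker cp "$RESTORE_PATH/mongodb_backup" mongodb:/tmp/mongodb_backup
docker exec mongodb mongorestore /tmp/mongodb_backup

# 6. Restore application configuration
echo "Restoring application configuration..."
cp -r "$RESTORE_PATH/config" ./
cp -r "$RESTORE_PATH/nginx" ./
cp "$RESTORE_PATH/.env.production" ./

# 7. Restore uploaded files
echo "Restoring uploaded files..."
if [ -f "$RESTORE_PATH/uploads.tar.gz" ]; then
    tar -xzf "$RESTORE_PATH/uploads.tar.gz"
fi

# 8. Restart services
echo "Restarting services..."
docker-compose up -d

# 9. Health check
echo "Running health check..."
sleep 60
./scripts/health-check.sh

echo "System restore finished"
```

## 5. Incident Handling

### 5.1 Incident Response Flow

```mermaid
graph TD
    A[Failure occurs] --> B[Monitoring raises alert]
    B --> C[Operations staff receive alert]
    C --> D[Initial fault localization]
    D --> E{Severity assessment}

    E -->|P0 critical| F[Immediate response<br/>within 15 minutes]
    E -->|P1 major| G[Fast response<br/>within 30 minutes]
    E -->|P2 moderate| H[Standard response<br/>within 2 hours]
    E -->|P3 minor| I[Scheduled response<br/>within 24 hours]

    F --> J[Incident handling]
    G --> J
    H --> J
    I --> J

    J --> K[Service restored]
    K --> L[Root cause analysis]
    L --> M[Improvement actions]
    M --> N[Documentation update]
```

### 5.2 Common Incident Runbooks

#### 5.2.1 Service Unresponsive

```bash
#!/bin/bash
# fix-service-unresponsive.sh

SERVICE_NAME=$1

if [ -z "$SERVICE_NAME" ]; then
    echo "Usage: $0 <service_name>"
    exit 1
fi

echo "Handling unresponsive service: $SERVICE_NAME"

# 1. Check container status
echo "1. Container status"
docker ps -a | grep "$SERVICE_NAME"

# 2. Check container logs
echo "2. Container logs"
docker logs --tail 100 "$SERVICE_NAME"

# 3. Check resource usage
echo "3. Resource usage"
docker stats --no-stream "$SERVICE_NAME"

# 4. Try restarting the service
echo "4. Restarting service"
docker restart "$SERVICE_NAME"

# 5. Wait for the service to start
echo "5. Waiting for service to start"
sleep 30

# 6. Health check
echo "6. Running health check"
case $SERVICE_NAME in
    "backend-api-1")
        curl -f http://localhost:3001/health
        ;;
    "backend-api-2")
        curl -f http://localhost:3002/health
        ;;
    "nginx")
        curl -f http://localhost:80/health
        ;;
    *)
        # Without this branch an unknown service would be reported as healthy
        echo "No health check defined for $SERVICE_NAME"
        false
        ;;
esac

if [ $? -eq 0 ]; then
    echo "✅ Service is back to normal"
else
    echo "❌ Service is still unhealthy, further investigation needed"
fi
```

#### 5.2.2 Database Connection Errors

```bash
#!/bin/bash
# fix-database-connection.sh

echo "Handling database connection errors"

# 1. Check MySQL container status
echo "1. MySQL container status"
docker ps | grep mysql-master

# 2. Check MySQL processes
echo "2. MySQL processes"
docker exec mysql-master ps aux | grep mysql

# 3. Check MySQL connection count
echo "3. MySQL connection count"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';"

# 4. Check running queries for long-running statements
echo "4. MySQL running queries"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW PROCESSLIST;"

# 5. Check the MySQL error log (docker logs may emit it on stderr)
echo "5. MySQL error log"
docker logs --tail 50 mysql-master 2>&1 | grep -i error

# 6. Restart MySQL if necessary
read -p "Restart the MySQL service? (y/n): " restart_mysql
if [ "$restart_mysql" = "y" ]; then
    echo "Restarting MySQL..."
    docker restart mysql-master
    sleep 30

    # Verify the service
    docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SELECT 1;"
    if [ $? -eq 0 ]; then
        echo "✅ MySQL is back to normal"
    else
        echo "❌ MySQL is still unhealthy"
    fi
fi
```

#### 5.2.3 Insufficient Disk Space

```bash
#!/bin/bash
# fix-disk-space.sh
# Caution: several steps below delete data irreversibly. Review each one
# before running this script on a production host.

echo "Handling insufficient disk space"

# 1. Check disk usage
echo "1. Disk usage"
df -h

# 2. Find large files
echo "2. Large files (>100MB)"
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | head -20

# 3. Clean up Docker resources
# volume prune removes every volume not attached to a container, and
# image prune -a removes every image without a container using it.
echo "3. Cleaning up Docker resources"
docker system prune -f
docker volume prune -f
docker image prune -a -f

# 4. Truncate old log files
echo "4. Cleaning up log files"
find /var/log -name "*.log" -type f -mtime +7 -exec truncate -s 0 {} \;

# 5. Remove temporary files (running processes may still be using some of them)
echo "5. Cleaning up temporary files"
rm -rf /tmp/*
rm -rf /var/tmp/*

# 6. Remove old backups
echo "6. Cleaning up old backups"
find /backup -name "*.tar.gz" -mtime +30 -delete

# 7. Check disk usage again
echo "7. Disk usage after cleanup"
df -h

echo "Disk cleanup finished"
```

### 5.3 Preventive Measures

#### 5.3.1 Preventive Maintenance Script

```bash
#!/bin/bash
# preventive-maintenance.sh

echo "Starting preventive maintenance"

# 1. System updates
echo "1. Checking for system updates"
yum check-update

# 2. Drop system caches (sync first so dirty pages are written out)
echo "2. Dropping system caches"
sync
echo 3 > /proc/sys/vm/drop_caches

# 3. Check system services
echo "3. Checking system services"
systemctl status docker
systemctl status firewalld

# 4. Check listening ports
echo "4. Checking network listeners"
netstat -tuln | grep -E ":(80|443|3000|3306|6379|27017) "

# 5. Check SSL certificate validity
echo "5. Checking SSL certificate validity"
openssl x509 -in /etc/letsencrypt/live/www.xlxumu.com/cert.pem -noout -dates

# 6. Database maintenance
echo "6. Database maintenance"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "OPTIMIZE TABLE xlxumu_db.users, xlxumu_db.farms, xlxumu_db.animals;"

# 7. Performance baseline (curl-format.txt must exist in the current directory)
echo "7. Performance baseline"
curl -w "@curl-format.txt" -o /dev/null -s http://localhost/api/health

echo "Preventive maintenance finished"
```

## 6. Security Operations

### 6.1 Security Checklist

```bash
#!/bin/bash
# security-check.sh

echo "=== Security check started ==="

# 1. Check system users
echo "1. System users"
awk -F: '$3 >= 1000 {print $1}' /etc/passwd

# 2. Check SSH configuration
echo "2. SSH configuration"
grep -E "(PermitRootLogin|PasswordAuthentication|Port)" /etc/ssh/sshd_config

# 3. Check firewall status
echo "3. Firewall status"
firewall-cmd --list-all

# 4. Check open ports
echo "4. Open ports"
netstat -tuln

# 5. Check failed login attempts
echo "5. Failed login attempts"
grep "Failed password" /var/log/secure | tail -10

# 6. Check permissions of critical files
echo "6. Critical file permissions"
ls -la /etc/passwd /etc/shadow /etc/ssh/sshd_config

# 7. Check Docker security
echo "7. Docker security"
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    -v /usr/local/bin/docker:/usr/local/bin/docker \
    docker/docker-bench-security

# 8. Check the SSL certificate
echo "8. SSL certificate"
echo | openssl s_client -servername www.xlxumu.com -connect www.xlxumu.com:443 2>/dev/null | openssl x509 -noout -dates

echo "=== Security check finished ==="
```

### 6.2 Security Hardening

```bash
#!/bin/bash
# security-hardening.sh

echo "Starting security hardening"

# 1. Disable unnecessary services
echo "1. Disabling unnecessary services"
systemctl disable telnet
systemctl disable rsh
systemctl disable rlogin

# 2. Harden SSH configuration
# These patterns only match the commented-out defaults; adjust them if the
# directives are already set explicitly.
echo "2. Hardening SSH configuration"
sed -i 's/#PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config

# 3. Configure firewall rules
# The new port is opened before sshd restarts, to avoid a lockout. With
# SELinux enforcing, port 2222 must also be allowed for sshd beforehand.
echo "3. Configuring firewall rules"
firewall-cmd --permanent --add-port=2222/tcp
firewall-cmd --permanent --remove-service=ssh
firewall-cmd --reload
sshd -t && systemctl restart sshd

# 4. Set file permissions
echo "4. Setting file permissions"
chmod 600 /etc/ssh/sshd_config
chmod 644 /etc/passwd
chmod 000 /etc/shadow

# 5. Configure log auditing
echo "5. Configuring log auditing"
echo "auth.* /var/log/auth.log" >> /etc/rsyslog.conf
systemctl restart rsyslog

# 6. Install intrusion prevention
echo "6. Installing intrusion prevention"
yum install -y fail2ban
systemctl enable fail2ban
systemctl start fail2ban

echo "Security hardening finished"
```

## 7. Capacity Planning

### 7.1 Capacity Monitoring Metrics

```bash
#!/bin/bash
# capacity-monitoring.sh

REPORT_FILE="/tmp/capacity-report-$(date +%Y%m%d).txt"

echo "=== Capacity monitoring report $(date) ===" > "$REPORT_FILE"

# 1. Server resource usage
echo "1. Server resource usage" >> "$REPORT_FILE"
echo "CPU usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}')" >> "$REPORT_FILE"
echo "Memory usage: $(free | grep Mem | awk '{printf "%.2f%%", $3/$2 * 100.0}')" >> "$REPORT_FILE"
echo "Disk usage: $(df -h / | tail -1 | awk '{print $5}')" >> "$REPORT_FILE"

# 2. Database size
echo "2. Database size" >> "$REPORT_FILE"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "
SELECT
    table_schema AS 'Database',
    ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.tables
WHERE table_schema = 'xlxumu_db'
GROUP BY table_schema;
" >> "$REPORT_FILE"

# 3. User growth trend
echo "3. User growth trend" >> "$REPORT_FILE"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
SELECT
    DATE(created_at) as date,
    COUNT(*) as new_users
FROM users
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY DATE(created_at)
ORDER BY date DESC
LIMIT 10;
" >> "$REPORT_FILE"

# 4. Storage forecast
echo "4. Storage forecast" >> "$REPORT_FILE"
current_usage=$(df / | tail -1 | awk '{print $3}')
growth_rate=5  # assumes 5% growth per month, applied linearly
echo "Current usage: ${current_usage}KB" >> "$REPORT_FILE"
echo "Estimated in 3 months: $((current_usage * (100 + growth_rate * 3) / 100))KB" >> "$REPORT_FILE"

echo "Capacity report generated: $REPORT_FILE"
```

### 7.2 Scaling Recommendations

```bash
#!/bin/bash
# scaling-recommendations.sh

# Collect current resource usage.
# Values are rounded to integers because [ -gt ] cannot compare decimals.
cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{printf "%.0f", $2}')
mem_usage=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
disk_usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')

echo "=== Scaling recommendations ==="

# CPU
if [ "$cpu_usage" -gt 70 ]; then
    echo "🔴 CPU usage is too high ($cpu_usage%). Recommended:"
    echo "  - Add CPU cores"
    echo "  - Optimize application performance"
    echo "  - Consider horizontal scaling"
elif [ "$cpu_usage" -gt 50 ]; then
    echo "🟡 CPU usage is elevated ($cpu_usage%), keep monitoring"
else
    echo "🟢 CPU usage is normal ($cpu_usage%)"
fi

# Memory
if [ "$mem_usage" -gt 80 ]; then
    echo "🔴 Memory usage is too high ($mem_usage%). Recommended:"
    echo "  - Add memory"
    echo "  - Optimize memory usage"
    echo "  - Check for memory leaks"
elif [ "$mem_usage" -gt 60 ]; then
    echo "🟡 Memory usage is elevated ($mem_usage%), keep monitoring"
else
    echo "🟢 Memory usage is normal ($mem_usage%)"
fi

# Disk
if [ "$disk_usage" -gt 85 ]; then
    echo "🔴 Disk usage is too high ($disk_usage%). Recommended:"
    echo "  - Free disk space immediately"
    echo "  - Expand disk capacity"
    echo "  - Move data to other storage"
elif [ "$disk_usage" -gt 70 ]; then
    echo "🟡 Disk usage is elevated ($disk_usage%), keep monitoring"
else
    echo "🟢 Disk usage is normal ($disk_usage%)"
fi

# Database connections
db_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';" | tail -1 | awk '{print $2}')
max_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW VARIABLES LIKE 'max_connections';" | tail -1 | awk '{print $2}')
connection_usage=$((db_connections * 100 / max_connections))

if [ "$connection_usage" -gt 80 ]; then
    echo "🔴 Database connection usage is too high ($connection_usage%). Recommended:"
    echo "  - Raise the maximum connection limit"
    echo "  - Tune the connection pool configuration"
    echo "  - Consider read/write splitting"
fi
```

## 8. Emergency Plans

### 8.1 Emergency Response Procedure

```bash
#!/bin/bash
# emergency-response.sh

INCIDENT_TYPE=$1
SEVERITY=$2

case $INCIDENT_TYPE in
    "service_down")
        echo "Emergency handling: service down"
        # 1. Switch to the standby service immediately
        # 2. Notify the relevant people
        # 3. Start troubleshooting
        ;;
    "data_corruption")
        echo "Emergency handling: data corruption"
        # 1. Stop write operations immediately
        # 2. Start the data recovery procedure
        # 3. Notify business stakeholders
        ;;
    "security_breach")
        echo "Emergency handling: security incident"
        # 1. Isolate affected systems
        # 2. Collect evidence
        # 3. Notify the security team
        ;;
esac
```

### 8.2 Disaster Recovery Plan

```bash
#!/bin/bash
# disaster-recovery.sh

echo "=== Disaster recovery plan ==="

# 1. Assess the damage
echo "1. Assessing system damage"

# 2. Bring up the standby system
echo "2. Starting the standby system"
# Switch to the standby data center

# 3. Data recovery
echo "3. Starting data recovery"
# Restore data from the most recent backup

# 4. Service verification
echo "4. Verifying service functionality"
# Run the full functional test suite

# 5. Traffic switchover
echo "5. Switching user traffic"
# Update DNS to point at the new system

echo "Disaster recovery finished"
```

## 9. Operations Tools

### 9.1 Operations Script Collection

```bash
#!/bin/bash
# ops-toolkit.sh - operations toolkit

show_menu() {
    echo "=== Operations toolkit ==="
    echo "1. System status check"
    echo "2. Restart a service"
    echo "3. View logs"
    echo "4. Performance monitoring"
    echo "5. Backup"
    echo "6. Troubleshooting"
    echo "7. Security check"
    echo "8. Capacity analysis"
    echo "0. Exit"
    echo "================="
}

while true; do
    show_menu
    read -p "Select an option: " choice

    case $choice in
        1)
            ./scripts/system-check.sh
            ;;
        2)
            read -p "Service name: " service
            docker restart "$service"
            ;;
        3)
            read -p "Container name: " container
            docker logs --tail 100 -f "$container"
            ;;
        4)
            ./scripts/performance-monitor.sh
            ;;
        5)
            ./scripts/backup-system.sh
            ;;
        6)
            ./scripts/troubleshoot.sh
            ;;
        7)
            ./scripts/security-check.sh
            ;;
        8)
            ./scripts/capacity-monitoring.sh
            ;;
        0)
            echo "Exiting operations toolkit"
            break
            ;;
        *)
            echo "Invalid choice, please try again"
            ;;
    esac

    echo "Press Enter to continue..."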
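    read
done
```

### 9.2 Automation Scripts

```bash
#!/bin/bash
# auto-ops.sh - operations automation

CRON_FILE="/var/spool/cron/root"

# Append a cron entry only if it is not already present, so that running
# "setup" more than once does not create duplicate jobs.
add_cron_job() {
    grep -qxF "$1" "$CRON_FILE" 2>/dev/null || echo "$1" >> "$CRON_FILE"
}

# Scheduled job configuration
setup_cron_jobs() {
    echo "Configuring scheduled jobs"

    # Daily backup
    add_cron_job "0 2 * * * /opt/xlxumu/scripts/backup-system.sh"

    # Hourly system check
    add_cron_job "0 * * * * /opt/xlxumu/scripts/system-check.sh"

    # Daily log cleanup
    add_cron_job "0 3 * * * /opt/xlxumu/scripts/log-cleanup.sh"

    # Weekly performance report
    add_cron_job "0 9 * * 1 /opt/xlxumu/scripts/performance-report.sh"

    systemctl restart crond
}

# Exact-name check; a plain grep would also match names that merely contain $1
is_running() {
    docker ps --format '{{.Names}}' | grep -qx "$1"
}

# Automatic failure recovery
auto_recovery() {
    # Check each service and restart it if it is not running
    services=("mysql-master" "redis-master" "backend-api-1" "backend-api-2")

    for service in "${services[@]}"; do
        if ! is_running "$service"; then
            echo "Detected that $service is down, attempting automatic recovery"
            docker restart "$service"
            sleep 30

            # Verify the result
            if is_running "$service"; then
                echo "$service recovered successfully"
                # Send recovery notification
                send_notification "Service auto-recovered" "$service has been restored automatically"
            else
                echo "$service recovery failed, manual intervention required"
                # Send alert
                send_alert "Service recovery failed" "Automatic recovery of $service failed, manual handling required"
            fi
        fi
    done
}

# Send a notification
send_notification() {
    local title=$1
    local message=$2

    # DingTalk notification
    curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
        -H 'Content-Type: application/json' \
        -d "{\"msgtype\": \"text\",\"text\": {\"content\": \"$title: $message\"}}"
}

# Send an alert
send_alert() {
    local title=$1
    local message=$2

    # Email alert
    echo "$message" | mail -s "$title" ops@xlxumu.com

    # SMS alert (requires an SMS service integration)
    # curl -X POST "SMS_API_URL" -d "phone=13800000000&message=$message"
}

# Entry point
main() {
    case $1 in
        "setup")
            setup_cron_jobs
            ;;
        "recovery")
            auto_recovery
            ;;
        *)
            echo "Usage: $0 {setup|recovery}"
            ;;
    esac
}

main "$@"
```

## 10. Summary

### 10.1 Operations Best Practices

1. **Prevention first**: reduce failures through monitoring and preventive maintenance
2. **Automation first**: automate routine operations wherever possible
3. **Thorough documentation**: maintain detailed operations documents and runbooks
4. **Continuous improvement**: keep refining processes and tools based on operational experience
5. **Team collaboration**: establish effective collaboration within the operations team

### 10.2 Key Metrics

- **Availability**: 99.9%+
- **Response time**: < 500ms
- **Error rate**: < 0.1%
- **Recovery time**: < 30 minutes
- **Backup success rate**: 100%

### 10.3 Areas for Continued Improvement

1. **Monitoring coverage**: add more business-level metrics
2. **Automation level**: extend the scope of automated operations
3. **Failure prediction**: AI-based failure prediction and prevention
4. **Operational efficiency**: streamline operations tools and processes
5. **Security defenses**: keep strengthening security protections

---

**Document version**: v1.0.0
**Last updated**: December 2024
**Maintained by**: Operations team
**Contact**: ops@xlxumu.com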
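
As a quick sanity check on the availability target used throughout this manual (99.9%, stated as at most 8.76 hours of downtime per year), the conversion can be sketched in a few lines of shell. This is only an illustration: the function name `downtime_budget_hours` is hypothetical and not part of the platform's scripts, and a 365-day year is assumed.

```bash
#!/bin/bash
# Hypothetical helper (an assumption, not one of the platform's scripts):
# converts an availability percentage into the yearly downtime budget in
# hours, assuming a 365-day year (8760 hours).
downtime_budget_hours() {
    awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365 * 24 }'
}

# 99.9% availability leaves 0.1% of 8760 hours as the downtime budget
downtime_budget_hours 99.9   # prints 8.76
```

The same function can be called with a stricter or looser target to see how the downtime budget changes before committing to a different SLA value.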