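# Operations Manual
## Version History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2024-01-20 | Operations team | Initial version |
## 1. Operations Overview
### 1.1 Operations Goals
Keep the livestock farming management platform running stably around the clock (24×7), and deliver a highly available, high-performance, secure, and reliable service.
### 1.2 Operations Responsibilities
- **System monitoring**: monitor system health in real time
- **Incident handling**: respond to and resolve system failures quickly
- **Performance tuning**: continuously improve system performance
- **Security management**: maintain the system's security defenses
- **Backup and recovery**: keep data safe and recoverable
- **Capacity planning**: forecast and plan for capacity needs
### 1.3 Service Level Agreement (SLA)
| Metric | Target | Notes |
|--------|--------|-------|
| System availability | 99.9% | At most 8.76 hours of downtime per year |
| Response time | < 500 ms | Average API response time |
| Recovery time | < 30 minutes | From fault onset to service restoration |
| Data backup | Daily | Backups are retained for 30 days |
| Security incident response | < 15 minutes | Time to respond to a security incident |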
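
The downtime figure in the availability row follows directly from the target: 0.1% of the 8,760 hours in a 365-day year. A minimal sketch of that arithmetic, assuming a hypothetical helper named `downtime_budget` (not part of the platform's scripts), so the same budget can be derived for other periods:

```bash
#!/bin/bash
# downtime-budget.sh - illustrative helper, not part of the original tooling

# Usage: downtime_budget <availability_percent> <period_hours>
# Prints the allowed downtime in hours, rounded to two decimal places.
downtime_budget() {
    awk -v a="$1" -v h="$2" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * h }'
}

downtime_budget 99.9 8760   # 365-day year   -> 8.76
downtime_budget 99.9 720    # 30-day month   -> 0.72
```

At 99.9%, a 30-day month therefore allows roughly 43 minutes of downtime, which is a useful yardstick next to the 30-minute recovery target.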
## 2. System Architecture Monitoring
### 2.1 Monitoring Architecture Diagram
```mermaid
graph TB
    subgraph "Metrics collection"
        A[Node Exporter] --> P[Prometheus]
        B[MySQL Exporter] --> P
        C[Redis Exporter] --> P
        D[Nginx Exporter] --> P
        E[Application Metrics] --> P
    end
    subgraph "Alerting"
        P --> AM[AlertManager]
        AM --> DT[DingTalk notifications]
        AM --> WX[WeCom]
        AM --> SMS[SMS alerts]
        AM --> EMAIL[Email alerts]
    end
    subgraph "Visualization"
        P --> G[Grafana]
        G --> DB[Dashboard]
    end
    subgraph "Logging"
        F[Filebeat] --> L[Logstash]
        L --> ES[Elasticsearch]
        ES --> K[Kibana]
    end
```
### 2.2 Monitoring Metrics
#### 2.2.1 Infrastructure Monitoring
```yaml
# prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage is {{ $value }}%"
      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Instance {{ $labels.instance }} memory usage is {{ $value }}%"
      # High disk usage
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage"
          description: "Instance {{ $labels.instance }} disk usage is {{ $value }}%"
      # High disk I/O
      - alert: HighDiskIO
        expr: irate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O utilization"
          description: "Instance {{ $labels.instance }} disk I/O utilization is {{ $value }}%"
```
#### 2.2.2 Application Service Monitoring
```yaml
# prometheus/rules/application.yml
groups:
  - name: application
    rules:
      # Slow API responses
      - alert: HighAPIResponseTime
        # Aggregate buckets by le so the quantile covers all series
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API response time too high"
          description: "API 95th percentile response time is {{ $value }}s"
      # High API error rate
      - alert: HighAPIErrorRate
        # sum() on both sides is required: without it the division only matches
        # series with identical labels, so the ratio is always 1 for 5xx series
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate too high"
          description: "API error rate is {{ $value | humanizePercentage }}"
      # Service instance down
      - alert: ServiceInstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service instance down"
          description: "Instance {{ $labels.instance }} is down"
      # High database connection usage
      - alert: HighDatabaseConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection count too high"
          description: "Database connection usage is {{ $value | humanizePercentage }}"
```
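#### 2.2.3 Business Metrics Monitoring
```yaml
# prometheus/rules/business.yml
groups:
  - name: business
    rules:
      # Abnormally low user registrations
      - alert: LowUserRegistration
        # rate() is per second, so 0.1 corresponds to 360 registrations per hour
        expr: rate(user_registrations_total[1h]) < 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal user registration volume"
          description: "Registration rate over the past hour is {{ $value }} per second"
      # High transaction failure rate
      - alert: HighTransactionFailureRate
        # sum() on both sides so the failed series are divided by the total
        expr: sum(rate(transactions_total{status="failed"}[5m])) / sum(rate(transactions_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Transaction failure rate too high"
          description: "Transaction failure rate is {{ $value | humanizePercentage }}"
      # Payment failures
      - alert: PaymentAbnormal
        # This is an absolute rate of failed payments per second, not a ratio
        expr: rate(payments_total{status="failed"}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Payment failures"
          description: "Failed payments per second: {{ $value }}"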
```
### 2.3 Grafana Dashboard Configuration
#### 2.3.1 System Overview Dashboard
```json
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "System load",
        "type": "stat",
        "targets": [
          {
            "expr": "avg(node_load1)",
            "legendFormat": "1-minute load"
          }
        ]
      },
      {
        "title": "CPU usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory usage",
        "type": "graph",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Network traffic",
        "type": "graph",
        "targets": [
          {
            "expr": "irate(node_network_receive_bytes_total[5m])",
            "legendFormat": "Received - {{ instance }}"
          },
          {
            "expr": "irate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "Sent - {{ instance }}"
          }
        ]
      }
    ]
  }
}
```
#### 2.3.2 Application Performance Dashboard
```json
{
  "dashboard": {
    "title": "Application Performance",
    "panels": [
      {
        "title": "API request rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{ method }} {{ path }}"
          }
        ]
      },
      {
        "title": "API response time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "99th percentile"
          }
        ]
      },
      {
        "title": "Error rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"4..\"}[5m])",
            "legendFormat": "4xx errors"
          },
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
```
## 3. Routine Operations
### 3.1 Daily Checklist
```bash
#!/bin/bash
# daily-check.sh - daily health check script
LOG_FILE="/var/log/daily-check.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
echo "=== Daily check started $DATE ===" | tee -a "$LOG_FILE"
# 1. Check system resources
echo "1. System resources" | tee -a "$LOG_FILE"
echo "CPU load: $(uptime | awk -F'load average:' '{print $2}')" | tee -a "$LOG_FILE"
echo "Memory usage: $(free -h | grep Mem | awk '{print $3"/"$2}')" | tee -a "$LOG_FILE"
echo "Disk usage: $(df -h / | tail -1 | awk '{print $5}')" | tee -a "$LOG_FILE"
# 2. Check service status
echo "2. Service status" | tee -a "$LOG_FILE"
services=("mysql-master" "redis-master" "mongodb" "backend-api-1" "backend-api-2" "nginx")
for service in "${services[@]}"; do
    if docker ps --format "{{.Names}}" | grep -q "^${service}$"; then
        echo "✅ $service is running" | tee -a "$LOG_FILE"
    else
        echo "❌ $service is not running" | tee -a "$LOG_FILE"
    fi
done
# 3. Check network connections
# The trailing space keeps ":80 " from also matching ports such as 8080
echo "3. Network connections" | tee -a "$LOG_FILE"
echo "HTTP connections: $(netstat -an | grep ':80 ' | grep -c ESTABLISHED)" | tee -a "$LOG_FILE"
echo "HTTPS connections: $(netstat -an | grep ':443 ' | grep -c ESTABLISHED)" | tee -a "$LOG_FILE"
# 4. Check database status
echo "4. Database status" | tee -a "$LOG_FILE"
mysql_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';" | tail -1 | awk '{print $2}')
echo "MySQL connections: $mysql_connections" | tee -a "$LOG_FILE"
# redis-cli output ends lines with \r, which is stripped here
redis_connections=$(docker exec redis-master redis-cli info clients | grep connected_clients | cut -d: -f2 | tr -d '\r')
echo "Redis connections: $redis_connections" | tee -a "$LOG_FILE"
# 5. Check logs for errors
echo "5. Log errors" | tee -a "$LOG_FILE"
error_count=$(docker logs backend-api-1 --since="24h" 2>&1 | grep -ci error)
echo "Backend error log lines: $error_count" | tee -a "$LOG_FILE"
# 6. Check backup status
echo "6. Backup status" | tee -a "$LOG_FILE"
backup_today=$(ls /backup/ | grep -c "$(date +%Y%m%d)")
echo "Backup files created today: $backup_today" | tee -a "$LOG_FILE"
echo "=== Daily check finished ===" | tee -a "$LOG_FILE"
```
### 3.2 Performance Tuning
#### 3.2.1 Database Performance Tuning
```sql
-- MySQL performance queries
-- 1. Slow queries from the last day (requires slow_query_log=ON and log_output=TABLE)
SELECT * FROM mysql.slow_log WHERE start_time > DATE_SUB(NOW(), INTERVAL 1 DAY);
-- 2. Running threads, including any waiting on table locks
SHOW PROCESSLIST;
-- 3. Index statistics
SELECT
    table_schema,
    table_name,
    index_name,
    cardinality,
    sub_part,
    packed,
    nullable,
    index_type
FROM information_schema.statistics
WHERE table_schema = 'xlxumu_db';
-- 4. Table sizes
SELECT
    table_name,
    ROUND(((data_length + index_length) / 1024 / 1024), 2) AS 'Size (MB)'
FROM information_schema.tables
WHERE table_schema = 'xlxumu_db'
ORDER BY (data_length + index_length) DESC;
```
```bash
#!/bin/bash
# mysql-optimize.sh - MySQL maintenance script
# Note: OPTIMIZE TABLE rebuilds tables and can block writes; run it in a maintenance window
# 1. Analyze tables
echo "Analyzing tables..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
ANALYZE TABLE users, farms, animals, transactions;
"
# 2. Optimize tables
echo "Optimizing tables..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
OPTIMIZE TABLE users, farms, animals, transactions;
"
# 3. Check tables
echo "Checking table integrity..."
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
CHECK TABLE users, farms, animals, transactions;
"
echo "MySQL maintenance finished"
```
#### 3.2.2 Redis Performance Tuning
```bash
#!/bin/bash
# redis-optimize.sh - Redis tuning script
# 1. Check Redis memory usage
echo "Redis memory usage:"
docker exec redis-master redis-cli info memory
# 2. Check the slow log
echo "Redis slow log:"
docker exec redis-master redis-cli slowlog get 10
# 3. Report key expiry
# Redis removes expired keys on its own, so this step only reports how many keys
# carry a TTL (keys/expires/avg_ttl per database) instead of querying every key
echo "Key expiry statistics:"
docker exec redis-master redis-cli info keyspace
# 4. Find big keys
echo "Scanning for big keys..."
docker exec redis-master redis-cli --bigkeys
echo "Redis checks finished"
```
### 3.3 Log Management
#### 3.3.1 Log Rotation Configuration
```bash
# /etc/logrotate.d/xlxumu
/var/log/xlxumu/*.log {
daily
missingok
rotate 30
compress
delaycompress
notifempty
create 644 root root
postrotate
docker kill -s USR1 $(docker ps -q --filter name=backend-api)
endscript
}
/var/log/nginx/*.log {
daily
missingok
rotate 30
compress
delaycompress
notifempty
create 644 nginx nginx
postrotate
docker exec nginx-lb nginx -s reopen
endscript
}
```
#### 3.3.2 Log Analysis Script
```bash
#!/bin/bash
# log-analysis.sh - log analysis script
LOG_DIR="/var/log/xlxumu"
ACCESS_LOG="/var/log/nginx/access.log"
REPORT_FILE="/tmp/log-report-$(date +%Y%m%d).txt"
# Date as written in nginx access logs, e.g. 20/Jan/2024 (LC_ALL=C keeps the month name in English)
TODAY=$(LC_ALL=C date +%d/%b/%Y)
echo "=== Log analysis report $(date) ===" > "$REPORT_FILE"
# 1. Error log count
echo "1. Error log lines" >> "$REPORT_FILE"
grep -i error "$LOG_DIR"/*.log | wc -l >> "$REPORT_FILE"
# 2. Request count
echo "2. Requests today" >> "$REPORT_FILE"
grep -c "$TODAY" "$ACCESS_LOG" >> "$REPORT_FILE"
# 3. Status code breakdown
# Filter by date first: once awk has extracted a single field, the date is no longer in the line
echo "3. HTTP status codes" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '{print $9}' | sort | uniq -c | sort -nr >> "$REPORT_FILE"
# 4. Slow requests (assumes the log format ends with the request time in seconds)
echo "4. Slow requests (>1s)" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '$NF > 1.0' | wc -l >> "$REPORT_FILE"
# 5. Most requested API paths
echo "5. Top API paths" >> "$REPORT_FILE"
grep "$TODAY" "$ACCESS_LOG" | awk '{print $7}' | grep "/api/" | sort | uniq -c | sort -nr | head -10 >> "$REPORT_FILE"
echo "Log analysis finished, report saved to: $REPORT_FILE"
```
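
When a report filters access-log lines by date, the date filter has to run before any field is extracted, because the extracted field no longer contains the timestamp. A minimal sketch of a per-day status code count, assuming the default nginx `combined` log format (timestamp in field 4, status in field 9); the function name `status_counts` is hypothetical:

```bash
#!/bin/bash
# Usage: status_counts <dd/Mon/yyyy> < access.log
# Prints "<count> <status>" lines for the given day, most frequent first.
status_counts() {
    local day="$1"
    # Field 4 looks like "[20/Jan/2024:10:00:00"; keep only lines from the requested day
    awk -v day="$day" 'index($4, "[" day ":") == 1 { count[$9]++ }
        END { for (s in count) print count[s], s }' | sort -k1,1nr -k2,2
}

# Example (path is the one used elsewhere in this document):
#   status_counts "$(LC_ALL=C date +%d/%b/%Y)" < /var/log/nginx/access.log
```

Doing the filtering and counting in a single `awk` pass also avoids reading the log once per report section.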
## 4. Backup and Recovery
### 4.1 Automated Backup Strategy
```bash
#!/bin/bash
# backup-system.sh - system backup script
# Run from the project directory (the one containing ./config, ./nginx, and .env.production)
BACKUP_DIR="/backup"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/xlxumu_$DATE"
RETENTION_DAYS=30
# Create the backup directory
mkdir -p "$BACKUP_PATH"
echo "Starting system backup: $DATE"
# 1. Back up the MySQL databases
echo "Backing up MySQL..."
if docker exec mysql-master mysqldump -u root -p"${MYSQL_ROOT_PASSWORD}" \
    --single-transaction \
    --routines \
    --triggers \
    --all-databases > "$BACKUP_PATH/mysql_backup.sql"; then
    echo "✅ MySQL backup succeeded"
else
    echo "❌ MySQL backup failed"
    exit 1
fi
# 2. Back up Redis data
echo "Backing up Redis..."
# redis-cli runs inside the container, so dump to a container path and then copy it to the host
if docker exec redis-master redis-cli --rdb /tmp/redis_backup.rdb \
    && docker cp redis-master:/tmp/redis_backup.rdb "$BACKUP_PATH/redis_backup.rdb"; then
    echo "✅ Redis backup succeeded"
else
    echo "❌ Redis backup failed"
fi
# 3. Back up MongoDB data
echo "Backing up MongoDB..."
# mongodump also writes inside the container; copy the dump directory out afterwards
if docker exec mongodb mongodump --out /tmp/mongodb_backup \
    && docker cp mongodb:/tmp/mongodb_backup "$BACKUP_PATH/mongodb_backup"; then
    echo "✅ MongoDB backup succeeded"
else
    echo "❌ MongoDB backup failed"
fi
# 4. Back up application configuration
echo "Backing up application configuration..."
cp -r ./config "$BACKUP_PATH/"
cp -r ./nginx "$BACKUP_PATH/"
cp .env.production "$BACKUP_PATH/"
# 5. Back up uploaded files
echo "Backing up uploaded files..."
if [ -d "./uploads" ]; then
    tar -czf "$BACKUP_PATH/uploads.tar.gz" ./uploads
fi
# 6. Compress the backup
echo "Compressing backup..."
cd "$BACKUP_DIR" || exit 1
tar -czf "xlxumu_$DATE.tar.gz" "xlxumu_$DATE/" && rm -rf "xlxumu_$DATE/"
# 7. Remove expired backups
echo "Removing expired backups..."
find "$BACKUP_DIR" -name "xlxumu_*.tar.gz" -mtime +"$RETENTION_DAYS" -delete
# 8. Upload to cloud storage (optional)
echo "Uploading backup to cloud storage..."
# aws s3 cp xlxumu_$DATE.tar.gz s3://your-backup-bucket/
echo "System backup finished: xlxumu_$DATE.tar.gz"
# 9. Send a backup notification
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"msgtype\": \"text\",\"text\": {\"content\": \"System backup finished: xlxumu_$DATE.tar.gz\"}}"
```
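
A backup is only useful if the archive can actually be read back. The following is a minimal sketch of a post-backup integrity check, assuming a hypothetical helper named `verify_archive`; it only confirms that the file is non-empty and that `tar` can list its contents, not that the dumps inside are complete.

```bash
#!/bin/bash
# Usage: verify_archive <path-to-tar.gz>
# Returns 0 and prints "ok: <path>" if the archive is non-empty and readable.
verify_archive() {
    local archive="$1"
    if [ ! -s "$archive" ]; then
        echo "missing or empty: $archive" >&2
        return 1
    fi
    # Listing the archive forces gzip and tar to read it end to end
    if ! tar -tzf "$archive" > /dev/null 2>&1; then
        echo "unreadable archive: $archive" >&2
        return 1
    fi
    echo "ok: $archive"
}

# Example: verify_archive /backup/xlxumu_20240120_020000.tar.gz
```

A check like this could run right after the compression step, before expired backups are removed, so a corrupt new archive never replaces the last good one.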
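### 4.2 Data Restore Procedure
```bash
#!/bin/bash
# restore-system.sh - system restore script
# Run from the project directory (the one containing docker-compose.yml)
BACKUP_FILE=$1
BACKUP_DIR="/backup"
if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup_file>"
    echo "Available backups:"
    ls -la "$BACKUP_DIR"/xlxumu_*.tar.gz
    exit 1
fi
echo "Starting system restore: $BACKUP_FILE"
# 1. Extract the backup archive
# Extract with -C instead of cd, so the docker-compose and cp commands below
# still run in the project directory rather than in /backup
echo "Extracting backup archive..."
BACKUP_NAME=$(basename "$BACKUP_FILE" .tar.gz)
tar -xzf "$BACKUP_DIR/$BACKUP_NAME.tar.gz" -C "$BACKUP_DIR" || exit 1
RESTORE_PATH="$BACKUP_DIR/$BACKUP_NAME"
# 2. Stop services
echo "Stopping services..."
docker-compose down
# 3. Restore the MySQL databases
echo "Restoring MySQL..."
docker-compose -f docker-compose.mysql.yml up -d mysql-master
sleep 30
if docker exec -i mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" < "$RESTORE_PATH/mysql_backup.sql"; then
    echo "✅ MySQL restore succeeded"
else
    echo "❌ MySQL restore failed"
    exit 1
fi
# 4. Restore Redis data
echo "Restoring Redis..."
# "down" removed the containers, so recreate Redis first. Stop it before replacing
# dump.rdb; otherwise the save on shutdown would overwrite the restored file.
docker-compose up -d redis-master
docker stop redis-master
docker cp "$RESTORE_PATH/redis_backup.rdb" redis-master:/data/dump.rdb
docker start redis-master
# 5. Restore MongoDB data
echo "Restoring MongoDB..."
docker-compose up -d mongodb
sleep 10
# mongorestore runs inside the container, so copy the dump in first
docker cp "$RESTORE_PATH/mongodb_backup" mongodb:/tmp/mongodb_backup
docker exec mongodb mongorestore /tmp/mongodb_backup
# 6. Restore application configuration
echo "Restoring application configuration..."
cp -r "$RESTORE_PATH/config" ./
cp -r "$RESTORE_PATH/nginx" ./
cp "$RESTORE_PATH/.env.production" ./
# 7. Restore uploaded files
echo "Restoring uploaded files..."
if [ -f "$RESTORE_PATH/uploads.tar.gz" ]; then
    tar -xzf "$RESTORE_PATH/uploads.tar.gz"
fi
# 8. Restart services
echo "Restarting services..."
docker-compose up -d
# 9. Health check
echo "Running health check..."
sleep 60
./scripts/health-check.sh
echo "System restore finished"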
```
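## 5. Incident Handling
### 5.1 Incident Response Flow
```mermaid
graph TD
    A[Fault occurs] --> B[Monitoring system raises alert]
    B --> C[On-call engineer receives alert]
    C --> D[Initial fault localization]
    D --> E{Assess severity}
    E -->|P0 critical| F[Immediate response<br/>within 15 minutes]
    E -->|P1 major| G[Fast response<br/>within 30 minutes]
    E -->|P2 normal| H[Standard response<br/>within 2 hours]
    E -->|P3 minor| I[Scheduled response<br/>within 24 hours]
    F --> J[Fault handling]
    G --> J
    H --> J
    I --> J
    J --> K[Service restored]
    K --> L[Root cause analysis]
    L --> M[Improvement actions]
    M --> N[Documentation update]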
```
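
The severity levels in the flow map to fixed response deadlines, which is handy to encode once so that alerting or paging scripts agree with the diagram. A minimal sketch, assuming a hypothetical helper named `response_deadline_minutes`; the deadlines are the ones shown in the flowchart:

```bash
#!/bin/bash
# Usage: response_deadline_minutes <P0|P1|P2|P3>
# Prints the response deadline in minutes for the given severity level.
response_deadline_minutes() {
    case "$1" in
        P0) echo 15 ;;     # critical: within 15 minutes
        P1) echo 30 ;;     # major: within 30 minutes
        P2) echo 120 ;;    # normal: within 2 hours
        P3) echo 1440 ;;   # minor: within 24 hours
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}

response_deadline_minutes P0
```

Returning a non-zero status for an unknown level lets a caller refuse to file an incident without a valid severity.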
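### 5.2 Common Incident Runbooks
#### 5.2.1 Service Unresponsive
```bash
#!/bin/bash
# fix-service-unresponsive.sh
SERVICE_NAME=$1
if [ -z "$SERVICE_NAME" ]; then
    echo "Usage: $0 <service_name>"
    exit 1
fi
echo "Handling unresponsive service: $SERVICE_NAME"
# 1. Check container status
echo "1. Container status"
docker ps -a | grep "$SERVICE_NAME"
# 2. Check container logs
echo "2. Container logs"
docker logs --tail 100 "$SERVICE_NAME"
# 3. Check resource usage
echo "3. Resource usage"
docker stats --no-stream "$SERVICE_NAME"
# 4. Try restarting the service
echo "4. Restarting service"
docker restart "$SERVICE_NAME"
# 5. Wait for the service to start
echo "5. Waiting for service to start"
sleep 30
# 6. Health check
echo "6. Running health check"
case $SERVICE_NAME in
    "backend-api-1")
        curl -f http://localhost:3001/health
        ;;
    "backend-api-2")
        curl -f http://localhost:3002/health
        ;;
    "nginx")
        curl -f http://localhost:80/health
        ;;
    *)
        # No HTTP health endpoint known: fall back to checking that the container is running
        docker ps --format "{{.Names}}" | grep -qx "$SERVICE_NAME"
        ;;
esac
if [ $? -eq 0 ]; then
    echo "✅ Service is back to normal"
else
    echo "❌ Service is still failing, further investigation needed"
fi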
```
#### 5.2.2 Database Connection Errors
```bash
#!/bin/bash
# fix-database-connection.sh
echo "Handling database connection errors"
# 1. Check MySQL container status
echo "1. MySQL container status"
docker ps | grep mysql-master
# 2. Check MySQL processes
echo "2. MySQL processes"
docker exec mysql-master ps aux | grep mysql
# 3. Check MySQL connection count
echo "3. MySQL connection count"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';"
# 4. Check running queries (long-running entries point to slow queries)
echo "4. MySQL running queries"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW PROCESSLIST;"
# 5. Check the MySQL error log
echo "5. MySQL error log"
docker logs --tail 50 mysql-master 2>&1 | grep -i error
# 6. Restart MySQL if necessary
read -r -p "Restart the MySQL service? (y/n): " restart_mysql
if [ "$restart_mysql" = "y" ]; then
    echo "Restarting MySQL..."
    docker restart mysql-master
    sleep 30
    # Verify the service
    if docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SELECT 1;"; then
        echo "✅ MySQL is back to normal"
    else
        echo "❌ MySQL is still failing"
    fi
fi
```
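#### 5.2.3 Low Disk Space
```bash
#!/bin/bash
# fix-disk-space.sh
echo "Handling low disk space"
# 1. Check disk usage
echo "1. Disk usage"
df -h
# 2. Find large files
echo "2. Large files (>100MB)"
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | head -20
# 3. Clean up Docker resources
# Warning: these commands delete stopped containers, all unused images, and volumes
# not attached to any container. Confirm no stopped service still needs its volume.
echo "3. Cleaning up Docker resources"
docker system prune -f
docker volume prune -f
docker image prune -a -f
# 4. Clean up log files
echo "4. Truncating log files older than 7 days"
find /var/log -name "*.log" -type f -mtime +7 -exec truncate -s 0 {} \;
# 5. Clean up temporary files
# Only remove files untouched for more than 7 days; running services may still use /tmp
echo "5. Cleaning up temporary files"
find /tmp /var/tmp -type f -mtime +7 -delete
# 6. Clean up old backups
echo "6. Removing old backups"
find /backup -name "*.tar.gz" -mtime +30 -delete
# 7. Check disk space again
echo "7. Disk usage after cleanup"
df -h
echo "Disk cleanup finished"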
```
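### 5.3 Preventive Measures
#### 5.3.1 Preventive Maintenance Script
```bash
#!/bin/bash
# preventive-maintenance.sh
echo "Starting preventive maintenance"
# 1. System updates
echo "1. Checking for system updates"
yum check-update
# 2. Drop system caches (sync first so dirty pages are written to disk)
echo "2. Dropping system caches"
sync
echo 3 > /proc/sys/vm/drop_caches
# 3. Check system services
echo "3. Checking system services"
systemctl status docker
systemctl status firewalld
# 4. Check listening ports
echo "4. Checking listening ports"
netstat -tuln | grep -E ":(80|443|3000|3306|6379|27017)[[:space:]]"
# 5. Check SSL certificate validity
echo "5. Checking SSL certificate validity"
openssl x509 -in /etc/letsencrypt/live/www.xlxumu.com/cert.pem -noout -dates
# 6. Database maintenance
echo "6. Database maintenance"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "OPTIMIZE TABLE xlxumu_db.users, xlxumu_db.farms, xlxumu_db.animals;"
# 7. Performance baseline (curl-format.txt must exist in the working directory)
echo "7. Performance baseline"
curl -w "@curl-format.txt" -o /dev/null -s http://localhost/api/health
echo "Preventive maintenance finished"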
```
## 6. Security Operations
### 6.1 Security Checklist
```bash
#!/bin/bash
# security-check.sh
echo "=== Security check started ==="
# 1. Check system users
echo "1. System users"
awk -F: '$3 >= 1000 {print $1}' /etc/passwd
# 2. Check SSH configuration
echo "2. SSH configuration"
grep -E "(PermitRootLogin|PasswordAuthentication|Port)" /etc/ssh/sshd_config
# 3. Check firewall status
echo "3. Firewall status"
firewall-cmd --list-all
# 4. Check open ports
echo "4. Open ports"
netstat -tuln
# 5. Check failed login attempts
echo "5. Failed login attempts"
grep "Failed password" /var/log/secure | tail -10
# 6. Check file permissions
echo "6. Permissions on critical files"
ls -la /etc/passwd /etc/shadow /etc/ssh/sshd_config
# 7. Check Docker security
echo "7. Docker security"
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    -v /usr/local/bin/docker:/usr/local/bin/docker \
    docker/docker-bench-security
# 8. Check the SSL certificate
echo "8. SSL certificate"
echo | openssl s_client -servername www.xlxumu.com -connect www.xlxumu.com:443 2>/dev/null | openssl x509 -noout -dates
echo "=== Security check finished ==="
```
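### 6.2 Security Hardening
```bash
#!/bin/bash
# security-hardening.sh
echo "Starting security hardening"
# 1. Disable unneeded services (the units may not be installed, so errors are ignored)
echo "1. Disabling unneeded services"
systemctl disable telnet 2>/dev/null
systemctl disable rsh 2>/dev/null
systemctl disable rlogin 2>/dev/null
# 2. Harden SSH
echo "2. Hardening SSH"
# Match the directive whether or not it is commented out
sed -i -E 's/^#?PermitRootLogin .*/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i -E 's/^#?PasswordAuthentication .*/PasswordAuthentication no/' /etc/ssh/sshd_config
sed -i -E 's/^#?Port .*/Port 2222/' /etc/ssh/sshd_config
# Validate the configuration before touching the firewall or restarting sshd
sshd -t || { echo "Invalid sshd_config, aborting"; exit 1; }
# 3. Configure firewall rules
# Open the new port before restarting sshd to avoid a lockout. With SELinux enforcing,
# the port must also be allowed first: semanage port -a -t ssh_port_t -p tcp 2222
echo "3. Configuring firewall rules"
firewall-cmd --permanent --add-port=2222/tcp
firewall-cmd --permanent --remove-service=ssh
firewall-cmd --reload
systemctl restart sshd
# 4. Set file permissions
echo "4. Setting file permissions"
chmod 600 /etc/ssh/sshd_config
chmod 644 /etc/passwd
chmod 000 /etc/shadow
# 5. Configure audit logging (only add the rule once)
echo "5. Configuring audit logging"
grep -q '^auth\.\*' /etc/rsyslog.conf || echo "auth.* /var/log/auth.log" >> /etc/rsyslog.conf
systemctl restart rsyslog
# 6. Install intrusion prevention (fail2ban is provided by the EPEL repository)
echo "6. Installing fail2ban"
yum install -y epel-release
yum install -y fail2ban
systemctl enable fail2ban
systemctl start fail2ban
echo "Security hardening finished"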
```
## 7. Capacity Planning
### 7.1 Capacity Monitoring Metrics
```bash
#!/bin/bash
# capacity-monitoring.sh
REPORT_FILE="/tmp/capacity-report-$(date +%Y%m%d).txt"
echo "=== Capacity report $(date) ===" > "$REPORT_FILE"
# 1. Server resource usage
echo "1. Server resource usage" >> "$REPORT_FILE"
# CPU usage = 100 - idle, taken from the second vmstat sample (column 15 is idle)
echo "CPU usage: $(vmstat 1 2 | tail -1 | awk '{print 100 - $15 "%"}')" >> "$REPORT_FILE"
echo "Memory usage: $(free | grep Mem | awk '{printf("%.2f%%"), $3/$2 * 100.0}')" >> "$REPORT_FILE"
echo "Disk usage: $(df -h / | tail -1 | awk '{print $5}')" >> "$REPORT_FILE"
# 2. Database size
echo "2. Database size" >> "$REPORT_FILE"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "
SELECT
    table_schema AS 'Database',
    ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.tables
WHERE table_schema = 'xlxumu_db'
GROUP BY table_schema;
" >> "$REPORT_FILE"
# 3. User growth trend
echo "3. User growth trend" >> "$REPORT_FILE"
docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" xlxumu_db -e "
SELECT
    DATE(created_at) as date,
    COUNT(*) as new_users
FROM users
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY DATE(created_at)
ORDER BY date DESC
LIMIT 10;
" >> "$REPORT_FILE"
# 4. Storage forecast
echo "4. Storage forecast" >> "$REPORT_FILE"
current_usage=$(df / | tail -1 | awk '{print $3}')
growth_rate=5 # assumes 5% growth per month (simple, not compounded)
echo "Current usage: ${current_usage}KB" >> "$REPORT_FILE"
echo "Projected in 3 months: $((current_usage * (100 + growth_rate * 3) / 100))KB" >> "$REPORT_FILE"
echo "Capacity report generated: $REPORT_FILE"
```
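
The storage forecast above uses simple (non-compounding) growth: usage after *n* months is `current × (100 + rate × n) / 100`. A minimal sketch of that formula next to a compounding variant, so the two can be compared; both function names are hypothetical, and the 5% monthly rate is the same assumption the script makes:

```bash
#!/bin/bash
# Usage: project_usage_linear <current_kb> <monthly_growth_percent> <months>
# Simple growth: the same absolute amount is added every month.
project_usage_linear() {
    echo $(( $1 * (100 + $2 * $3) / 100 ))
}

# Usage: project_usage_compound <current_kb> <monthly_growth_percent> <months>
# Compound growth: each month grows from the previous month's total.
project_usage_compound() {
    awk -v c="$1" -v g="$2" -v m="$3" 'BEGIN { printf "%.0f\n", c * (1 + g / 100) ^ m }'
}

project_usage_linear 1000 5 3     # 1150
project_usage_compound 1000 5 3   # 1158
```

Over a three-month horizon the two barely differ, but the gap widens over longer horizons, so the compounding form is the more conservative choice for annual planning.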
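### 7.2 Scaling Recommendations
```bash
#!/bin/bash
# scaling-recommendations.sh
# Collect current resource usage as integers (the [ -gt ] tests below cannot compare decimals)
# CPU usage = 100 - idle, taken from the second vmstat sample (column 15 is idle)
cpu_usage=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
mem_usage=$(free | grep Mem | awk '{printf("%.0f"), $3/$2 * 100.0}')
disk_usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
echo "=== Scaling recommendations ==="
# CPU
if [ "$cpu_usage" -gt 70 ]; then
    echo "🔴 CPU usage is too high ($cpu_usage%). Recommendations:"
    echo " - Add CPU cores"
    echo " - Optimize application performance"
    echo " - Consider scaling horizontally"
elif [ "$cpu_usage" -gt 50 ]; then
    echo "🟡 CPU usage is elevated ($cpu_usage%). Keep monitoring"
else
    echo "🟢 CPU usage is normal ($cpu_usage%)"
fi
# Memory
if [ "$mem_usage" -gt 80 ]; then
    echo "🔴 Memory usage is too high ($mem_usage%). Recommendations:"
    echo " - Add memory"
    echo " - Optimize memory usage"
    echo " - Check for memory leaks"
elif [ "$mem_usage" -gt 60 ]; then
    echo "🟡 Memory usage is elevated ($mem_usage%). Keep monitoring"
else
    echo "🟢 Memory usage is normal ($mem_usage%)"
fi
# Disk
if [ "$disk_usage" -gt 85 ]; then
    echo "🔴 Disk usage is too high ($disk_usage%). Recommendations:"
    echo " - Free up disk space immediately"
    echo " - Expand disk capacity"
    echo " - Move data to other storage"
elif [ "$disk_usage" -gt 70 ]; then
    echo "🟡 Disk usage is elevated ($disk_usage%). Keep monitoring"
else
    echo "🟢 Disk usage is normal ($disk_usage%)"
fi
# Database
db_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW STATUS LIKE 'Threads_connected';" | tail -1 | awk '{print $2}')
max_connections=$(docker exec mysql-master mysql -u root -p"${MYSQL_ROOT_PASSWORD}" -e "SHOW VARIABLES LIKE 'max_connections';" | tail -1 | awk '{print $2}')
# Skip the check if the values could not be read, to avoid dividing by zero
if [ -n "$db_connections" ] && [ "${max_connections:-0}" -gt 0 ]; then
    connection_usage=$((db_connections * 100 / max_connections))
    if [ "$connection_usage" -gt 80 ]; then
        echo "🔴 Database connection usage is too high ($connection_usage%). Recommendations:"
        echo " - Raise the maximum connection limit"
        echo " - Tune the connection pool"
        echo " - Consider read/write splitting"
    fi
fi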
```
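
The CPU, memory, and disk checks all follow the same pattern: compare a percentage against a warning threshold and a critical threshold. A minimal sketch of that pattern as one reusable function, assuming a hypothetical name `classify_usage`; it truncates decimal input so that shell integer comparison does not fail on values such as `72.5`:

```bash
#!/bin/bash
# Usage: classify_usage <value_percent> <warn_threshold> <crit_threshold>
# Prints "critical", "warning", or "normal".
classify_usage() {
    local value="${1%.*}"   # drop any fractional part: 72.5 -> 72
    value="${value:-0}"     # treat input such as ".5" as 0
    if [ "$value" -gt "$3" ]; then
        echo critical
    elif [ "$value" -gt "$2" ]; then
        echo warning
    else
        echo normal
    fi
}

classify_usage 72.5 50 70   # CPU thresholds from the script above
```

With this helper, each resource check reduces to one call with its own thresholds (50/70 for CPU, 60/80 for memory, 70/85 for disk).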
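## 8. Emergency Plans
### 8.1 Emergency Response Procedure
```bash
#!/bin/bash
# emergency-response.sh
INCIDENT_TYPE=$1
SEVERITY=$2
case $INCIDENT_TYPE in
    "service_down")
        echo "Emergency handling: service down"
        # 1. Switch to the standby service immediately
        # 2. Notify the people involved
        # 3. Start troubleshooting
        ;;
    "data_corruption")
        echo "Emergency handling: data corruption"
        # 1. Stop all writes immediately
        # 2. Start the data recovery procedure
        # 3. Notify business stakeholders
        ;;
    "security_breach")
        echo "Emergency handling: security incident"
        # 1. Isolate the affected systems
        # 2. Collect evidence
        # 3. Notify the security team
        ;;
esac
```
### 8.2 Disaster Recovery Plan
```bash
#!/bin/bash
# disaster-recovery.sh
echo "=== Disaster recovery plan ==="
# 1. Assess the damage
echo "1. Assessing system damage"
# 2. Bring up the standby system
echo "2. Starting the standby system"
# Switch to the standby data center
# 3. Data recovery
echo "3. Starting data recovery"
# Restore data from the most recent backup
# 4. Service verification
echo "4. Verifying service functionality"
# Run the full functional test suite
# 5. Switch traffic
echo "5. Switching user traffic"
# Update DNS to point to the new system
echo "Disaster recovery finished"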
```
## 9. Operations Tooling
### 9.1 Operations Script Collection
```bash
#!/bin/bash
# ops-toolkit.sh - operations toolbox
show_menu() {
    echo "=== Operations toolbox ==="
    echo "1. System status check"
    echo "2. Restart a service"
    echo "3. View logs"
    echo "4. Performance monitoring"
    echo "5. Backup"
    echo "6. Troubleshooting"
    echo "7. Security check"
    echo "8. Capacity analysis"
    echo "0. Exit"
    echo "================="
}
while true; do
    show_menu
    read -r -p "Select an option: " choice
    case $choice in
        1)
            ./scripts/system-check.sh
            ;;
        2)
            read -r -p "Service name: " service
            docker restart "$service"
            ;;
        3)
            read -r -p "Container name: " container
            docker logs --tail 100 -f "$container"
            ;;
        4)
            ./scripts/performance-monitor.sh
            ;;
        5)
            ./scripts/backup-system.sh
            ;;
        6)
            ./scripts/troubleshoot.sh
            ;;
        7)
            ./scripts/security-check.sh
            ;;
        8)
            ./scripts/capacity-monitoring.sh
            ;;
        0)
            echo "Exiting operations toolbox"
            break
            ;;
        *)
            echo "Invalid choice, please try again"
            ;;
    esac
    echo "Press Enter to continue..."
    read -r
done
```
### 9.2 Automation Script
```bash
#!/bin/bash
# auto-ops.sh - automated operations
# Configure cron jobs
setup_cron_jobs() {
    echo "Configuring cron jobs"
    # Rebuild the xlxumu entries in root's crontab instead of appending to the spool
    # file, so running setup more than once does not duplicate them
    {
        crontab -l 2>/dev/null | grep -v '/opt/xlxumu/scripts/'
        # Daily backup
        echo "0 2 * * * /opt/xlxumu/scripts/backup-system.sh"
        # Hourly system check
        echo "0 * * * * /opt/xlxumu/scripts/system-check.sh"
        # Daily log cleanup
        echo "0 3 * * * /opt/xlxumu/scripts/log-cleanup.sh"
        # Weekly performance report
        echo "0 9 * * 1 /opt/xlxumu/scripts/performance-report.sh"
    } | crontab -
}
# Return success if a container with exactly this name is running
is_running() {
    docker ps --format "{{.Names}}" | grep -qx "$1"
}
# Automatic recovery
auto_recovery() {
    # Check each service and restart it if it is not running
    services=("mysql-master" "redis-master" "backend-api-1" "backend-api-2")
    for service in "${services[@]}"; do
        if ! is_running "$service"; then
            echo "$service is not running, attempting automatic recovery"
            docker restart "$service"
            sleep 30
            # Verify the result
            if is_running "$service"; then
                echo "$service recovered"
                # Send a recovery notification
                send_notification "Service auto-recovered" "$service has been restored automatically"
            else
                echo "$service recovery failed, manual intervention required"
                # Send an alert
                send_alert "Service recovery failed" "Automatic recovery of $service failed, manual action required"
            fi
        fi
    done
}
# Send a notification
send_notification() {
    local title=$1
    local message=$2
    # DingTalk notification
    curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
        -H 'Content-Type: application/json' \
        -d "{\"msgtype\": \"text\",\"text\": {\"content\": \"$title: $message\"}}"
}
# Send an alert
send_alert() {
    local title=$1
    local message=$2
    # Email alert
    echo "$message" | mail -s "$title" ops@xlxumu.com
    # SMS alert (requires an SMS service integration)
    # curl -X POST "SMS_API_URL" -d "phone=13800000000&message=$message"
}
# Entry point
main() {
    case $1 in
        "setup")
            setup_cron_jobs
            ;;
        "recovery")
            auto_recovery
            ;;
        *)
            echo "Usage: $0 {setup|recovery}"
            ;;
    esac
}
main "$@"
```
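
Setup steps that add lines to a crontab or configuration file tend to be run more than once, and a plain append then leaves duplicate entries behind. A minimal sketch of an idempotent append, assuming a hypothetical helper named `ensure_line` that works on any plain-text file:

```bash
#!/bin/bash
# Usage: ensure_line <file> <line>
# Appends <line> to <file> only if an identical line is not already present.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    # -x matches whole lines, -F treats the text literally (cron lines contain "*")
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   ensure_line /etc/rsyslog.conf "auth.* /var/log/auth.log"
```

Because the check is an exact whole-line match, changing a schedule still adds a second entry for the same script; removing the old line first remains the caller's responsibility.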
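## 10. Summary
### 10.1 Operations Best Practices
1. **Prevention first**: reduce failures through monitoring and preventive maintenance
2. **Automate first**: automate routine operations wherever possible
3. **Thorough documentation**: maintain detailed operations documentation and runbooks
4. **Continuous improvement**: keep refining processes and tools based on operational experience
5. **Team collaboration**: establish effective collaboration within the operations team
### 10.2 Key Metrics
- **Availability**: 99.9%+
- **Response time**: < 500 ms
- **Error rate**: < 0.1%
- **Recovery time**: < 30 minutes
- **Backup success rate**: 100%
### 10.3 Areas for Ongoing Improvement
1. **Better monitoring coverage**: add more business metrics
2. **More automation**: extend the scope of automated operations
3. **Failure prediction**: AI-based failure prediction and prevention
4. **Higher operational efficiency**: streamline tools and processes
5. **Stronger security**: keep strengthening security defenses
---
**Document version**: v1.0.0
**Last updated**: December 2024
**Maintained by**: Operations team
**Contact**: ops@xlxumu.com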