1359 lines
33 KiB
Markdown
1359 lines
33 KiB
Markdown
|
|
# 解班客项目运维文档
|
|||
|
|
|
|||
|
|
## 1. 运维概述
|
|||
|
|
|
|||
|
|
### 1.1 运维目标
|
|||
|
|
- **高可用性**:确保系统7×24小时稳定运行,可用性达到99.9%
|
|||
|
|
- **高性能**:保证系统响应速度,接口响应时间<500ms
|
|||
|
|
- **安全可靠**:确保数据安全,防范各类安全威胁
|
|||
|
|
- **快速恢复**:故障发生时能够快速定位和恢复
|
|||
|
|
|
|||
|
|
### 1.2 运维架构
|
|||
|
|
|
|||
|
|
```mermaid
|
|||
|
|
graph TB
|
|||
|
|
A[监控告警系统] --> B[日志分析系统]
|
|||
|
|
A --> C[性能监控]
|
|||
|
|
A --> D[业务监控]
|
|||
|
|
|
|||
|
|
B --> E[ELK Stack]
|
|||
|
|
C --> F[Prometheus+Grafana]
|
|||
|
|
D --> G[自定义监控]
|
|||
|
|
|
|||
|
|
H[自动化运维] --> I[CI/CD流水线]
|
|||
|
|
H --> J[自动化部署]
|
|||
|
|
H --> K[自动化备份]
|
|||
|
|
|
|||
|
|
L[故障处理] --> M[告警响应]
|
|||
|
|
L --> N[故障定位]
|
|||
|
|
L --> O[快速恢复]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 1.3 运维团队职责
|
|||
|
|
|
|||
|
|
| 角色 | 职责 | 技能要求 |
|
|||
|
|
|------|------|----------|
|
|||
|
|
| 运维工程师 | 系统监控、故障处理、日常维护 | Linux、Docker、监控工具 |
|
|||
|
|
| DBA | 数据库管理、性能优化、备份恢复 | MySQL、Redis、数据库优化 |
|
|||
|
|
| 安全工程师 | 安全防护、漏洞扫描、安全审计 | 网络安全、渗透测试 |
|
|||
|
|
| 架构师 | 架构优化、容量规划、技术选型 | 系统架构、性能调优 |
|
|||
|
|
|
|||
|
|
## 2. 系统监控
|
|||
|
|
|
|||
|
|
### 2.1 监控体系
|
|||
|
|
|
|||
|
|
```mermaid
|
|||
|
|
graph TD
|
|||
|
|
A[系统监控] --> B[基础设施监控]
|
|||
|
|
A --> C[应用监控]
|
|||
|
|
A --> D[业务监控]
|
|||
|
|
|
|||
|
|
B --> B1[服务器资源]
|
|||
|
|
B --> B2[网络状态]
|
|||
|
|
B --> B3[存储空间]
|
|||
|
|
|
|||
|
|
C --> C1[应用性能]
|
|||
|
|
C --> C2[接口响应]
|
|||
|
|
C --> C3[错误率]
|
|||
|
|
|
|||
|
|
D --> D1[用户行为]
|
|||
|
|
D --> D2[业务指标]
|
|||
|
|
D --> D3[收入数据]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 2.2 Prometheus监控配置
|
|||
|
|
|
|||
|
|
```yaml
|
|||
|
|
# prometheus.yml
|
|||
|
|
global:
|
|||
|
|
scrape_interval: 15s
|
|||
|
|
evaluation_interval: 15s
|
|||
|
|
|
|||
|
|
rule_files:
|
|||
|
|
- "alert_rules.yml"
|
|||
|
|
|
|||
|
|
alerting:
|
|||
|
|
alertmanagers:
|
|||
|
|
- static_configs:
|
|||
|
|
- targets:
|
|||
|
|
- alertmanager:9093
|
|||
|
|
|
|||
|
|
scrape_configs:
|
|||
|
|
# 系统监控
|
|||
|
|
- job_name: 'node-exporter'
|
|||
|
|
static_configs:
|
|||
|
|
- targets: ['localhost:9100']
|
|||
|
|
scrape_interval: 5s
|
|||
|
|
|
|||
|
|
# 应用监控
|
|||
|
|
- job_name: 'jiebanke-api'
|
|||
|
|
static_configs:
|
|||
|
|
- targets: ['localhost:3000']
|
|||
|
|
metrics_path: '/metrics'
|
|||
|
|
scrape_interval: 10s
|
|||
|
|
|
|||
|
|
# 数据库监控
|
|||
|
|
- job_name: 'mysql-exporter'
|
|||
|
|
static_configs:
|
|||
|
|
- targets: ['localhost:9104']
|
|||
|
|
|
|||
|
|
# Redis监控
|
|||
|
|
- job_name: 'redis-exporter'
|
|||
|
|
static_configs:
|
|||
|
|
- targets: ['localhost:9121']
|
|||
|
|
|
|||
|
|
# Nginx监控
|
|||
|
|
- job_name: 'nginx-exporter'
|
|||
|
|
static_configs:
|
|||
|
|
- targets: ['localhost:9113']
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 2.3 告警规则配置
|
|||
|
|
|
|||
|
|
```yaml
|
|||
|
|
# alert_rules.yml
|
|||
|
|
groups:
|
|||
|
|
- name: system_alerts
|
|||
|
|
rules:
|
|||
|
|
# CPU使用率告警
|
|||
|
|
- alert: HighCPUUsage
|
|||
|
|
expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
|
|||
|
|
for: 2m
|
|||
|
|
labels:
|
|||
|
|
severity: warning
|
|||
|
|
annotations:
|
|||
|
|
summary: "CPU使用率过高"
|
|||
|
|
description: "实例 {{ $labels.instance }} CPU使用率超过80%,当前值: {{ $value }}%"
|
|||
|
|
|
|||
|
|
# 内存使用率告警
|
|||
|
|
- alert: HighMemoryUsage
|
|||
|
|
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
|
|||
|
|
for: 2m
|
|||
|
|
labels:
|
|||
|
|
severity: warning
|
|||
|
|
annotations:
|
|||
|
|
summary: "内存使用率过高"
|
|||
|
|
description: "实例 {{ $labels.instance }} 内存使用率超过85%,当前值: {{ $value }}%"
|
|||
|
|
|
|||
|
|
# 磁盘空间告警
|
|||
|
|
- alert: HighDiskUsage
|
|||
|
|
expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 90
|
|||
|
|
for: 1m
|
|||
|
|
labels:
|
|||
|
|
severity: critical
|
|||
|
|
annotations:
|
|||
|
|
summary: "磁盘空间不足"
|
|||
|
|
description: "实例 {{ $labels.instance }} 磁盘使用率超过90%,当前值: {{ $value }}%"
|
|||
|
|
|
|||
|
|
- name: application_alerts
|
|||
|
|
rules:
|
|||
|
|
# 接口响应时间告警
|
|||
|
|
- alert: HighResponseTime
|
|||
|
|
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
|
|||
|
|
for: 2m
|
|||
|
|
labels:
|
|||
|
|
severity: warning
|
|||
|
|
annotations:
|
|||
|
|
summary: "接口响应时间过长"
|
|||
|
|
description: "95%的请求响应时间超过1秒,当前值: {{ $value }}s"
|
|||
|
|
|
|||
|
|
# 错误率告警
|
|||
|
|
- alert: HighErrorRate
|
|||
|
|
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
|
|||
|
|
for: 1m
|
|||
|
|
labels:
|
|||
|
|
severity: critical
|
|||
|
|
annotations:
|
|||
|
|
summary: "错误率过高"
|
|||
|
|
description: "5xx错误率超过5%,当前值: {{ $value }}"
|
|||
|
|
|
|||
|
|
# 数据库连接数告警
|
|||
|
|
- alert: HighDatabaseConnections
|
|||
|
|
expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
|
|||
|
|
for: 2m
|
|||
|
|
labels:
|
|||
|
|
severity: warning
|
|||
|
|
annotations:
|
|||
|
|
summary: "数据库连接数过高"
|
|||
|
|
description: "数据库连接数超过最大连接数的80%"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 2.4 Grafana仪表板
|
|||
|
|
|
|||
|
|
```json
|
|||
|
|
{
|
|||
|
|
"dashboard": {
|
|||
|
|
"title": "解班客系统监控",
|
|||
|
|
"panels": [
|
|||
|
|
{
|
|||
|
|
"title": "CPU使用率",
|
|||
|
|
"type": "graph",
|
|||
|
|
"targets": [
|
|||
|
|
{
|
|||
|
|
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
|
|||
|
|
"legendFormat": "CPU使用率"
|
|||
|
|
}
|
|||
|
|
]
|
|||
|
|
},
|
|||
|
|
{
|
|||
|
|
"title": "内存使用率",
|
|||
|
|
"type": "graph",
|
|||
|
|
"targets": [
|
|||
|
|
{
|
|||
|
|
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
|
|||
|
|
"legendFormat": "内存使用率"
|
|||
|
|
}
|
|||
|
|
]
|
|||
|
|
},
|
|||
|
|
{
|
|||
|
|
"title": "接口QPS",
|
|||
|
|
"type": "graph",
|
|||
|
|
"targets": [
|
|||
|
|
{
|
|||
|
|
"expr": "rate(http_requests_total[1m])",
|
|||
|
|
"legendFormat": "{{ method }} {{ handler }}"
|
|||
|
|
}
|
|||
|
|
]
|
|||
|
|
},
|
|||
|
|
{
|
|||
|
|
"title": "接口响应时间",
|
|||
|
|
"type": "graph",
|
|||
|
|
"targets": [
|
|||
|
|
{
|
|||
|
|
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
|
|||
|
|
"legendFormat": "95th percentile"
|
|||
|
|
},
|
|||
|
|
{
|
|||
|
|
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
|
|||
|
|
"legendFormat": "50th percentile"
|
|||
|
|
}
|
|||
|
|
]
|
|||
|
|
}
|
|||
|
|
]
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 3. 日志管理
|
|||
|
|
|
|||
|
|
### 3.1 日志收集架构
|
|||
|
|
|
|||
|
|
```mermaid
|
|||
|
|
graph LR
|
|||
|
|
A[应用日志] --> B[Filebeat]
|
|||
|
|
C[Nginx日志] --> B
|
|||
|
|
D[系统日志] --> B
|
|||
|
|
|
|||
|
|
B --> E[Logstash]
|
|||
|
|
E --> F[Elasticsearch]
|
|||
|
|
F --> G[Kibana]
|
|||
|
|
|
|||
|
|
H[日志告警] --> I[ElastAlert]
|
|||
|
|
I --> J[钉钉/邮件]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 3.2 Logstash配置
|
|||
|
|
|
|||
|
|
```ruby
|
|||
|
|
# logstash.conf
|
|||
|
|
input {
|
|||
|
|
beats {
|
|||
|
|
port => 5044
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
filter {
|
|||
|
|
if [fields][log_type] == "nginx_access" {
|
|||
|
|
grok {
|
|||
|
|
match => {
|
|||
|
|
"message" => "%{NGINXACCESS}"
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
date {
|
|||
|
|
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
|
|||
|
|
}
|
|||
|
|
mutate {
|
|||
|
|
convert => { "response_time" => "float" }
|
|||
|
|
convert => { "bytes" => "integer" }
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
if [fields][log_type] == "application" {
|
|||
|
|
json {
|
|||
|
|
source => "message"
|
|||
|
|
}
|
|||
|
|
date {
|
|||
|
|
match => [ "timestamp", "ISO8601" ]
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
if [fields][log_type] == "error" {
|
|||
|
|
multiline {
|
|||
|
|
pattern => "^\d{4}-\d{2}-\d{2}"
|
|||
|
|
negate => true
|
|||
|
|
what => "previous"
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
output {
|
|||
|
|
elasticsearch {
|
|||
|
|
hosts => ["elasticsearch:9200"]
|
|||
|
|
index => "jiebanke-logs-%{+YYYY.MM.dd}"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
if [level] == "ERROR" {
|
|||
|
|
email {
|
|||
|
|
to => "ops@jiebanke.com"
|
|||
|
|
subject => "应用错误告警"
|
|||
|
|
body => "错误信息: %{message}"
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 3.3 日志分析脚本
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
#!/bin/bash
|
|||
|
|
# log-analysis.sh - 日志分析脚本
|
|||
|
|
|
|||
|
|
LOG_DIR="/opt/jiebanke/logs"
|
|||
|
|
DATE=$(date +%Y-%m-%d)
|
|||
|
|
|
|||
|
|
echo "=== 解班客日志分析报告 ($DATE) ==="
|
|||
|
|
|
|||
|
|
# 分析Nginx访问日志
|
|||
|
|
echo "1. 访问统计"
|
|||
|
|
echo "总访问量: $(wc -l < $LOG_DIR/nginx/access.log)"
|
|||
|
|
echo "独立IP数: $(awk '{print $1}' $LOG_DIR/nginx/access.log | sort -u | wc -l)"
|
|||
|
|
echo "状态码统计:"
|
|||
|
|
awk '{print $9}' $LOG_DIR/nginx/access.log | sort | uniq -c | sort -nr
|
|||
|
|
|
|||
|
|
# 分析响应时间
|
|||
|
|
echo -e "\n2. 响应时间分析"
|
|||
|
|
echo "平均响应时间: $(awk '{sum+=$10; count++} END {print sum/count}' $LOG_DIR/nginx/access.log)ms"
|
|||
|
|
echo "最慢的10个请求:"
|
|||
|
|
sort -k10 -nr $LOG_DIR/nginx/access.log | head -10 | awk '{print $7, $10"ms"}'
|
|||
|
|
|
|||
|
|
# 分析错误日志
|
|||
|
|
echo -e "\n3. 错误统计"
|
|||
|
|
echo "应用错误数: $(grep -c "ERROR" $LOG_DIR/api/app.log)"
|
|||
|
|
echo "数据库错误数: $(grep -c "database" $LOG_DIR/api/error.log)"
|
|||
|
|
|
|||
|
|
# 分析用户行为
|
|||
|
|
echo -e "\n4. 用户行为分析"
|
|||
|
|
echo "最受欢迎的API:"
|
|||
|
|
awk '$7 ~ /^\/api/ {print $7}' $LOG_DIR/nginx/access.log | sort | uniq -c | sort -nr | head -10
|
|||
|
|
|
|||
|
|
echo -e "\n5. 异常IP分析"
|
|||
|
|
echo "请求量最大的IP:"
|
|||
|
|
awk '{print $1}' $LOG_DIR/nginx/access.log | sort | uniq -c | sort -nr | head -10
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 4. 性能优化
|
|||
|
|
|
|||
|
|
### 4.1 数据库优化
|
|||
|
|
|
|||
|
|
```sql
|
|||
|
|
-- 慢查询分析
|
|||
|
|
SELECT
|
|||
|
|
query_time,
|
|||
|
|
lock_time,
|
|||
|
|
rows_sent,
|
|||
|
|
rows_examined,
|
|||
|
|
sql_text
|
|||
|
|
FROM mysql.slow_log
|
|||
|
|
WHERE start_time >= DATE_SUB(NOW(), INTERVAL 1 DAY)
|
|||
|
|
ORDER BY query_time DESC
|
|||
|
|
LIMIT 10;
|
|||
|
|
|
|||
|
|
-- 索引使用分析
|
|||
|
|
SELECT
|
|||
|
|
table_schema,
|
|||
|
|
table_name,
|
|||
|
|
index_name,
|
|||
|
|
cardinality,
|
|||
|
|
sub_part,
|
|||
|
|
packed,
|
|||
|
|
nullable,
|
|||
|
|
index_type
|
|||
|
|
FROM information_schema.statistics
|
|||
|
|
WHERE table_schema = 'jiebanke'
|
|||
|
|
ORDER BY cardinality DESC;
|
|||
|
|
|
|||
|
|
-- 表空间分析
|
|||
|
|
SELECT
|
|||
|
|
table_name,
|
|||
|
|
ROUND(((data_length + index_length) / 1024 / 1024), 2) AS 'Size (MB)',
|
|||
|
|
table_rows
|
|||
|
|
FROM information_schema.tables
|
|||
|
|
WHERE table_schema = 'jiebanke'
|
|||
|
|
ORDER BY (data_length + index_length) DESC;
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 4.2 缓存优化
|
|||
|
|
|
|||
|
|
```javascript
|
|||
|
|
// Redis缓存优化配置
|
|||
|
|
const redisConfig = {
|
|||
|
|
// 连接池配置
|
|||
|
|
pool: {
|
|||
|
|
min: 5,
|
|||
|
|
max: 20,
|
|||
|
|
acquireTimeoutMillis: 30000,
|
|||
|
|
createTimeoutMillis: 30000,
|
|||
|
|
destroyTimeoutMillis: 5000,
|
|||
|
|
idleTimeoutMillis: 30000,
|
|||
|
|
reapIntervalMillis: 1000,
|
|||
|
|
createRetryIntervalMillis: 200
|
|||
|
|
},
|
|||
|
|
|
|||
|
|
// 缓存策略
|
|||
|
|
cache: {
|
|||
|
|
// 用户信息缓存30分钟
|
|||
|
|
user: { ttl: 1800 },
|
|||
|
|
// 旅行信息缓存1小时
|
|||
|
|
trip: { ttl: 3600 },
|
|||
|
|
// 热门数据缓存6小时
|
|||
|
|
hot: { ttl: 21600 },
|
|||
|
|
// 配置信息缓存24小时
|
|||
|
|
config: { ttl: 86400 }
|
|||
|
|
}
|
|||
|
|
};
|
|||
|
|
|
|||
|
|
// 缓存预热脚本
|
|||
|
|
async function warmupCache() {
|
|||
|
|
console.log('开始缓存预热...');
|
|||
|
|
|
|||
|
|
// 预热热门旅行数据
|
|||
|
|
const hotTrips = await Trip.findAll({
|
|||
|
|
where: { status: 'active' },
|
|||
|
|
order: [['view_count', 'DESC']],
|
|||
|
|
limit: 100
|
|||
|
|
});
|
|||
|
|
|
|||
|
|
for (const trip of hotTrips) {
|
|||
|
|
await redis.setex(`trip:${trip.id}`, 3600, JSON.stringify(trip));
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// 预热系统配置
|
|||
|
|
const configs = await Config.findAll();
|
|||
|
|
for (const config of configs) {
|
|||
|
|
await redis.setex(`config:${config.key}`, 86400, config.value);
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
console.log('缓存预热完成');
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 4.3 应用性能优化
|
|||
|
|
|
|||
|
|
```javascript
|
|||
|
|
// 性能监控中间件
|
|||
|
|
const performanceMonitor = (req, res, next) => {
|
|||
|
|
const start = Date.now();
|
|||
|
|
|
|||
|
|
res.on('finish', () => {
|
|||
|
|
const duration = Date.now() - start;
|
|||
|
|
const memUsage = process.memoryUsage();
|
|||
|
|
|
|||
|
|
// 记录性能指标
|
|||
|
|
prometheus.httpRequestDuration
|
|||
|
|
.labels(req.method, req.route?.path || req.path, res.statusCode)
|
|||
|
|
.observe(duration / 1000);
|
|||
|
|
|
|||
|
|
prometheus.memoryUsage.set(memUsage.heapUsed);
|
|||
|
|
|
|||
|
|
// 慢请求告警
|
|||
|
|
if (duration > 1000) {
|
|||
|
|
logger.warn('Slow request detected', {
|
|||
|
|
method: req.method,
|
|||
|
|
path: req.path,
|
|||
|
|
duration,
|
|||
|
|
userAgent: req.get('User-Agent')
|
|||
|
|
});
|
|||
|
|
}
|
|||
|
|
});
|
|||
|
|
|
|||
|
|
next();
|
|||
|
|
};
|
|||
|
|
|
|||
|
|
// 数据库连接池优化
|
|||
|
|
const dbConfig = {
|
|||
|
|
pool: {
|
|||
|
|
max: 20,
|
|||
|
|
min: 5,
|
|||
|
|
acquire: 30000,
|
|||
|
|
idle: 10000
|
|||
|
|
},
|
|||
|
|
logging: (sql, timing) => {
|
|||
|
|
if (timing > 1000) {
|
|||
|
|
logger.warn('Slow query detected', { sql, timing });
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
};
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 5. 备份策略
|
|||
|
|
|
|||
|
|
### 5.1 备份计划
|
|||
|
|
|
|||
|
|
```mermaid
|
|||
|
|
gantt
|
|||
|
|
title 备份计划
|
|||
|
|
dateFormat HH:mm
|
|||
|
|
axisFormat %H:%M
|
|||
|
|
|
|||
|
|
section 数据库备份
|
|||
|
|
全量备份 :db-full, 02:00, 1h
|
|||
|
|
增量备份 :db-inc, 06:00, 30m
|
|||
|
|
增量备份 :db-inc2, 12:00, 30m
|
|||
|
|
增量备份 :db-inc3, 18:00, 30m
|
|||
|
|
|
|||
|
|
section 文件备份
|
|||
|
|
应用文件 :file-app, 03:00, 30m
|
|||
|
|
日志文件 :file-log, 04:00, 30m
|
|||
|
|
配置文件 :file-config, 05:00, 15m
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 5.2 自动化备份脚本
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
#!/bin/bash
|
|||
|
|
# auto-backup.sh - 自动化备份脚本
|
|||
|
|
|
|||
|
|
set -e
|
|||
|
|
|
|||
|
|
# 配置变量
|
|||
|
|
BACKUP_ROOT="/opt/backup"
|
|||
|
|
DATE=$(date +%Y%m%d_%H%M%S)
|
|||
|
|
RETENTION_DAYS=30
|
|||
|
|
DB_NAME="jiebanke"
|
|||
|
|
DB_USER="backup_user"
|
|||
|
|
DB_PASSWORD="backup_password"
|
|||
|
|
|
|||
|
|
# 创建备份目录
|
|||
|
|
mkdir -p "$BACKUP_ROOT"/{database,files,logs}
|
|||
|
|
|
|||
|
|
# 数据库备份函数
|
|||
|
|
backup_database() {
|
|||
|
|
echo "开始数据库备份..."
|
|||
|
|
|
|||
|
|
# 全量备份
|
|||
|
|
mysqldump -u"$DB_USER" -p"$DB_PASSWORD" \
|
|||
|
|
--single-transaction \
|
|||
|
|
--routines \
|
|||
|
|
--triggers \
|
|||
|
|
--events \
|
|||
|
|
--hex-blob \
|
|||
|
|
"$DB_NAME" | gzip > "$BACKUP_ROOT/database/full_${DATE}.sql.gz"
|
|||
|
|
|
|||
|
|
# 二进制日志备份
|
|||
|
|
mysql -u"$DB_USER" -p"$DB_PASSWORD" -e "FLUSH LOGS;"
|
|||
|
|
cp /var/lib/mysql/mysql-bin.* "$BACKUP_ROOT/database/" 2>/dev/null || true
|
|||
|
|
|
|||
|
|
echo "数据库备份完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 文件备份函数
|
|||
|
|
backup_files() {
|
|||
|
|
echo "开始文件备份..."
|
|||
|
|
|
|||
|
|
# 应用文件备份
|
|||
|
|
tar -czf "$BACKUP_ROOT/files/app_${DATE}.tar.gz" \
|
|||
|
|
-C /opt/jiebanke \
|
|||
|
|
--exclude='node_modules' \
|
|||
|
|
--exclude='logs' \
|
|||
|
|
--exclude='tmp' \
|
|||
|
|
.
|
|||
|
|
|
|||
|
|
# 配置文件备份
|
|||
|
|
tar -czf "$BACKUP_ROOT/files/config_${DATE}.tar.gz" \
|
|||
|
|
/etc/nginx \
|
|||
|
|
/etc/mysql \
|
|||
|
|
/etc/redis
|
|||
|
|
|
|||
|
|
echo "文件备份完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 日志备份函数
|
|||
|
|
backup_logs() {
|
|||
|
|
echo "开始日志备份..."
|
|||
|
|
|
|||
|
|
# 压缩昨天的日志
|
|||
|
|
find /opt/jiebanke/logs -name "*.log" -mtime +1 -exec gzip {} \;
|
|||
|
|
|
|||
|
|
# 备份日志文件
|
|||
|
|
tar -czf "$BACKUP_ROOT/logs/logs_${DATE}.tar.gz" \
|
|||
|
|
/opt/jiebanke/logs \
|
|||
|
|
/var/log/nginx \
|
|||
|
|
/var/log/mysql
|
|||
|
|
|
|||
|
|
echo "日志备份完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 清理过期备份
|
|||
|
|
cleanup_old_backups() {
|
|||
|
|
echo "清理过期备份..."
|
|||
|
|
|
|||
|
|
find "$BACKUP_ROOT" -type f -mtime +$RETENTION_DAYS -delete
|
|||
|
|
|
|||
|
|
echo "过期备份清理完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 备份验证
|
|||
|
|
verify_backup() {
|
|||
|
|
echo "验证备份完整性..."
|
|||
|
|
|
|||
|
|
# 验证数据库备份
|
|||
|
|
if [ -f "$BACKUP_ROOT/database/full_${DATE}.sql.gz" ]; then
|
|||
|
|
gunzip -t "$BACKUP_ROOT/database/full_${DATE}.sql.gz"
|
|||
|
|
echo "数据库备份验证通过"
|
|||
|
|
else
|
|||
|
|
echo "数据库备份验证失败"
|
|||
|
|
exit 1
|
|||
|
|
fi
|
|||
|
|
|
|||
|
|
# 验证文件备份
|
|||
|
|
if [ -f "$BACKUP_ROOT/files/app_${DATE}.tar.gz" ]; then
|
|||
|
|
tar -tzf "$BACKUP_ROOT/files/app_${DATE}.tar.gz" > /dev/null
|
|||
|
|
echo "文件备份验证通过"
|
|||
|
|
else
|
|||
|
|
echo "文件备份验证失败"
|
|||
|
|
exit 1
|
|||
|
|
fi
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 发送备份报告
|
|||
|
|
send_backup_report() {
|
|||
|
|
local status=$1
|
|||
|
|
local message="备份任务 $status - $(date)"
|
|||
|
|
|
|||
|
|
# 发送钉钉通知
|
|||
|
|
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
|
|||
|
|
-H "Content-Type: application/json" \
|
|||
|
|
-d "{
|
|||
|
|
\"msgtype\": \"text\",
|
|||
|
|
\"text\": {
|
|||
|
|
\"content\": \"$message\"
|
|||
|
|
}
|
|||
|
|
}"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 主执行流程
|
|||
|
|
main() {
|
|||
|
|
echo "开始执行备份任务 - $(date)"
|
|||
|
|
|
|||
|
|
trap 'send_backup_report "失败"' ERR
|
|||
|
|
|
|||
|
|
backup_database
|
|||
|
|
backup_files
|
|||
|
|
backup_logs
|
|||
|
|
verify_backup
|
|||
|
|
cleanup_old_backups
|
|||
|
|
|
|||
|
|
send_backup_report "成功"
|
|||
|
|
echo "备份任务完成 - $(date)"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 执行主函数
|
|||
|
|
main "$@"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 5.3 备份恢复流程
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
#!/bin/bash
|
|||
|
|
# restore.sh - 数据恢复脚本
|
|||
|
|
|
|||
|
|
set -e
|
|||
|
|
|
|||
|
|
BACKUP_ROOT="/opt/backup"
|
|||
|
|
DB_NAME="jiebanke"
|
|||
|
|
DB_USER="root"
|
|||
|
|
DB_PASSWORD="your_password"
|
|||
|
|
|
|||
|
|
# 数据库恢复函数
|
|||
|
|
restore_database() {
|
|||
|
|
local backup_file=$1
|
|||
|
|
|
|||
|
|
echo "开始恢复数据库..."
|
|||
|
|
|
|||
|
|
# 停止应用服务
|
|||
|
|
pm2 stop all
|
|||
|
|
|
|||
|
|
# 创建恢复数据库
|
|||
|
|
mysql -u"$DB_USER" -p"$DB_PASSWORD" -e "DROP DATABASE IF EXISTS ${DB_NAME}_restore;"
|
|||
|
|
mysql -u"$DB_USER" -p"$DB_PASSWORD" -e "CREATE DATABASE ${DB_NAME}_restore;"
|
|||
|
|
|
|||
|
|
# 恢复数据
|
|||
|
|
gunzip -c "$backup_file" | mysql -u"$DB_USER" -p"$DB_PASSWORD" "${DB_NAME}_restore"
|
|||
|
|
|
|||
|
|
# 验证数据完整性
|
|||
|
|
local table_count=$(mysql -u"$DB_USER" -p"$DB_PASSWORD" -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='${DB_NAME}_restore';" -N)
|
|||
|
|
|
|||
|
|
if [ "$table_count" -gt 0 ]; then
|
|||
|
|
echo "数据库恢复成功,表数量: $table_count"
|
|||
|
|
|
|||
|
|
# 切换数据库
|
|||
|
|
mysql -u"$DB_USER" -p"$DB_PASSWORD" -e "RENAME TABLE ${DB_NAME} TO ${DB_NAME}_backup_$(date +%Y%m%d_%H%M%S);"
|
|||
|
|
mysql -u"$DB_USER" -p"$DB_PASSWORD" -e "RENAME TABLE ${DB_NAME}_restore TO ${DB_NAME};"
|
|||
|
|
|
|||
|
|
# 重启应用服务
|
|||
|
|
pm2 start all
|
|||
|
|
|
|||
|
|
echo "数据库恢复完成"
|
|||
|
|
else
|
|||
|
|
echo "数据库恢复失败"
|
|||
|
|
exit 1
|
|||
|
|
fi
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 文件恢复函数
|
|||
|
|
restore_files() {
|
|||
|
|
local backup_file=$1
|
|||
|
|
local restore_path="/opt/jiebanke_restore"
|
|||
|
|
|
|||
|
|
echo "开始恢复文件..."
|
|||
|
|
|
|||
|
|
# 创建恢复目录
|
|||
|
|
mkdir -p "$restore_path"
|
|||
|
|
|
|||
|
|
# 解压文件
|
|||
|
|
tar -xzf "$backup_file" -C "$restore_path"
|
|||
|
|
|
|||
|
|
echo "文件恢复到: $restore_path"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 使用示例
|
|||
|
|
case "$1" in
|
|||
|
|
database)
|
|||
|
|
if [ -z "$2" ]; then
|
|||
|
|
echo "请指定备份文件路径"
|
|||
|
|
echo "用法: $0 database /path/to/backup.sql.gz"
|
|||
|
|
exit 1
|
|||
|
|
fi
|
|||
|
|
restore_database "$2"
|
|||
|
|
;;
|
|||
|
|
files)
|
|||
|
|
if [ -z "$2" ]; then
|
|||
|
|
echo "请指定备份文件路径"
|
|||
|
|
echo "用法: $0 files /path/to/backup.tar.gz"
|
|||
|
|
exit 1
|
|||
|
|
fi
|
|||
|
|
restore_files "$2"
|
|||
|
|
;;
|
|||
|
|
*)
|
|||
|
|
echo "用法: $0 {database|files} backup_file"
|
|||
|
|
exit 1
|
|||
|
|
;;
|
|||
|
|
esac
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 6. 故障处理
|
|||
|
|
|
|||
|
|
### 6.1 故障响应流程
|
|||
|
|
|
|||
|
|
```mermaid
|
|||
|
|
graph TD
|
|||
|
|
A[故障告警] --> B[故障确认]
|
|||
|
|
B --> C[影响评估]
|
|||
|
|
C --> D[应急响应]
|
|||
|
|
D --> E[问题定位]
|
|||
|
|
E --> F[故障修复]
|
|||
|
|
F --> G[服务恢复]
|
|||
|
|
G --> H[故障总结]
|
|||
|
|
|
|||
|
|
D --> D1[服务降级]
|
|||
|
|
D --> D2[流量切换]
|
|||
|
|
D --> D3[紧急通知]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 6.2 常见故障处理手册
|
|||
|
|
|
|||
|
|
| 故障类型 | 症状 | 处理步骤 | 预计恢复时间 |
|
|||
|
|
|---------|------|----------|-------------|
|
|||
|
|
| 服务无响应 | 接口超时、连接失败 | 1.检查进程状态 2.重启服务 3.检查日志 | 5-10分钟 |
|
|||
|
|
| 数据库连接失败 | 数据库错误、连接超时 | 1.检查数据库状态 2.检查连接池 3.重启数据库 | 10-15分钟 |
|
|||
|
|
| 内存不足 | OOM错误、服务崩溃 | 1.释放内存 2.重启服务 3.扩容内存 | 15-30分钟 |
|
|||
|
|
| 磁盘空间不足 | 写入失败、日志错误 | 1.清理日志 2.清理临时文件 3.扩容磁盘 | 10-20分钟 |
|
|||
|
|
| 网络故障 | 连接超时、丢包 | 1.检查网络连通性 2.重启网络服务 3.联系网络运营商 | 30-60分钟 |
|
|||
|
|
|
|||
|
|
### 6.3 故障处理脚本
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
#!/bin/bash
|
|||
|
|
# emergency-fix.sh - 紧急故障处理脚本
|
|||
|
|
|
|||
|
|
# 服务健康检查
|
|||
|
|
check_service_health() {
|
|||
|
|
echo "检查服务健康状态..."
|
|||
|
|
|
|||
|
|
# 检查API服务
|
|||
|
|
if ! curl -f http://localhost:3000/health > /dev/null 2>&1; then
|
|||
|
|
echo "API服务异常,尝试重启..."
|
|||
|
|
pm2 restart jiebanke-api
|
|||
|
|
sleep 10
|
|||
|
|
|
|||
|
|
if curl -f http://localhost:3000/health > /dev/null 2>&1; then
|
|||
|
|
echo "API服务重启成功"
|
|||
|
|
else
|
|||
|
|
echo "API服务重启失败,需要人工介入"
|
|||
|
|
return 1
|
|||
|
|
fi
|
|||
|
|
fi
|
|||
|
|
|
|||
|
|
# 检查数据库连接
|
|||
|
|
if ! mysql -u root -p"$DB_PASSWORD" -e "SELECT 1" > /dev/null 2>&1; then
|
|||
|
|
echo "数据库连接异常"
|
|||
|
|
return 1
|
|||
|
|
fi
|
|||
|
|
|
|||
|
|
# 检查Redis连接
|
|||
|
|
if ! redis-cli ping > /dev/null 2>&1; then
|
|||
|
|
echo "Redis连接异常,尝试重启..."
|
|||
|
|
systemctl restart redis
|
|||
|
|
sleep 5
|
|||
|
|
fi
|
|||
|
|
|
|||
|
|
echo "服务健康检查完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 清理系统资源
|
|||
|
|
cleanup_system() {
|
|||
|
|
echo "清理系统资源..."
|
|||
|
|
|
|||
|
|
# 清理日志文件
|
|||
|
|
find /opt/jiebanke/logs -name "*.log" -size +100M -exec truncate -s 50M {} \;
|
|||
|
|
|
|||
|
|
# 清理临时文件
|
|||
|
|
find /tmp -name "jiebanke-*" -mtime +1 -delete
|
|||
|
|
|
|||
|
|
# 清理缓存
|
|||
|
|
echo "FLUSHDB" | redis-cli
|
|||
|
|
|
|||
|
|
# 重启服务释放内存
|
|||
|
|
pm2 restart all
|
|||
|
|
|
|||
|
|
echo "系统资源清理完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 故障恢复
|
|||
|
|
emergency_recovery() {
|
|||
|
|
local issue_type=$1
|
|||
|
|
|
|||
|
|
case "$issue_type" in
|
|||
|
|
"high_cpu")
|
|||
|
|
echo "处理CPU使用率过高..."
|
|||
|
|
# 重启服务
|
|||
|
|
pm2 restart all
|
|||
|
|
;;
|
|||
|
|
"high_memory")
|
|||
|
|
echo "处理内存使用率过高..."
|
|||
|
|
cleanup_system
|
|||
|
|
;;
|
|||
|
|
"disk_full")
|
|||
|
|
echo "处理磁盘空间不足..."
|
|||
|
|
# 清理日志
|
|||
|
|
find /var/log -name "*.log" -mtime +7 -delete
|
|||
|
|
find /opt/jiebanke/logs -name "*.log.gz" -mtime +30 -delete
|
|||
|
|
;;
|
|||
|
|
"service_down")
|
|||
|
|
echo "处理服务宕机..."
|
|||
|
|
check_service_health
|
|||
|
|
;;
|
|||
|
|
*)
|
|||
|
|
echo "未知故障类型: $issue_type"
|
|||
|
|
return 1
|
|||
|
|
;;
|
|||
|
|
esac
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 发送故障通知
|
|||
|
|
send_alert() {
|
|||
|
|
local message=$1
|
|||
|
|
local level=$2
|
|||
|
|
|
|||
|
|
# 发送钉钉告警
|
|||
|
|
curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
|
|||
|
|
-H "Content-Type: application/json" \
|
|||
|
|
-d "{
|
|||
|
|
\"msgtype\": \"text\",
|
|||
|
|
\"text\": {
|
|||
|
|
\"content\": \"【$level】$message\"
|
|||
|
|
},
|
|||
|
|
\"at\": {
|
|||
|
|
\"isAtAll\": true
|
|||
|
|
}
|
|||
|
|
}"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 主函数
|
|||
|
|
main() {
|
|||
|
|
local action=$1
|
|||
|
|
|
|||
|
|
case "$action" in
|
|||
|
|
"check")
|
|||
|
|
check_service_health
|
|||
|
|
;;
|
|||
|
|
"cleanup")
|
|||
|
|
cleanup_system
|
|||
|
|
;;
|
|||
|
|
"recover")
|
|||
|
|
emergency_recovery "$2"
|
|||
|
|
;;
|
|||
|
|
"alert")
|
|||
|
|
send_alert "$2" "$3"
|
|||
|
|
;;
|
|||
|
|
*)
|
|||
|
|
echo "用法: $0 {check|cleanup|recover|alert}"
|
|||
|
|
echo " check - 检查服务健康状态"
|
|||
|
|
echo " cleanup - 清理系统资源"
|
|||
|
|
echo " recover <type> - 故障恢复"
|
|||
|
|
echo " alert <message> <level> - 发送告警"
|
|||
|
|
exit 1
|
|||
|
|
;;
|
|||
|
|
esac
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
main "$@"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 7. 安全运维
|
|||
|
|
|
|||
|
|
### 7.1 安全检查清单
|
|||
|
|
|
|||
|
|
- [ ] 系统补丁更新
|
|||
|
|
- [ ] 防火墙规则检查
|
|||
|
|
- [ ] SSL证书有效性
|
|||
|
|
- [ ] 密码策略检查
|
|||
|
|
- [ ] 访问日志审计
|
|||
|
|
- [ ] 漏洞扫描
|
|||
|
|
- [ ] 备份数据加密
|
|||
|
|
- [ ] 网络安全监控
|
|||
|
|
|
|||
|
|
### 7.2 安全加固脚本
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
#!/bin/bash
|
|||
|
|
# security-hardening.sh - 安全加固脚本
|
|||
|
|
|
|||
|
|
# 系统安全加固
|
|||
|
|
system_hardening() {
|
|||
|
|
echo "开始系统安全加固..."
|
|||
|
|
|
|||
|
|
# 禁用不必要的服务
|
|||
|
|
systemctl disable telnet
|
|||
|
|
systemctl disable rsh
|
|||
|
|
systemctl disable rlogin
|
|||
|
|
|
|||
|
|
# 设置文件权限
|
|||
|
|
chmod 600 /etc/shadow
|
|||
|
|
chmod 600 /etc/gshadow
|
|||
|
|
chmod 644 /etc/passwd
|
|||
|
|
chmod 644 /etc/group
|
|||
|
|
|
|||
|
|
# 配置防火墙
|
|||
|
|
firewall-cmd --permanent --add-service=http
|
|||
|
|
firewall-cmd --permanent --add-service=https
|
|||
|
|
firewall-cmd --permanent --add-port=3000/tcp
|
|||
|
|
firewall-cmd --reload
|
|||
|
|
|
|||
|
|
echo "系统安全加固完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 应用安全检查
|
|||
|
|
application_security_check() {
|
|||
|
|
echo "开始应用安全检查..."
|
|||
|
|
|
|||
|
|
# 检查敏感文件权限
|
|||
|
|
find /opt/jiebanke -name "*.key" -exec chmod 600 {} \;
|
|||
|
|
find /opt/jiebanke -name "*.pem" -exec chmod 600 {} \;
|
|||
|
|
|
|||
|
|
# 检查配置文件中的敏感信息
|
|||
|
|
grep -r "password\|secret\|key" /opt/jiebanke/config/ || true
|
|||
|
|
|
|||
|
|
# 检查开放端口
|
|||
|
|
netstat -tlnp | grep LISTEN
|
|||
|
|
|
|||
|
|
echo "应用安全检查完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 日志安全审计
|
|||
|
|
security_audit() {
|
|||
|
|
echo "开始安全审计..."
|
|||
|
|
|
|||
|
|
# 检查登录失败记录
|
|||
|
|
grep "Failed password" /var/log/secure | tail -20
|
|||
|
|
|
|||
|
|
# 检查异常访问
|
|||
|
|
awk '$9 ~ /4[0-9][0-9]|5[0-9][0-9]/ {print $1, $7, $9}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -20
|
|||
|
|
|
|||
|
|
# 检查SQL注入尝试
|
|||
|
|
grep -i "union\|select\|drop\|insert" /var/log/nginx/access.log | head -10
|
|||
|
|
|
|||
|
|
echo "安全审计完成"
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
main() {
|
|||
|
|
case "$1" in
|
|||
|
|
"harden")
|
|||
|
|
system_hardening
|
|||
|
|
application_security_check
|
|||
|
|
;;
|
|||
|
|
"audit")
|
|||
|
|
security_audit
|
|||
|
|
;;
|
|||
|
|
"all")
|
|||
|
|
system_hardening
|
|||
|
|
application_security_check
|
|||
|
|
security_audit
|
|||
|
|
;;
|
|||
|
|
*)
|
|||
|
|
echo "用法: $0 {harden|audit|all}"
|
|||
|
|
exit 1
|
|||
|
|
;;
|
|||
|
|
esac
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
main "$@"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 8. 容量规划
|
|||
|
|
|
|||
|
|
### 8.1 资源使用趋势分析
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
#!/usr/bin/env python3
|
|||
|
|
# capacity-planning.py - 容量规划分析
|
|||
|
|
|
|||
|
|
import pandas as pd
|
|||
|
|
import numpy as np
|
|||
|
|
from datetime import datetime, timedelta
|
|||
|
|
import matplotlib.pyplot as plt
|
|||
|
|
|
|||
|
|
class CapacityPlanner:
|
|||
|
|
def __init__(self):
|
|||
|
|
self.metrics_data = {}
|
|||
|
|
|
|||
|
|
def load_metrics(self, metric_type, days=30):
|
|||
|
|
"""加载指标数据"""
|
|||
|
|
# 这里应该从Prometheus或其他监控系统获取数据
|
|||
|
|
# 示例数据生成
|
|||
|
|
dates = pd.date_range(end=datetime.now(), periods=days*24, freq='H')
|
|||
|
|
|
|||
|
|
if metric_type == 'cpu':
|
|||
|
|
data = np.random.normal(45, 15, len(dates))
|
|||
|
|
elif metric_type == 'memory':
|
|||
|
|
data = np.random.normal(60, 20, len(dates))
|
|||
|
|
elif metric_type == 'disk':
|
|||
|
|
data = np.random.normal(30, 10, len(dates))
|
|||
|
|
else:
|
|||
|
|
data = np.random.normal(50, 10, len(dates))
|
|||
|
|
|
|||
|
|
return pd.DataFrame({'timestamp': dates, 'value': data})
|
|||
|
|
|
|||
|
|
def predict_usage(self, metric_type, days_ahead=30):
|
|||
|
|
"""预测资源使用趋势"""
|
|||
|
|
df = self.load_metrics(metric_type)
|
|||
|
|
|
|||
|
|
# 简单线性回归预测
|
|||
|
|
x = np.arange(len(df))
|
|||
|
|
y = df['value'].values
|
|||
|
|
|
|||
|
|
# 计算趋势
|
|||
|
|
z = np.polyfit(x, y, 1)
|
|||
|
|
trend = np.poly1d(z)
|
|||
|
|
|
|||
|
|
# 预测未来使用量
|
|||
|
|
future_x = np.arange(len(df), len(df) + days_ahead*24)
|
|||
|
|
future_values = trend(future_x)
|
|||
|
|
|
|||
|
|
return {
|
|||
|
|
'current_avg': np.mean(y[-7*24:]), # 最近7天平均值
|
|||
|
|
'predicted_avg': np.mean(future_values),
|
|||
|
|
'trend_slope': z[0],
|
|||
|
|
'max_predicted': np.max(future_values)
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
def generate_report(self):
|
|||
|
|
"""生成容量规划报告"""
|
|||
|
|
metrics = ['cpu', 'memory', 'disk']
|
|||
|
|
report = {}
|
|||
|
|
|
|||
|
|
for metric in metrics:
|
|||
|
|
prediction = self.predict_usage(metric)
|
|||
|
|
report[metric] = prediction
|
|||
|
|
|
|||
|
|
return report
|
|||
|
|
|
|||
|
|
def check_capacity_alerts(self, report):
|
|||
|
|
"""检查容量告警"""
|
|||
|
|
alerts = []
|
|||
|
|
|
|||
|
|
for metric, data in report.items():
|
|||
|
|
if data['max_predicted'] > 80:
|
|||
|
|
alerts.append(f"{metric}使用率预计将超过80%,当前趋势: {data['trend_slope']:.2f}%/小时")
|
|||
|
|
elif data['max_predicted'] > 70:
|
|||
|
|
alerts.append(f"{metric}使用率预计将超过70%,建议关注")
|
|||
|
|
|
|||
|
|
return alerts
|
|||
|
|
|
|||
|
|
if __name__ == "__main__":
|
|||
|
|
planner = CapacityPlanner()
|
|||
|
|
report = planner.generate_report()
|
|||
|
|
alerts = planner.check_capacity_alerts(report)
|
|||
|
|
|
|||
|
|
print("=== 容量规划报告 ===")
|
|||
|
|
for metric, data in report.items():
|
|||
|
|
print(f"{metric.upper()}:")
|
|||
|
|
print(f" 当前平均使用率: {data['current_avg']:.1f}%")
|
|||
|
|
print(f" 预测平均使用率: {data['predicted_avg']:.1f}%")
|
|||
|
|
print(f" 预测最大使用率: {data['max_predicted']:.1f}%")
|
|||
|
|
print(f" 增长趋势: {data['trend_slope']:.2f}%/小时")
|
|||
|
|
print()
|
|||
|
|
|
|||
|
|
if alerts:
|
|||
|
|
print("=== 容量告警 ===")
|
|||
|
|
for alert in alerts:
|
|||
|
|
print(f"⚠️ {alert}")
|
|||
|
|
else:
|
|||
|
|
print("✅ 当前容量充足,无需扩容")
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 9. 运维自动化
|
|||
|
|
|
|||
|
|
### 9.1 自动化运维流程
|
|||
|
|
|
|||
|
|
```yaml
|
|||
|
|
# .github/workflows/ops-automation.yml
|
|||
|
|
name: 运维自动化
|
|||
|
|
|
|||
|
|
on:
|
|||
|
|
schedule:
|
|||
|
|
- cron: '0 2 * * *' # 每天凌晨2点执行
|
|||
|
|
workflow_dispatch:
|
|||
|
|
|
|||
|
|
jobs:
|
|||
|
|
daily-maintenance:
|
|||
|
|
runs-on: ubuntu-latest
|
|||
|
|
steps:
|
|||
|
|
- name: 系统健康检查
|
|||
|
|
run: |
|
|||
|
|
curl -f ${{ secrets.HEALTH_CHECK_URL }} || exit 1
|
|||
|
|
|
|||
|
|
- name: 清理日志文件
|
|||
|
|
run: |
|
|||
|
|
ssh ${{ secrets.SERVER_HOST }} "find /opt/jiebanke/logs -name '*.log' -size +100M -exec truncate -s 50M {} \;"
|
|||
|
|
|
|||
|
|
- name: 数据库优化
|
|||
|
|
run: |
|
|||
|
|
ssh ${{ secrets.SERVER_HOST }} "mysql -u root -p${{ secrets.DB_PASSWORD }} -e 'OPTIMIZE TABLE jiebanke.users, jiebanke.trips;'"
|
|||
|
|
|
|||
|
|
- name: 缓存预热
|
|||
|
|
run: |
|
|||
|
|
curl -X POST ${{ secrets.API_URL }}/admin/cache/warmup
|
|||
|
|
|
|||
|
|
- name: 发送运维报告
|
|||
|
|
run: |
|
|||
|
|
curl -X POST ${{ secrets.DINGTALK_WEBHOOK }} \
|
|||
|
|
-H "Content-Type: application/json" \
|
|||
|
|
-d '{"msgtype": "text", "text": {"content": "日常运维任务执行完成"}}'
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 9.2 运维工具集
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
#!/bin/bash
|
|||
|
|
# ops-toolkit.sh - 运维工具集
|
|||
|
|
|
|||
|
|
# 快速诊断工具
|
|||
|
|
quick_diagnosis() {
|
|||
|
|
echo "=== 系统快速诊断 ==="
|
|||
|
|
|
|||
|
|
# 系统负载
|
|||
|
|
echo "系统负载:"
|
|||
|
|
uptime
|
|||
|
|
|
|||
|
|
# 内存使用
|
|||
|
|
echo -e "\n内存使用:"
|
|||
|
|
free -h
|
|||
|
|
|
|||
|
|
# 磁盘使用
|
|||
|
|
echo -e "\n磁盘使用:"
|
|||
|
|
df -h
|
|||
|
|
|
|||
|
|
# 网络连接
|
|||
|
|
echo -e "\n网络连接:"
|
|||
|
|
netstat -an | grep :3000
|
|||
|
|
|
|||
|
|
# 进程状态
|
|||
|
|
echo -e "\n进程状态:"
|
|||
|
|
pm2 status
|
|||
|
|
|
|||
|
|
# 数据库状态
|
|||
|
|
echo -e "\n数据库状态:"
|
|||
|
|
mysql -u root -p"$DB_PASSWORD" -e "SHOW PROCESSLIST;" | head -10
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 性能分析工具
|
|||
|
|
performance_analysis() {
|
|||
|
|
echo "=== 性能分析 ==="
|
|||
|
|
|
|||
|
|
# CPU使用率Top10
|
|||
|
|
echo "CPU使用率Top10:"
|
|||
|
|
ps aux --sort=-%cpu | head -11
|
|||
|
|
|
|||
|
|
# 内存使用率Top10
|
|||
|
|
echo -e "\n内存使用率Top10:"
|
|||
|
|
ps aux --sort=-%mem | head -11
|
|||
|
|
|
|||
|
|
# IO等待
|
|||
|
|
echo -e "\nIO等待:"
|
|||
|
|
iostat -x 1 3
|
|||
|
|
|
|||
|
|
# 网络流量
|
|||
|
|
echo -e "\n网络流量:"
|
|||
|
|
iftop -t -s 10
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 日志分析工具
|
|||
|
|
log_analysis() {
|
|||
|
|
local log_type=$1
|
|||
|
|
local lines=${2:-100}
|
|||
|
|
|
|||
|
|
case "$log_type" in
|
|||
|
|
"error")
|
|||
|
|
echo "最近错误日志:"
|
|||
|
|
tail -n $lines /opt/jiebanke/logs/error.log
|
|||
|
|
;;
|
|||
|
|
"access")
|
|||
|
|
echo "访问日志统计:"
|
|||
|
|
tail -n $lines /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
|
|||
|
|
;;
|
|||
|
|
"slow")
|
|||
|
|
echo "慢查询日志:"
|
|||
|
|
tail -n $lines /var/log/mysql/slow.log
|
|||
|
|
;;
|
|||
|
|
*)
|
|||
|
|
echo "支持的日志类型: error, access, slow"
|
|||
|
|
;;
|
|||
|
|
esac
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 服务管理工具
|
|||
|
|
service_management() {
|
|||
|
|
local action=$1
|
|||
|
|
local service=$2
|
|||
|
|
|
|||
|
|
case "$action" in
|
|||
|
|
"restart")
|
|||
|
|
echo "重启服务: $service"
|
|||
|
|
if [ "$service" = "api" ]; then
|
|||
|
|
pm2 restart jiebanke-api
|
|||
|
|
elif [ "$service" = "nginx" ]; then
|
|||
|
|
systemctl restart nginx
|
|||
|
|
elif [ "$service" = "mysql" ]; then
|
|||
|
|
systemctl restart mysql
|
|||
|
|
elif [ "$service" = "redis" ]; then
|
|||
|
|
systemctl restart redis
|
|||
|
|
else
|
|||
|
|
echo "未知服务: $service"
|
|||
|
|
fi
|
|||
|
|
;;
|
|||
|
|
"status")
|
|||
|
|
echo "服务状态: $service"
|
|||
|
|
if [ "$service" = "api" ]; then
|
|||
|
|
pm2 status jiebanke-api
|
|||
|
|
else
|
|||
|
|
systemctl status $service
|
|||
|
|
fi
|
|||
|
|
;;
|
|||
|
|
*)
|
|||
|
|
echo "支持的操作: restart, status"
|
|||
|
|
;;
|
|||
|
|
esac
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 主菜单
|
|||
|
|
show_menu() {
|
|||
|
|
echo "=== 解班客运维工具集 ==="
|
|||
|
|
echo "1. 快速诊断"
|
|||
|
|
echo "2. 性能分析"
|
|||
|
|
echo "3. 日志分析"
|
|||
|
|
echo "4. 服务管理"
|
|||
|
|
echo "5. 退出"
|
|||
|
|
echo -n "请选择操作: "
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
# 主函数
|
|||
|
|
main() {
|
|||
|
|
if [ $# -eq 0 ]; then
|
|||
|
|
# 交互模式
|
|||
|
|
while true; do
|
|||
|
|
show_menu
|
|||
|
|
read choice
|
|||
|
|
case $choice in
|
|||
|
|
1) quick_diagnosis ;;
|
|||
|
|
2) performance_analysis ;;
|
|||
|
|
3)
|
|||
|
|
echo -n "日志类型 (error/access/slow): "
|
|||
|
|
read log_type
|
|||
|
|
log_analysis $log_type
|
|||
|
|
;;
|
|||
|
|
4)
|
|||
|
|
echo -n "操作 (restart/status): "
|
|||
|
|
read action
|
|||
|
|
echo -n "服务名: "
|
|||
|
|
read service
|
|||
|
|
service_management $action $service
|
|||
|
|
;;
|
|||
|
|
5) exit 0 ;;
|
|||
|
|
*) echo "无效选择" ;;
|
|||
|
|
esac
|
|||
|
|
echo -e "\n按回车键继续..."
|
|||
|
|
read
|
|||
|
|
done
|
|||
|
|
else
|
|||
|
|
# 命令行模式
|
|||
|
|
case "$1" in
|
|||
|
|
"diagnosis") quick_diagnosis ;;
|
|||
|
|
"performance") performance_analysis ;;
|
|||
|
|
"log") log_analysis "$2" "$3" ;;
|
|||
|
|
"service") service_management "$2" "$3" ;;
|
|||
|
|
*)
|
|||
|
|
echo "用法: $0 [diagnosis|performance|log|service]"
|
|||
|
|
exit 1
|
|||
|
|
;;
|
|||
|
|
esac
|
|||
|
|
fi
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
main "$@"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## 10. 总结
|
|||
|
|
|
|||
|
|
本运维文档涵盖了解班客项目的全面运维管理,包括:
|
|||
|
|
|
|||
|
|
### 10.1 核心运维能力
|
|||
|
|
- **监控告警**:全方位系统监控和智能告警
|
|||
|
|
- **日志管理**:集中化日志收集和分析
|
|||
|
|
- **性能优化**:数据库、缓存、应用性能调优
|
|||
|
|
- **备份恢复**:自动化备份和快速恢复机制
|
|||
|
|
- **故障处理**:标准化故障响应和处理流程
|
|||
|
|
|
|||
|
|
### 10.2 自动化程度
|
|||
|
|
- **监控自动化**:自动监控、告警、报告
|
|||
|
|
- **备份自动化**:定时备份、验证、清理
|
|||
|
|
- **部署自动化**:CI/CD集成、自动部署
|
|||
|
|
- **运维自动化**:日常维护任务自动化
|
|||
|
|
|
|||
|
|
### 10.3 运维最佳实践
|
|||
|
|
- **预防为主**:通过监控和预警防范故障
|
|||
|
|
- **快速响应**:建立完善的故障响应机制
|
|||
|
|
- **持续改进**:定期回顾和优化运维流程
|
|||
|
|
- **文档驱动**:完善的运维文档和知识库
|
|||
|
|
|
|||
|
|
通过实施本运维方案,可以确保解班客项目的高可用性、高性能和高安全性。
|