proxmox-infrastruktur/docs/TROUBLESHOOTING.md

# Troubleshooting Guide

## Table of Contents

1. [Container Problems](#1-container-problems)
2. [Network Problems](#2-network-problems)
3. [SSL/Certificate Problems](#3-sslcertificate-problems)
4. [Service-Specific Problems](#4-service-specific-problems)
5. [Backup/Restore Problems](#5-backuprestore-problems)
6. [Performance Problems](#6-performance-problems)
7. [VM and Storage Problems](#7-vm-and-storage-problems)
8. [Pitfalls and Lessons Learned](#8-pitfalls-and-lessons-learned)

---

## 1. Container Problems

### Container does not start

**Symptom:** `docker compose up -d` completes, but the container is not running

```bash
# Check status
docker compose ps

# Show logs
docker logs <container-name>

# Detailed information
docker inspect <container-name>
```

**Common causes:**

1. **Port already in use**
   ```bash
   # Check which process holds the port
   netstat -tulpn | grep <port>

   # Stop that process or change the port mapping
   ```

2. **Volume permissions**
   ```bash
   # Fix ownership and permissions
   chown -R 1000:1000 /opt/docker/<service>
   chmod -R 755 /opt/docker/<service>
   ```

3. **Missing .env file**
   ```bash
   # Check whether .env exists
   ls -la /opt/docker/.env

   # Create it from the template
   cp docker/.env.template docker/.env
   ```
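
An existing `.env` can still lack a key that was added to the template later. The following is a minimal sketch for spotting such gaps, assuming both files use plain `KEY=value` lines. The function name `missing_env_keys` is hypothetical and not part of this repository.

```bash
# Hypothetical helper: print every KEY that is defined in the template
# but missing in the .env file. Comments and blank lines are ignored.
missing_env_keys() {
  template="$1"
  envfile="$2"
  # Extract the key names from the template, then look each one up in the .env file.
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$template" | cut -d= -f1 | while read -r key; do
    grep -q "^${key}=" "$envfile" || echo "$key"
  done
}

# Example (paths follow the layout used in this guide):
# missing_env_keys docker/.env.template docker/.env
```

No output means that every key from the template is present.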

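### "Permission denied" on Proxmox

**Symptom:** `socketpair() failed (13: Permission denied)`

**Solution:** Set `security_opt` for the affected service in docker-compose.yml. Both options disable a confinement layer, so apply them only to containers that need them: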
```yaml
security_opt:
  - apparmor=unconfined
  - seccomp=unconfined
```

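### Container keeps restarting

**Symptom:** Container status shows "Restarting"

```bash
# Check the exit code
docker inspect --format='{{.State.ExitCode}}' <container>

# Show the most recent errors
docker logs --tail 50 <container>

# Show the health check state and its recent results
docker inspect --format='{{json .State.Health}}' <container>

# To debug without the health check, set "healthcheck: { disable: true }"
# for the service in docker-compose.yml, then recreate it
# (docker compose up has no --no-healthcheck flag)
docker compose up -d <service>
```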
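
The exit code usually narrows down the cause quickly. As a rough sketch, the mapping below covers the codes seen most often. The helper name `explain_exit_code` is hypothetical, and codes above 128 follow the shell convention of 128 plus the signal number.

```bash
# Hypothetical helper: map a container exit code to its most common meaning.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly (check the restart policy)" ;;
    1)   echo "application error (see docker logs)" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "killed by SIGKILL (often out of memory)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "unknown, check the application documentation" ;;
  esac
}

# Example:
# explain_exit_code "$(docker inspect --format='{{.State.ExitCode}}' <container>)"
```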

---

## 2. Network Problems

### WireGuard tunnel not connected

**Symptom:** No connection to 10.0.0.x addresses

```bash
# WireGuard status
wg show wg0

# Check the interface
ip addr show wg0

# Restart the tunnel
systemctl restart wg-quick@wg0

# Check the logs
journalctl -u wg-quick@wg0 -n 50
```

**Checklist:**
- [ ] PrivateKey/PublicKey correct?
- [ ] Endpoint IP:port reachable?
- [ ] Firewall rules on the VPS allow the WireGuard UDP port?
- [ ] PersistentKeepalive set?
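
A stale handshake is often the quickest indicator that one of the checklist items is wrong. The sketch below reads the output of `wg show wg0 latest-handshakes` (public key and Unix timestamp per line) and lists peers without a recent handshake. The function name `stale_peers` and the 180-second default are assumptions for illustration.

```bash
# Hypothetical helper: read "<public-key> <unix-timestamp>" lines on stdin and
# report peers whose last handshake is older than max_age seconds.
# A timestamp of 0 means that no handshake has happened yet.
stale_peers() {
  now="$1"
  max_age="${2:-180}"
  awk -v now="$now" -v max="$max_age" '
    $2 == 0        { print $1 ": no handshake yet"; next }
    now - $2 > max { print $1 ": last handshake " (now - $2) "s ago" }
  '
}

# Example:
# wg show wg0 latest-handshakes | stale_peers "$(date +%s)"
```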

### Service not reachable externally

```bash
# 1. Is the container running?
docker ps | grep <service>

# 2. Does the port respond on the Proxmox host?
curl http://localhost:<port>

# 3. Is the WireGuard tunnel up?
ping -c 4 10.0.0.1  # VPS as seen from Proxmox

# 4. Test the nginx config (run in cmd on the Windows VPS)
cd C:\nginx && nginx.exe -t

# 5. Restart nginx (run in cmd on the Windows VPS)
net stop nginx && net start nginx
```

### DNS problems

```bash
# Check the IP that DuckDNS resolves to
nslookup eckardt-vault.duckdns.org

# Check your own external IP
curl ifconfig.me

# Update DuckDNS manually (an empty ip= lets DuckDNS detect the address)
curl "https://www.duckdns.org/update?domains=eckardt-vault&token=<TOKEN>&ip="
```

---

## 3. SSL/Certificate Problems

### Certificate expired

**On the Windows VPS:**
```cmd
REM Renew the certificate
cd C:\winacme
wacs.exe --renew --force

REM Restart nginx so it loads the new certificate
net stop nginx && net start nginx
```
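
To catch an expiring certificate before it lapses, the remaining lifetime can be checked from any Linux host. This is a minimal sketch that assumes GNU `date` and the `notAfter=...` line printed by `openssl x509 -noout -enddate`. The function name `cert_days_left` is hypothetical.

```bash
# Hypothetical helper: convert an openssl "notAfter=..." line into whole days
# remaining, relative to the epoch timestamp given as the second argument.
# Requires GNU date (for the -d option).
cert_days_left() {
  not_after="${1#notAfter=}"
  now="$2"
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Example against the live host:
# end=$(echo | openssl s_client -connect eckardt-vault.duckdns.org:443 \
#         -servername eckardt-vault.duckdns.org 2>/dev/null | openssl x509 -noout -enddate)
# cert_days_left "$end" "$(date +%s)"
```

A negative result means that the certificate has already expired.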

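### Let's Encrypt rate limit

**Symptom:** "too many certificates already issued"

**Solution:**
- The limit is 5 certificates per exact set of hostnames per week (a separate limit of 50 per registered domain per week also applies)
- Wait until the limit window has passed, or request a certificate for a different subdomain
- Use the staging environment for testing

### Mixed content warning

**Symptom:** Browser reports "insecure content"

**Solution:** All services must be served over HTTPS, and the proxy must pass the original scheme to the backend
```nginx
# In nginx.conf
proxy_set_header X-Forwarded-Proto $scheme;
```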

---

## 4. Service-Specific Problems

### Nextcloud

**"Maintenance mode is enabled"**
```bash
# occ must run as the web server user (www-data in the official image)
docker exec -u www-data nextcloud php occ maintenance:mode --off
```

**File upload fails**
```bash
# Raise the PHP limits
docker exec nextcloud bash -c 'echo "upload_max_filesize=10G" >> /usr/local/etc/php/conf.d/uploads.ini'
docker exec nextcloud bash -c 'echo "post_max_size=10G" >> /usr/local/etc/php/conf.d/uploads.ini'
docker restart nextcloud
```

**"Trusted domain" error**
```bash
docker exec -u www-data nextcloud php occ config:system:set trusted_domains 1 --value=eckardt-cloud.duckdns.org
```

### Vaultwarden

**Admin page not reachable**
```bash
# Check the admin token (docker logs writes to stderr as well, hence 2>&1)
docker logs vaultwarden 2>&1 | grep -i admin

# URL: /admin, log in with the token from .env
```

**Sync errors in clients**
```bash
# Test the connection
curl -v https://eckardt-vault.duckdns.org/vault/api/alive
```

### Gitea

**SSH clone does not work**
```bash
# Test the SSH connection
ssh -T -p 2222 git@192.168.178.111

# Check the authorized keys
docker exec gitea cat /data/git/.ssh/authorized_keys
```

**"Unable to find user" after restart**
```bash
# List the Gitea users (the gitea CLI refuses to run as root, hence -u git)
docker exec -u git gitea gitea admin user list
```

### n8n

**Webhooks do not work**
```bash
# WEBHOOK_URL in .env must point to the external URL
docker logs n8n 2>&1 | grep -i webhook
```

---

## 5. Backup/Restore Problems

### Backup fails

```bash
# Check permissions
ls -la /opt/backups/

# Check free disk space
df -h /opt/backups/

# Run the backup manually
/opt/scripts/backup.sh nextcloud
```

### Performing a restore

```bash
# Stop the container
docker compose stop <service>

# Delete the old volume contents (irreversible, so verify the backup archive first;
# the * glob does not match hidden files)
rm -rf /opt/docker/<service>/*

# Unpack the backup
tar -xzf /opt/backups/<service>_YYYYMMDD.tar.gz -C /opt/docker/<service>/

# Start the container
docker compose up -d <service>
```
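
Because the restore deletes the live data before unpacking, it is worth confirming that the archive is readable first. The following sketch uses `tar -tzf` for that. The function name `verify_backup` is hypothetical and not part of `/opt/scripts`.

```bash
# Hypothetical helper: succeed only if the archive exists, is readable by tar
# and contains at least one entry. Prints the number of entries on success.
verify_backup() {
  archive="$1"
  [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
  tar -tzf "$archive" >/dev/null 2>&1 || { echo "unreadable archive: $archive" >&2; return 1; }
  entries=$(tar -tzf "$archive" | wc -l | tr -d ' ')
  [ "$entries" -gt 0 ] || { echo "no entries in: $archive" >&2; return 1; }
  echo "$entries entries"
}

# Example: continue with the restore only if the check passes
# verify_backup /opt/backups/<service>_YYYYMMDD.tar.gz && docker compose stop <service>
```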

---

## 6. Performance Problems

### High CPU load

```bash
# Resource usage per container
docker stats --no-stream

# Check the CPU limit (0 means no limit is set)
docker inspect --format='{{.HostConfig.NanoCpus}}' <container>
```

### Disk full

```bash
# Docker cleanup (WARNING: removes all unused images and unused volumes)
docker system prune -a --volumes

# Empty the container logs
truncate -s 0 /var/lib/docker/containers/*/*-json.log

# Delete backup files older than 30 days
find /opt/backups -type f -mtime +30 -delete
```
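
Deleting old backups is irreversible, so a dry run that only lists the candidates is a sensible first step. This is a sketch with a hypothetical function name `prune_backups`. It restricts the match to regular files so that directories stay in place.

```bash
# Hypothetical helper: list (default) or delete regular files older than
# N days below a directory. Pass "delete" as third argument to remove them.
prune_backups() {
  dir="$1"
  days="$2"
  mode="${3:-list}"
  if [ "$mode" = "delete" ]; then
    find "$dir" -type f -mtime +"$days" -print -delete
  else
    find "$dir" -type f -mtime +"$days" -print
  fi
}

# Example:
# prune_backups /opt/backups 30          # dry run, prints the candidates
# prune_backups /opt/backups 30 delete   # actually deletes them
```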

### Slow response times

```bash
# Test network latency through the tunnel
ping -c 10 10.0.0.2

# Container resources
docker stats <container>

# Disk I/O (iostat is part of the sysstat package)
iostat -x 1 5
```

---

## Diagnostic Commands Overview

```bash
# Status of all containers
docker compose ps

# All logs (live)
docker compose logs -f

# Resource usage
docker stats

# List networks
docker network ls

# List volumes
docker volume ls

# Docker disk usage
docker system df

# Run the health check script
/opt/scripts/health-check.sh
```

---

## 7. VM and Storage Problems

### VM snapshots do not work

**Problem:** `qm snapshot` reports "snapshot feature is not available"

**Cause:** The disk is attached as a raw device (`/dev/pve/...`) instead of a Proxmox-managed disk

**Diagnosis:**
```bash
# Check the VM configuration
qm config 100 | grep scsi

# Wrong (raw device, no snapshots):
# scsi1: /dev/pve/vm-100-data,size=200G

# Correct (Proxmox-managed, snapshots possible):
# scsi1: local-lvm:vm-100-data,size=200G
```
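
The same check can be applied to every disk bus at once. The sketch below filters `qm config` output for disk entries that point at a `/dev/` path. The function name `raw_device_disks` is hypothetical.

```bash
# Hypothetical helper: read `qm config <vmid>` output on stdin and print every
# disk entry that points at a raw device path instead of a storage volume.
raw_device_disks() {
  grep -E '^(scsi|virtio|sata|ide)[0-9]+: */dev/'
}

# Example:
# qm config 100 | raw_device_disks
```

No output means that all disks are storage-managed.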

**Solution:**
```bash
# 1. Stop the VM (qm shutdown 100 performs a clean guest shutdown instead)
qm stop 100

# 2. Detach the raw device
qm set 100 --delete scsi1

# 3. Re-attach it as a Proxmox-managed disk
qm set 100 --scsi1 local-lvm:vm-100-data

# 4. Start the VM
qm start 100
```

---

### Storage overview and disk migration

**Current storage layout:**

| Storage | NVMe | Usage | Capacity |
|---------|------|-------|----------|
| `local-lvm` | nvme0n1 (WDC) | VM system disks | ~350 GB thin pool |
| `nvme-data` | nvme1n1 (SKHynix) | Nextcloud/data | ~450 GB thin pool |

**Check storage status:**
```bash
pvesm status
```

**Move a disk between storages (possible while the VM is running):**
```bash
# Move the disk from local-lvm to nvme-data
# --delete 1 = delete the old volume after the move
qm disk move 100 scsi1 nvme-data --delete 1
```

---

### VM snapshot commands

```bash
# Create a snapshot
qm snapshot 100 <name> --description "Description"

# List snapshots
qm listsnapshot 100

# Roll back to a snapshot (the running VM is stopped and its current state is discarded)
qm rollback 100 <name>

# Delete a snapshot
qm delsnapshot 100 <name>
```

**Note:** The warning "QEMU Guest Agent is not running" is not critical. For more consistent snapshots, install `qemu-guest-agent` inside the VM and enable the agent option on the host (`qm set 100 --agent enabled=1`):
```bash
apt install qemu-guest-agent
systemctl enable qemu-guest-agent
systemctl start qemu-guest-agent
```

---

### Thin pool warnings

**Problem:** `WARNING: Sum of all thin volume sizes exceeds the size of thin pool`

**Cause:** Thin provisioning allows overprovisioning: the virtual volumes together are larger than the physical storage

**Solution:** This is normal with thin provisioning. What matters is monitoring the actual usage, because a completely full pool causes write errors in the VMs:
```bash
# Check actual usage
lvs -o lv_name,lv_size,data_percent

# Thin pool status
lvs pve/data -o lv_size,data_percent,metadata_percent
```
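
Monitoring is easier with a threshold check that can run from cron. This is a minimal sketch that assumes two-column `lvs` output (name and data percentage) with a dot as decimal separator. The function name `thin_usage_alert` and the 80 percent default are assumptions.

```bash
# Hypothetical helper: read "<lv_name> <data_percent>" lines on stdin and print
# the volumes whose usage exceeds the threshold (default 80 percent).
# Lines without a percentage (non-thin volumes) are skipped.
thin_usage_alert() {
  awk -v limit="${1:-80}" 'NF >= 2 && $2 + 0 > limit { print $1 " is " $2 "% full" }'
}

# Example:
# lvs --noheadings -o lv_name,data_percent pve | thin_usage_alert 80
```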

---

## 8. Pitfalls and Lessons Learned

### nginx on Windows

**Problem:** `listen 443 ssl http2;` is deprecated (since nginx 1.25.1)

**Solution:** Use the new syntax:
```nginx
listen 443 ssl;
http2 on;
```

**Problem:** The `proxy_max_temp_file_size` directive does not work

**Solution:** Remove the directive or set it to a valid value (e.g. `0` to disable buffering to temporary files)

---

### Docker User Namespace Remapping

**Problem:** Containers do not start after enabling `userns-remap`

**Cause:** Existing volumes have the wrong ownership for the remapped user

**Solution:**
- Option 1: Do not use `userns-remap` on existing installations
- Option 2: Adjust the ownership of all volumes (laborious)

`/etc/docker/daemon.json` without `userns-remap`, for existing volumes (JSON does not allow comments, so keep the file free of them):
```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "no-new-privileges": true,
  "live-restore": true
}
```

---

### MariaDB Health Checks

**Problem:** `healthcheck.sh --connect --innodb_initialized` fails

**Cause:** The health check attempts a root login without a password

**Solution:** Use a simple health check or omit it entirely:
```yaml
# Option 1: Simple ping check with explicit credentials
# (replace <DB_PASSWORD>; in newer MariaDB images the binary is named mariadb-admin)
healthcheck:
  test: ["CMD", "mysqladmin", "ping", "-h", "localhost", "-u", "nextcloud", "-p<DB_PASSWORD>"]

# Option 2: No health check (dependent containers rely on depends_on only)
# healthcheck: omit
```

---

### Gitea subpath vs subdomain

**Problem:** Gitea served under a subpath (e.g. `/git/`) does not load its assets

**Cause:** Gitea generates absolute asset paths based on ROOT_URL

**Solution:** Always use a dedicated subdomain:
- ❌ `https://example.com/git/` does not work reliably
- ✅ `https://git.example.com/` works

---

### SSH key-only login after disabling passwords

**Problem:** Locked out after disabling PasswordAuthentication

**Prevention:**
1. ALWAYS test the SSH key login before disabling password login
2. Keep backup access via the Proxmox console

**Emergency recovery:**
```bash
# Via the Proxmox web console (https://192.168.178.111:8006)
# 1. Open a shell
# 2. Re-enable password authentication
sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
systemctl restart sshd
```

---

### UFW and Docker

**Problem:** UFW rules are ignored by Docker

**Cause:** Docker manipulates iptables directly, so published ports bypass the UFW rules

**Solution:** Bind Docker ports to localhost only, or use the ufw-docker integration:
```yaml
# In docker-compose.yml: reachable only from the local host
ports:
  - "127.0.0.1:8080:80"
```
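
To find containers that still publish a port on all interfaces, the port column of `docker ps` can be filtered. This is a sketch that assumes the input format `<name> <ports>` as produced by the `--format` string in the example. The function name `publicly_bound` is hypothetical.

```bash
# Hypothetical helper: read "<name> <ports>" lines on stdin and print the names
# of containers that publish a port on all interfaces (0.0.0.0 or ::).
publicly_bound() {
  grep -E '(^|[ ,])(0\.0\.0\.0|\[::\]|::):[0-9]+->' | cut -d' ' -f1
}

# Example:
# docker ps --format '{{.Names}} {{.Ports}}' | publicly_bound
```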

---

### Git push to Gitea fails

**Problem:** `Authentication failed` when pushing to Gitea

**Cause:** Git Credential Manager cannot authenticate

**Solution:** Use an API token:
```bash
# Generate a token in Gitea:
# Settings -> Applications -> Generate New Token

# Push with the token (one-time)
git push https://USERNAME:TOKEN@eckardt-git.duckdns.org/USERNAME/REPO.git master

# Or: set the remote URL with the token (permanent)
git remote set-url origin https://USERNAME:TOKEN@eckardt-git.duckdns.org/USERNAME/REPO.git
```

**Note:** Store the token in Vaultwarden! A remote URL that contains the token is saved in plain text in `.git/config`.

---

### Rate Limiting Debugging

**Problem:** Unclear whether rate limiting takes effect

**Test:**
```bash
# Send requests in quick succession
for i in {1..20}; do curl -s -o /dev/null -w "%{http_code}\n" https://eckardt-vault.duckdns.org/vault/; done

# With an active limit: 503 after a few requests (the nginx default for limit_req_status)
```
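
With many requests, a per-code summary is easier to read than a long column of status codes. This is a small sketch with a hypothetical function name `summarize_codes` that counts each code on stdin.

```bash
# Hypothetical helper: count how often each HTTP status code occurs
# (expects one status code per input line).
summarize_codes() {
  sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Example:
# for i in $(seq 1 20); do
#   curl -s -o /dev/null -w "%{http_code}\n" https://eckardt-vault.duckdns.org/vault/
# done | summarize_codes
```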

**Check the nginx logs (on the VPS):**
```cmd
type C:\nginx\logs\error.log | findstr "limiting"
```

---

### VT-x/KVM not available

**Problem:** `KVM virtualisation configured, but not available`

**Symptom:** VMs do not start, `/dev/kvm` does not exist

**Cause:** Intel VT-x or AMD-V is disabled in the BIOS

**Diagnosis:**
```bash
# Count the CPU entries that carry the vmx (Intel) or svm (AMD) flag
grep -Ec '(vmx|svm)' /proc/cpuinfo
# Output 0 = hardware virtualization is disabled or unsupported

# Check the KVM device
ls -la /dev/kvm
```
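
To see which of the two flags is present, which also tells you the name of the BIOS option to look for, a small helper can read the flags directly. The function name `virt_flag` is hypothetical. The file argument exists only so that the check can be run against a saved copy of `/proc/cpuinfo`.

```bash
# Hypothetical helper: report which hardware virtualization flag appears in a
# cpuinfo file (default /proc/cpuinfo): vmx = Intel VT-x, svm = AMD-V.
virt_flag() {
  file="${1:-/proc/cpuinfo}"
  if grep -qw vmx "$file"; then
    echo "vmx"
  elif grep -qw svm "$file"; then
    echo "svm"
  else
    echo "none"
  fi
}

# Example:
# virt_flag
```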

**Solution for the HP Z2 Workstation:**
1. Reboot, press F10 for BIOS Setup
2. Security -> System Security
3. **Virtualization Technology (VT-x)** -> Enabled
4. **VT-d** -> Enabled (optional, for PCI passthrough)
5. Press F10 to save and exit

**Solution for other systems:**
- Enter the BIOS/UEFI (usually F2, F10 or DEL during boot)
- Look for: "Virtualization", "VT-x", "AMD-V", "SVM"
- Enable and save

**After enabling:**
```bash
# Verify
ls -la /dev/kvm
grep -Ec '(vmx|svm)' /proc/cpuinfo  # Should be > 0
```

---

## Contact / Help

- **Gitea Issues:** https://eckardt-git.duckdns.org/Martin/proxmox-infrastruktur/issues
- **Docker Docs:** https://docs.docker.com/
- **Nextcloud Docs:** https://docs.nextcloud.com/
- **Vaultwarden Wiki:** https://github.com/dani-garcia/vaultwarden/wiki