IT Problems & Solutions — Complete Troubleshooting Guide

🌐

Networking Problems

28 problems

High Cannot connect to the internet — all pings fail

Problem

No internet access. ping 8.8.8.8 times out. Browser shows "No internet connection".

Diagnosis

ip addr show          # check if interface has IP
ip route show         # check default gateway exists
ping 192.168.1.1      # ping gateway — if fails, local issue
ping 8.8.8.8          # if gateway OK but this fails, routing/ISP issue (ping by IP does not test DNS)
cat /etc/resolv.conf  # check DNS servers

Solution — Step by Step

Restart network interface: sudo ip link set eth0 down && sudo ip link set eth0 up
Request new IP via DHCP: sudo dhclient -r && sudo dhclient eth0
Add Google DNS if resolv.conf is empty: echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
Add missing default route: sudo ip route add default via 192.168.1.1
If still failing, restart networking service: sudo systemctl restart NetworkManager
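
The diagnosis order above (interface address, gateway, external IP, DNS) can be sketched as a small decision helper. This is only an illustration: the function name classify_connectivity is hypothetical, and it takes the exit status of each check (0 for success) instead of running the checks itself, so it can be tried without touching the network.

```shell
# classify_connectivity HAS_IP GATEWAY_OK EXTERNAL_OK DNS_OK
# Each argument is an exit status from a check (0 = success, non-zero = failure).
# Prints the first layer that failed, following the diagnosis order above.
classify_connectivity() {
  has_ip=$1; gateway=$2; external=$3; dns=$4
  if [ "$has_ip" -ne 0 ]; then
    echo "no-address: interface has no IP, renew the DHCP lease"
  elif [ "$gateway" -ne 0 ]; then
    echo "local: gateway unreachable, check cable, Wi-Fi, or default route"
  elif [ "$external" -ne 0 ]; then
    echo "upstream: gateway answers but 8.8.8.8 does not, suspect router or ISP"
  elif [ "$dns" -ne 0 ]; then
    echo "dns: IP traffic works but names do not resolve, check /etc/resolv.conf"
  else
    echo "ok: all checks passed"
  fi
}

# Example wiring (commented out so the block has no side effects):
# ip -4 addr show eth0 | grep -q "inet ";        a=$?
# ping -c 2 -W 2 192.168.1.1 >/dev/null 2>&1;    b=$?
# ping -c 2 -W 2 8.8.8.8 >/dev/null 2>&1;        c=$?
# getent hosts google.com >/dev/null 2>&1;       d=$?
# classify_connectivity "$a" "$b" "$c" "$d"
```

The interface name eth0 and gateway 192.168.1.1 are the same placeholders used in the steps above; substitute your own.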

Tags: networking, dhcp, dns, linux

High DNS resolution failing — domain not found

Problem

Websites fail with "server not found" but IP addresses work fine. nslookup google.com returns SERVFAIL.

Diagnosis

nslookup google.com           # test DNS resolution
nslookup google.com 8.8.8.8  # test with Google DNS directly
cat /etc/resolv.conf          # see configured DNS servers
resolvectl status             # check systemd-resolved (systemd-resolve --status on older releases)

Solution

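Set reliable DNS servers: sudo nano /etc/resolv.conf → add nameserver 8.8.8.8 and nameserver 1.1.1.1 (if systemd-resolved or NetworkManager manages the file, set DNS there instead, or the edit will be overwritten)
Flush DNS cache: sudo resolvectl flush-caches (sudo systemd-resolve --flush-caches on older releases)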
On Windows: ipconfig /flushdns
Restart DNS resolver: sudo systemctl restart systemd-resolved
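
Before editing resolv.conf it helps to see which resolvers are actually configured. The sketch below uses two hypothetical helper names (list_nameservers, uses_stub_resolver) and assumes the standard resolv.conf syntax; 127.0.0.53 is the local stub address that systemd-resolved installs.

```shell
# list_nameservers [FILE]
# Prints one nameserver address per line from a resolv.conf-style file.
# Defaults to /etc/resolv.conf; comment lines and other directives are skipped.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# uses_stub_resolver [FILE]
# Succeeds when the only configured resolver is 127.0.0.53, the local stub
# that systemd-resolved installs. In that case the real upstream servers are
# shown by `resolvectl status`, not by resolv.conf.
uses_stub_resolver() {
  [ "$(list_nameservers "$1" | sort -u)" = "127.0.0.53" ]
}
```

If uses_stub_resolver succeeds, changing upstream DNS through systemd-resolved or NetworkManager is likely to stick better than editing the file by hand.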

Tags: dns, networking, resolv.conf

Medium High network latency and packet loss

Problem

Slow network, video calls drop, ping shows >200ms or packet loss.

Diagnosis

ping -c 50 8.8.8.8             # check packet loss %
mtr --report 8.8.8.8          # traceroute + ping combined
iperf3 -c iperf.he.net        # test bandwidth
ss -tulpn                     # list listening sockets and the processes that own them
nethogs                       # per-process bandwidth monitor

Solution

Identify bandwidth hogs: sudo nethogs eth0 — kill greedy processes
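Check for duplex mismatch: ethtool eth0 shows the negotiated speed and duplex. If they are wrong, re-enable autonegotiation with sudo ethtool -s eth0 autoneg on (gigabit copper links require it); force speed and duplex only when the switch port is fixed to the same values
Enable jumbo frames for LAN only if every NIC and switch on the path supports them: sudo ip link set eth0 mtu 9000 (a mismatched MTU causes drops and stalls)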
If Wi-Fi: switch to 5GHz band or move closer to AP
If VPS: contact provider — could be noisy neighbor on shared host
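
To track packet loss over time rather than reading it off the screen, the summary line of ping can be parsed. A minimal sketch, assuming the usual Linux ping summary format ("N% packet loss"); both function names are hypothetical.

```shell
# ping_loss_percent
# Reads `ping` output on stdin and prints the packet-loss percentage
# from the summary line (for example "4% packet loss" prints 4).
ping_loss_percent() {
  grep -oE '[0-9.]+% packet loss' | head -n 1 | cut -d'%' -f1
}

# loss_exceeds THRESHOLD
# Reads ping output on stdin; succeeds when loss is above THRESHOLD percent.
loss_exceeds() {
  loss=$(ping_loss_percent)
  [ -n "$loss" ] && awk -v l="$loss" -v t="$1" 'BEGIN { exit !(l > t) }'
}

# Usage: ping -c 50 8.8.8.8 | loss_exceeds 2 && echo "investigate"
```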

Tags: latency, packet-loss, mtr

Medium SSH connection refused on port 22

Problem

ssh user@host returns "Connection refused" or times out.

Diagnosis

nc -zv host 22          # test if port 22 open
nmap -p 22 host         # port scan
# on the server:
systemctl status sshd   # is SSH daemon running?
ss -tlnp | grep 22      # is it listening?
ufw status              # firewall blocking?

Solution

Start SSH daemon: sudo systemctl start sshd && sudo systemctl enable sshd (the unit is named ssh on Debian/Ubuntu)
Allow port in firewall: sudo ufw allow 22/tcp
If port was changed, connect with: ssh -p 2222 user@host
Check SSH config: sudo sshd -T | grep -i '^port'

Tags: ssh, firewall, port-22

Medium SSL certificate error in browser

Problem

"Your connection is not private" / NET::ERR_CERT_AUTHORITY_INVALID or certificate expired warning.

Diagnosis

openssl s_client -connect domain.com:443 2>/dev/null | openssl x509 -noout -dates
# Shows notBefore and notAfter dates
curl -vI https://domain.com 2>&1 | grep -E "expire|SSL|cert"

Solution

Check expiry: if expired, renew with sudo certbot renew --force-renewal
Test auto-renewal without changing anything: sudo certbot renew --dry-run
Add cron: 0 12 * * * certbot renew --quiet
If self-signed cert: replace with Let's Encrypt free cert
Check system clock — wrong date causes SSL failures: timedatectl status
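
The notAfter date from the diagnosis step can be turned into a "days remaining" number for monitoring. This sketch assumes GNU date (for -d) and uses a hypothetical helper name, days_until; the optional second argument exists only so the calculation can be checked against a fixed "now".

```shell
# days_until NOT_AFTER [NOW_EPOCH]
# NOT_AFTER is the date string printed by `openssl x509 -noout -enddate`
# after "notAfter=", e.g. "Jan 11 00:00:00 2030 GMT".
# NOW_EPOCH defaults to the current time. Requires GNU date (-d).
days_until() {
  end=$(date -u -d "$1" +%s) || return 1
  now=${2:-$(date -u +%s)}
  echo $(( (end - now) / 86400 ))
}

# Example (commented out; needs network access and a real host name):
# not_after=$(echo | openssl s_client -connect domain.com:443 -servername domain.com 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# [ "$(days_until "$not_after")" -lt 14 ] && echo "renew soon"
```

The 14-day warning window in the example is an arbitrary choice, not a certbot setting.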

Tags: ssl, https, certbot, certificate

Low VPN connected but no internet access

Problem

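VPN shows connected but browsing fails. This usually happens in full-tunnel mode, where the VPN routes all traffic through the tunnel and either the tunnel route or the DNS server it pushes does not work.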

Solution

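Check routing table: ip route show, then look for a default route (or the 0.0.0.0/1 and 128.0.0.0/1 pair that OpenVPN's redirect-gateway def1 installs) via the VPN interface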
Enable split tunneling in VPN client settings to allow non-VPN traffic
Add DNS bypass: set DNS to 8.8.8.8 in network settings while VPN is on
On OpenVPN: remove redirect-gateway def1 from config if you don't want all traffic routed

Tags: vpn, routing, split-tunnel

High Port 80/443 not accessible from outside

Problem

Web server runs locally but external users can't reach the site.

Diagnosis

ss -tlnp | grep -E ':(80|443) '   # is nginx/apache listening? (anchored so 8080 or 4430 do not match)
sudo ufw status                    # firewall rules
curl -I http://localhost            # local test
# From outside:
curl -I http://YOUR_IP

Solution

Open firewall: sudo ufw allow 80/tcp && sudo ufw allow 443/tcp
Check server binds to 0.0.0.0 not 127.0.0.1 in nginx/apache config
If cloud server (AWS/DO): add inbound rules in security group / firewall panel
Restart web server: sudo systemctl restart nginx

Tags: nginx, firewall, ufw, web-server

🔒

Security Problems

30 problems

High Server getting brute-forced via SSH

Problem

Hundreds of failed SSH login attempts in auth.log. Server is being scanned/attacked.

Diagnosis

sudo grep "Failed password" /var/log/auth.log | tail -20
sudo grep "Failed password" /var/log/auth.log | grep -oE 'from [0-9.]+' | awk '{print $2}' | sort | uniq -c | sort -rn | head -10   # top source IPs (a fixed awk field breaks on "invalid user" lines)

Solution

Install fail2ban: sudo apt install fail2ban && sudo systemctl enable fail2ban
Disable password auth — use keys only: in /etc/ssh/sshd_config set PasswordAuthentication no
Change SSH port: Port 2222 in sshd_config (reduces noise significantly)
Allow only your IP: sudo ufw allow from YOUR_IP to any port 22
Reload SSH: sudo systemctl reload sshd
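
The log analysis above can be wrapped in a reusable function that works on any log text piped to it. A minimal sketch: top_failed_ips is a hypothetical name, it only recognises IPv4 addresses, and it assumes the standard OpenSSH "Failed password for ... from IP port N" message format.

```shell
# top_failed_ips [N]
# Reads sshd log lines on stdin and prints the N most frequent source
# addresses of failed password attempts as "COUNT IP" (default N=10).
top_failed_ips() {
  grep 'Failed password' \
    | grep -oE 'from [0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | awk '{ print $2 }' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}" \
    | awk '{ print $1, $2 }'
}

# Usage: sudo cat /var/log/auth.log | top_failed_ips 5
```

Extracting the address after "from" rather than by field position keeps the count correct for both valid and "invalid user" entries.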

Tags: ssh, brute-force, fail2ban, hardening

High Website defaced or injected with malware

Problem

Site shows unexpected content, redirects visitors, or Google marks it as dangerous.

Diagnosis

find /var/www -name "*.php" -newer /var/www/index.php -ls  # recently changed files
grep -r "eval(base64_decode" /var/www/                     # common malware pattern
grep -r "iframe" /var/www/ --include="*.html"              # injected iframes

Solution

Take site offline immediately to protect visitors
Restore from last known clean backup
Change all passwords: FTP, database, CMS admin, hosting panel
Update CMS/plugins to latest versions — patch the entry point
Add WAF: Cloudflare free plan blocks most attacks
Request Google reconsideration after cleanup

Tags: malware, defacement, cms, recovery

High Ransomware encrypted files on server

Problem

Files renamed with unknown extension (.locked, .crypt). Ransom note left on server.

Solution

Immediately isolate the server — disconnect from network
Do NOT pay the ransom — no guarantee of decryption
Identify the ransomware family at nomoreransom.org — may have free decryptor
Restore from offline backup (why you should always have air-gapped backups)
Report to CISA (US) or your national cybercrime unit
After restore: patch the entry vector, enable EDR, set up immutable backups

Tags: ransomware, incident-response, backup

Medium API keys accidentally pushed to GitHub

Problem

AWS keys, API tokens or passwords committed to a public or private repository.

Solution

Revoke the exposed key IMMEDIATELY — do this before anything else
Remove from Git history: git filter-repo --path secrets.env --invert-paths
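Force push: git push origin --force --all (git filter-repo removes the origin remote, so re-add it first with git remote add origin URL)
Add to .gitignore: printf '.env\nsecrets.env\n' >> .gitignore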
Use environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault)
Enable GitHub secret scanning to catch this automatically in future
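
A quick local scan can catch the most recognisable key format before a push. The sketch below is deliberately narrow: find_aws_key_ids is a hypothetical name, and it only matches long-term AWS access key IDs (the "AKIA" prefix followed by 16 uppercase letters or digits). It is not a substitute for a dedicated secret scanner.

```shell
# find_aws_key_ids [DIR]
# Recursively lists files under DIR (default: current directory) that contain
# a string shaped like an AWS access key ID: "AKIA" followed by 16 uppercase
# letters or digits. Binary files and the .git directory are skipped.
# Exits non-zero when nothing is found (grep's normal behaviour).
find_aws_key_ids() {
  grep -rIlE --exclude-dir=.git 'AKIA[0-9A-Z]{16}' "${1:-.}"
}

# Usage, e.g. from a pre-commit hook:
# if find_aws_key_ids .; then echo "possible AWS key found, aborting"; exit 1; fi
```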

Tags: api-keys, git, secrets, github

Medium Users logging in from suspicious locations

Problem

Account accessed from unknown country/IP. Possible credential compromise.

Solution

Force password reset for affected account immediately
Invalidate all existing sessions
Enable MFA: a leaked password alone is then not enough to log in
Check for other accounts using same password — change those too
Review audit logs for what the attacker accessed or changed
Set up geo-blocking or anomaly detection alerts

Tags: account-takeover, mfa, incident

Low Missing security headers on web application

Problem

securityheaders.com gives F grade. Missing CSP, HSTS, X-Frame-Options etc.

Solution — Add to Nginx config

add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-XSS-Protection "0" always;  # legacy header: modern browsers removed the XSS auditor, and OWASP recommends disabling it
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy "default-src 'self'; script-src 'self' 'unsafe-inline' https://www.googletagmanager.com; style-src 'self' 'unsafe-inline';" always;
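
After reloading Nginx, the response can be checked from the command line instead of re-running an online scanner. A minimal sketch with a hypothetical function name, missing_security_headers; the list of header names it checks mirrors the config above and is otherwise an assumption.

```shell
# missing_security_headers
# Reads raw HTTP response headers on stdin (e.g. from `curl -sI`) and prints
# the lower-cased name of each expected security header that is absent.
missing_security_headers() {
  headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
  for name in strict-transport-security x-content-type-options \
              x-frame-options referrer-policy content-security-policy; do
    printf '%s\n' "$headers" | grep -q "^$name:" || echo "$name"
  done
}

# Usage: curl -sI https://domain.com | missing_security_headers
```

Empty output means every header in the list was present in the response.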

Tags: security-headers, nginx, csp, hsts

🐧

Linux Problems

28 problems

High Disk 100% full — server unresponsive

Problem

df -h shows 100% usage. Applications crash, logs stop writing, server may become unreachable.

Find the culprit

df -h                                    # which partition is full
du -sh /* 2>/dev/null | sort -rh | head -10  # top directories by size
du -sh /var/log/* | sort -rh | head -10      # often logs are the issue
find / -name "*.log" -size +100M 2>/dev/null # big log files

Solution

Clear old logs: sudo journalctl --vacuum-size=100M
Remove old kernels: sudo apt autoremove --purge
Clean apt cache: sudo apt clean
Find and delete large temp files: sudo find /tmp -type f -size +50M -delete (run it without -delete first to review the list)
Truncate large log file (safe): sudo truncate -s 0 /var/log/syslog
Set log rotation: configure /etc/logrotate.conf
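
To catch a filling disk before it reaches 100%, the df output can be filtered by a threshold and wired into cron or monitoring. A minimal sketch: full_filesystems is a hypothetical name, the default of 90% is an arbitrary choice, and mount points containing spaces are not handled.

```shell
# full_filesystems [THRESHOLD]
# Reads `df -P` output on stdin and prints "MOUNTPOINT PERCENT" for every
# filesystem whose usage is at or above THRESHOLD percent (default 90).
full_filesystems() {
  awk -v t="${1:-90}" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t + 0) print $6, use }'
}

# Usage: df -P | full_filesystems 85
```

The -P flag asks df for the POSIX single-line format, which keeps the column positions stable for parsing.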

Tags: disk-full, logs, storage, linux

High CPU at 100% — server crawling

Diagnosis

top                           # see what's eating CPU (press P to sort by CPU)
ps aux --sort=-%cpu | head -10  # top CPU processes
htop                          # visual alternative
iotop                         # if it's I/O wait, not CPU

Solution

Stop runaway process: kill PID (sends SIGTERM); use kill -9 PID only if the process ignores that
Renice a process to lower its priority: renice +10 PID
Check for crypto miners: ps aux | grep -i "minerd\|xmrig\|cryptonight"
If it's a web process: check for slow database queries or infinite loops in code
Set CPU limits with cgroups or systemd: CPUQuota=50% in service file

Tags: cpu, performance, top, process

High Permission denied errors on files/directories

Problem

bash: ./script.sh: Permission denied or cannot write to a directory you should own.

Diagnosis

ls -la /path/to/file       # check permissions and owner
whoami                     # current user
id                         # groups you're in
stat /path/to/file         # full permission details

Solution

Make script executable: chmod +x script.sh
Change ownership: sudo chown user:group /path
Fix web directory permissions: sudo chown -R www-data:www-data /var/www
Set correct directory perms: chmod 755 /dir (dirs need execute bit)
Add user to group: sudo usermod -aG groupname username (log out and back in)

Tags: permissions, chmod, chown, linux

Medium Service fails to start after reboot

Diagnosis

systemctl status servicename       # see error message
journalctl -u servicename -n 50    # last 50 log lines for service
journalctl -u servicename --since "1 hour ago"

Solution

Enable auto-start: sudo systemctl enable servicename
Fix config errors shown in journal output
Check dependencies: systemctl list-dependencies servicename
If port conflict: ss -tlnp | grep PORT — kill conflicting process
Reload daemon after editing service file: sudo systemctl daemon-reload

Tags: systemd, service, boot, journalctl

Medium Out of memory — OOM killer killing processes

Diagnosis

free -h                           # see available RAM
dmesg | grep -i "oom\|killed"     # OOM killer log
ps aux --sort=-%mem | head -10    # top memory processes

Solution

Add swap (quick fix): sudo fallocate -l 2G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
Make swap permanent: add /swapfile none swap sw 0 0 to /etc/fstab
Tune swappiness: sudo sysctl vm.swappiness=10
Find memory leaks; as a stopgap, restart the leaking service nightly via cron until the leak is fixed
Upgrade RAM or move to larger server

Tags: memory, oom, swap, ram

☁️

Cloud Problems

24 problems

High AWS bill unexpectedly high — cost spike

Problem

AWS/GCP/Azure bill 10x higher than expected. Often caused by forgotten resources or data transfer costs.

Solution

Go to AWS Cost Explorer → identify the service causing the spike
Common culprits: NAT Gateway data transfer, unused EC2 instances, S3 request spikes, Elastic IPs not attached
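Set billing alarm (billing metrics live in us-east-1; the 100 USD threshold is an example, and an --alarm-actions SNS topic is needed to receive notifications): aws cloudwatch put-metric-alarm --region us-east-1 --alarm-name billing-alarm --namespace AWS/Billing --metric-name EstimatedCharges --dimensions Name=Currency,Value=USD --statistic Maximum --period 21600 --evaluation-periods 1 --threshold 100 --comparison-operator GreaterThanThreshold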
Enable AWS Budgets — get email before threshold reached
Use Reserved Instances or Savings Plans for predictable workloads (saves 40-70%)
Delete idle resources: use AWS Trusted Advisor to find them

Tags: aws, cost, billing, cloud

High EC2 instance unreachable after security group change

Problem

Locked yourself out of EC2 — SSH and HTTP both blocked after mis-configuring security group.

Solution

Go to AWS Console → EC2 → Security Groups
Find the security group → Edit Inbound Rules
Add rule: Type SSH, Port 22, Source: My IP (or 0.0.0.0/0 temporarily)
For future: use AWS Systems Manager Session Manager — SSH without port 22
Never remove all rules in one batch — always add new rule before removing old

Tags: ec2, security-group, aws, lockout

Medium S3 bucket public — data exposed

Problem

S3 bucket accidentally made public exposing files. Noticed via security scan or data breach.

Solution

Block public access immediately: AWS Console → S3 → Bucket → Permissions → Block all public access → Enable
Or via CLI: aws s3api put-public-access-block --bucket BUCKET --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
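Audit what was accessed: review S3 server access logs and CloudTrail data events if they were already enabled; if not, enable them now, since they only record access from that point on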
Enable S3 Block Public Access at account level to prevent future mistakes

Tags: s3, aws, data-exposure, security

Medium Kubernetes pod stuck in CrashLoopBackOff

Diagnosis

kubectl get pods                        # see status
kubectl describe pod POD_NAME          # events and error details
kubectl logs POD_NAME --previous       # logs from crashed container
kubectl logs POD_NAME -c CONTAINER     # specific container logs

Solution

Read crash logs — usually reveals the root cause immediately
Common causes: missing environment variables, wrong image, misconfigured liveness probe
Check resource limits (OOMKilled means the pod ran out of memory): kubectl describe pod POD_NAME | grep -A5 Limits
Test image locally: docker run --rm IMAGE_NAME
Fix config then: kubectl rollout restart deployment/DEPLOYMENT_NAME

Tags: kubernetes, k8s, crashloop, pods

⚙️

DevOps Problems

24 problems

High Docker container exits immediately after start

Diagnosis

docker ps -a                   # see exited containers
docker logs CONTAINER_ID       # see what it printed before dying
docker inspect CONTAINER_ID    # full config and exit code
docker run -it --entrypoint /bin/sh IMAGE   # run interactively to debug (overrides any ENTRYPOINT)

Solution

Exit code 1 = application error — check logs for crash message
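Exit code 137 = killed by SIGKILL (128 + 9), most often the OOM killer: confirm with docker inspect -f '{{.State.OOMKilled}}' CONTAINER_ID, then increase the --memory limit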
Missing CMD/ENTRYPOINT — add to Dockerfile: CMD ["python", "app.py"]
Missing env vars — add -e VAR=value or use --env-file .env
Permission issue on mounted volumes — check :z flag on SELinux systems
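
The exit-code rules above follow a general convention: codes above 128 mean the process was terminated by signal number (code - 128). A small lookup sketch, with a hypothetical function name explain_exit_code; the wording of each message is illustrative.

```shell
# explain_exit_code CODE
# Prints a short interpretation of a container's exit code.
# Codes above 128 conventionally mean "terminated by signal (CODE - 128)".
explain_exit_code() {
  case "$1" in
    0)   echo "exited normally" ;;
    1)   echo "application error, check docker logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found, check CMD/ENTRYPOINT" ;;
    137) echo "SIGKILL (signal 9), often the OOM killer or docker kill" ;;
    143) echo "SIGTERM (signal 15), usually a normal docker stop" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $(( $1 - 128 ))"
      else
        echo "application-defined exit code"
      fi ;;
  esac
}

# Usage: explain_exit_code "$(docker inspect -f '{{.State.ExitCode}}' CONTAINER_ID)"
```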

Tags: docker, container, devops

High CI/CD pipeline failing on every push

Problem

GitHub Actions / Jenkins pipeline red on every commit. Tests pass locally but fail in CI.

Solution

Read the full CI log — don't just see "failed", read WHY
Common: missing secrets — add to GitHub Settings → Secrets and variables
Environment mismatch — CI uses different Node/Python version: pin it in workflow node-version: '20'
Missing test dependencies: add install step before test step
Reproduce locally: act tool runs GitHub Actions locally
Port conflicts between parallel jobs: use dynamic ports or sequential jobs

Tags: ci-cd, github-actions, jenkins, pipeline

Medium Git merge conflicts blocking deployment

Solution

See all conflicted files: git status | grep "both modified"
Open each file — resolve between <<<<<<< and >>>>>>> markers
Use visual tool: git mergetool (opens vimdiff or configured tool)
After resolving: git add . && git commit
Prevent conflicts: merge main into feature branches frequently, keep PRs small
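
A common slip is committing a file that still contains conflict markers. The sketch below checks a list of files for leftover markers before the commit; files_with_conflict_markers is a hypothetical name, and it looks only for the opening and closing marker lines that Git writes.

```shell
# files_with_conflict_markers FILE...
# Prints each given file that still contains a Git conflict marker line
# (seven "<" or ">" characters at the start of a line, followed by a space).
files_with_conflict_markers() {
  for f in "$@"; do
    grep -qE '^(<<<<<<<|>>>>>>>) ' "$f" && echo "$f"
  done
  return 0
}

# Usage after staging the resolved files:
# files_with_conflict_markers $(git diff --cached --name-only)
```

Empty output means none of the listed files contain marker lines.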

Tags: git, merge, conflict, devops

Medium Nginx 502 Bad Gateway error

Diagnosis

sudo tail -f /var/log/nginx/error.log   # see the actual error
systemctl status gunicorn              # is backend running?
curl http://localhost:8000             # test backend directly
ss -tlnp | grep 8000                  # is backend listening on expected port?

Solution

502 = nginx can't reach backend — start the backend service
Check proxy_pass port matches where backend actually runs
Increase timeouts if backend is slow: proxy_read_timeout 300; in nginx config (a pure timeout normally surfaces as 504 rather than 502)
Check unix socket permissions if using socket instead of port
Reload after config fix: sudo nginx -t && sudo systemctl reload nginx

Tags: nginx, 502, proxy, backend

🗄️

Database Problems

24 problems

High MySQL queries extremely slow — full table scans

Diagnosis

EXPLAIN SELECT * FROM users WHERE email = '[email protected]';
-- Look for "type: ALL" = full table scan, needs index
SHOW FULL PROCESSLIST;                    -- see running queries
SHOW VARIABLES LIKE 'slow_query_log%';    -- check whether the slow query log is enabled

Solution

Add index on searched column: CREATE INDEX idx_email ON users(email);
Enable slow query log: SET GLOBAL slow_query_log = 1; SET GLOBAL long_query_time = 1;
Analyze with EXPLAIN — every query plan should show index use, not ALL
Use composite index for multi-column WHERE: CREATE INDEX idx_name ON table_name(col1, col2);
Run ANALYZE TABLE to update statistics: ANALYZE TABLE users;

Tags: mysql, index, performance, slow-query

High Database connection pool exhausted

Problem

"Too many connections" error. Application throws connection errors under load.

Diagnosis

SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';
SHOW FULL PROCESSLIST;   -- see all open connections

Solution

Increase max connections: SET GLOBAL max_connections = 500;
Add connection pooling: use PgBouncer (PostgreSQL) or ProxySQL (MySQL)
Find connection leaks in code — ensure connections are closed after use
Kill idle connections: KILL CONNECTION_ID;
Set wait_timeout: SET GLOBAL wait_timeout = 60;

Tags: mysql, postgresql, connections, pool

High Accidentally deleted important data — no backup

Problem

DELETE FROM users; or DROP TABLE orders; without WHERE clause or backup.

Solution

Stop writes immediately to prevent overwriting deleted data
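MySQL: if binary logging was enabled, replay the binary log only up to just before the destructive statement (on top of the most recent dump, if one exists); replaying past it would repeat the delete: mysqlbinlog --stop-datetime="2026-07-03 08:00:00" /var/lib/mysql/mysql-bin.000001 | mysql -u root -p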
PostgreSQL: if WAL archiving was on, use Point-in-Time Recovery
Check if any replica still has the data: a delayed replica, or one whose replication can be stopped before it applies the delete, can be promoted or dumped
Lesson: always use transactions, always have backups, test restores regularly

Tags: data-recovery, backup, binlog, disaster

Medium Deadlock errors in database logs

Diagnosis

SHOW ENGINE INNODB STATUS\G   -- MySQL deadlock info
-- Look for "LATEST DETECTED DEADLOCK" section

Solution

Always access tables in the same order across all transactions
Keep transactions short — don't hold locks while doing non-DB work
Use lower isolation level if possible: READ COMMITTED instead of REPEATABLE READ
Add retry logic in application for deadlock errors (error code 1213 in MySQL)
Use SELECT ... FOR UPDATE only when you'll actually update
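
The retry advice above can be sketched for scripts that run SQL through the mysql command-line client. This is an illustration under assumptions: retry_on_deadlock and the RETRY_DELAY variable are hypothetical names, and the check relies on the client printing "ERROR 1213" to stderr when a deadlock aborts the transaction. Application code would normally do the same thing through its database driver.

```shell
# retry_on_deadlock MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND; if it fails and its stderr mentions MySQL error 1213
# (deadlock), waits briefly and tries again, up to MAX_ATTEMPTS times.
# Any other failure is returned immediately without retrying.
# RETRY_DELAY (hypothetical, default 1) is the base delay in seconds.
retry_on_deadlock() {
  max=$1; shift
  attempt=1
  errfile=$(mktemp) || return 1
  while :; do
    "$@" 2>"$errfile"; status=$?
    cat "$errfile" >&2                      # pass the command's stderr through
    if [ "$status" -eq 0 ] || ! grep -q 'ERROR 1213' "$errfile" || [ "$attempt" -ge "$max" ]; then
      rm -f "$errfile"
      return "$status"
    fi
    sleep "$(( attempt * ${RETRY_DELAY:-1} ))"   # linear backoff: 1x, 2x, 3x the base delay
    attempt=$((attempt + 1))
  done
}

# Usage (transfer.sql is a hypothetical file holding one transaction):
# retry_on_deadlock 3 sh -c 'mysql -u app -p"$DB_PASS" shop < transfer.sql'
```

Retrying the whole transaction, not a single statement, matters here because InnoDB rolls the entire transaction back when it picks a deadlock victim.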

Tags: deadlock, transactions, mysql, innodb

🖥️

Hardware Problems

22 problems

High Server overheating — thermal throttling

Diagnosis

sensors                        # CPU/GPU temps (install: apt install lm-sensors)
sudo sensors-detect            # first-time setup
cat /sys/class/thermal/thermal_zone*/temp  # raw temp in millidegrees Celsius (45000 = 45 °C)
dmesg | grep -i "thermal\|throttl"        # kernel thermal events
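
The raw thermal_zone values are easier to read once converted to degrees. A minimal sketch with two hypothetical helper names (millideg_to_celsius, hot_zones); the default 80 °C limit is an arbitrary example, not a vendor threshold.

```shell
# millideg_to_celsius VALUE
# Converts a raw thermal_zone reading (millidegrees Celsius) to degrees.
millideg_to_celsius() {
  awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# hot_zones [LIMIT_C]
# Prints "ZONE TEMP" for each thermal zone at or above LIMIT_C (default 80).
# Prints nothing on systems that expose no readable thermal zones.
hot_zones() {
  for f in /sys/class/thermal/thermal_zone*/temp; do
    [ -r "$f" ] || continue
    c=$(millideg_to_celsius "$(cat "$f")")
    awk -v c="$c" -v l="${1:-80}" 'BEGIN { exit !(c >= l) }' \
      && echo "$(basename "$(dirname "$f")") $c"
  done
  return 0
}

# Usage: hot_zones 75
```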

Solution

Clean dust from fans and heatsinks — compressed air every 6-12 months
Replace thermal paste on CPU — degrades after 3-5 years
Ensure proper airflow — hot exhaust not recirculating back as intake
Check fan speeds: sensors — if fans at 0 RPM they've failed
For servers: check data center ambient temp and cooling unit function

Tags: thermal, cpu, cooling, hardware

High Hard drive failing — S.M.A.R.T. errors

Diagnosis

sudo smartctl -a /dev/sda        # full SMART report
sudo smartctl -H /dev/sda        # overall health assessment
dmesg | grep -i "error\|ata\|I/O"  # kernel disk errors

Solution

If SMART shows "FAILED" — backup ALL data immediately, drive will die soon
Watch Reallocated_Sector_Ct: a non-zero raw value, and especially one that keeps growing, is a serious warning sign
Set up monitoring: sudo apt install smartmontools && sudo systemctl enable smartd
Check disk health weekly: add to cron smartctl -H /dev/sda | mail -s "Disk Health" [email protected]
Replace drive before it fails — RAID is not a backup
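
For the weekly cron check, it is more useful to alert only when the result is not a pass. The sketch below reduces smartctl -H output to a single word; smart_health_status is a hypothetical name, and it assumes the two result lines smartctl normally prints (the "overall-health self-assessment test result" line for ATA drives and "SMART Health Status" for SCSI/SAS).

```shell
# smart_health_status
# Reads `smartctl -H` output on stdin and prints PASSED, FAILED, or UNKNOWN.
# UNKNOWN means no health line was found (e.g. device missing or no SMART).
smart_health_status() {
  result=$(grep -E 'overall-health self-assessment test result|SMART Health Status' | head -n 1)
  case "$result" in
    *PASSED*|*": OK"*) echo PASSED ;;
    "")                echo UNKNOWN ;;
    *)                 echo FAILED ;;
  esac
}

# Usage: [ "$(sudo smartctl -H /dev/sda | smart_health_status)" = "PASSED" ] || echo "check /dev/sda"
```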

Tags: smart, disk, hdd, failure

Medium RAM causing random crashes and blue screens

Diagnosis

# Linux: run memtest86 from boot
# Or: sudo apt install memtester
sudo memtester 1G 1              # test 1GB of RAM, 1 pass
# Windows: mdsched.exe → Restart and check for problems

Solution

Run memtest86+ overnight — any errors = faulty RAM
If multiple sticks: remove one at a time to isolate the bad stick
Reseat RAM sticks — remove and firmly push back in
Try different RAM slots — could be motherboard slot fault
Replace faulty DIMM — RAM is relatively inexpensive

Tags: ram, memory, memtest, crash

📋

Compliance Problems

20 problems

High GDPR violation — user data processed without consent

Problem

Tracking users, storing personal data, or sending emails without proper GDPR consent mechanism.

Solution

Add cookie consent banner before loading any tracking (GA4, Meta Pixel, etc.)
Create and publish Privacy Policy and Cookie Policy pages
Implement consent management platform (CMP): Cookiebot, OneTrust, or open-source Klaro
Document your legal basis for processing: consent, legitimate interest, contract, etc.
Add "Delete My Data" mechanism for users to exercise right to erasure
Appoint DPO if processing sensitive data at scale

Tags: gdpr, privacy, consent, compliance

High PCI DSS — storing plain-text card data

Problem

Card numbers, CVVs, or full track data stored in database in plain text — massive PCI violation.

Solution

Delete all stored card data immediately
Use a payment processor (Stripe, Braintree) — never touch raw card data
Tokenize: processor gives you a token to charge later, card number never hits your server
Scope reduction: if card data never touches your systems (for example, a processor-hosted payment page or iframe), you may qualify for SAQ A, the shortest self-assessment questionnaire
Never store CVV under any circumstances — prohibited by PCI DSS

Tags: pci-dss, payments, compliance, stripe

Medium Audit log gaps — missing activity records

Problem

Security audit finds incomplete logs — who accessed what data and when is not tracked.

Solution

Enable database audit logging: MySQL audit plugin or PostgreSQL pgaudit extension
Log all admin actions: who changed what config, when, from which IP
Ship logs to immutable storage: attacker can't delete what they can't reach
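Set retention: PCI DSS requires at least 12 months of audit logs, with the most recent 3 months immediately available; SOC 2 does not prescribe a fixed period, though one year is a common choice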
Use structured logging (JSON) for easy querying and alerting

Tags: audit-logs, soc2, compliance, monitoring