When running data pipelines—especially in production—resource monitoring is critical to prevent slowdowns, crashes, or system-wide failures. Simple Linux command-line tools like top, htop, df -h, and free -h provide real-time visibility into system health and help you catch issues before they escalate.
1. Monitoring CPU & Processes: top and htop
top (Built-in, lightweight)
The top command gives a live view of system processes and CPU usage.
Shows:
- CPU utilization (user, system, idle time)
- Running processes and their CPU/memory consumption
Why it matters for pipelines:
- Identify CPU bottlenecks during heavy transformations (e.g., Spark jobs, ETL scripts)
- Detect runaway processes consuming excessive CPU
- Spot when multiple pipelines overload the system
Tip: Press P inside top to sort by CPU usage.
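top also works non-interactively, which is handy when you want to log CPU snapshots from a pipeline wrapper instead of watching a live screen. A minimal sketch using standard procps flags (-b for batch mode, -n 1 for a single snapshot):

```shell
# Print one plain-text snapshot of system load and the busiest
# processes; suitable for redirecting to a log file from a script.
top -b -n 1 | head -n 12
```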
htop (Enhanced, user-friendly)
htop is an improved version of top with a more intuitive interface.
Features:
- Color-coded CPU, memory, and swap usage
- Easy process management (kill, renice)
- Tree view of processes (great for pipeline dependencies)
Pipeline use cases:
- Visualize parallel jobs in distributed pipelines
- Quickly terminate stuck or zombie tasks
- Monitor thread-level activity in real time
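Because htop is interactive and not always installed on minimal servers, a scriptable approximation of its CPU-sorted view is plain ps, which is available almost everywhere. A sketch (the --sort option is GNU/Linux ps):

```shell
# Top 5 processes by CPU share, roughly htop's default ordering.
# --sort=-pcpu sorts descending by CPU usage.
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 6
```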
2. Monitoring Memory Usage: free -h
The free -h command shows memory usage in a human-readable format.
Key metrics:
- Used
- Free
- Buffers/cache
- Swap usage
- Available (an estimate of how much memory new workloads can claim without swapping; usually more meaningful than "Free")
Example use:
- If your data pipeline loads large datasets into memory (e.g., Pandas, Spark), watch the available memory
- If it drops too low, the system may start swapping, drastically slowing performance
Best practice:
- Ensure pipelines don’t consume all RAM—leave headroom for the OS and other services
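That headroom check is easy to script. A minimal sketch that warns when available memory falls below a threshold (the 512 MiB limit is an illustrative value, not a recommendation; on modern procps, column 7 of free -m is "available"):

```shell
# Warn when available memory drops below THRESHOLD_MB.
THRESHOLD_MB=512   # illustrative threshold; tune for your host
avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
  echo "WARNING: only ${avail_mb} MiB of memory available"
fi
```

A wrapper like this can run alongside the pipeline and page you before swapping starts.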
3. Monitoring Disk Space: df -h
The df -h command displays disk usage across mounted filesystems.
Shows:
- Total, used, and available disk space
- Usage percentage per filesystem
Data pipelines often generate:
- Temporary files
- Logs
- Intermediate datasets
If disk fills up:
- Jobs may fail unexpectedly
- Databases or services can crash
Common risk:
A pipeline writing large intermediate files (e.g., CSV/Parquet) can silently fill the disk, causing job failures or system instability.
Tip:
- Watch for partitions approaching 90–100% usage
- Clean up temp directories or rotate logs regularly
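The 90% rule above can be automated rather than eyeballed. A sketch using df -P, whose POSIX one-line-per-filesystem output is safer to parse than the default format:

```shell
# Print mount point and usage for filesystems at or above 90% full.
df -P | awk 'NR > 1 { pct = $5; sub(/%/, "", pct); if (pct + 0 >= 90) print $6, $5 }'
```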
Preventing Production Failures
By combining these tools, you can proactively protect your system:
- High CPU usage (top/htop): inefficient code or too many parallel jobs
- Low available memory (free -h): risk of crashes or heavy swapping
- High disk usage (df -h): risk of failed writes and system instability
Practical Workflow for Data Engineers
- Start your pipeline
- Open another terminal and run:
  - htop → monitor CPU + processes
  - watch free -h → track memory over time
  - watch df -h → monitor disk growth
- Look for abnormal spikes or steady resource exhaustion
- Adjust batch sizes, parallelism, or memory allocation
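If you would rather keep a record than watch terminals, the workflow above can be wrapped into a tiny logger. The snapshot function, the 30-second interval, and the log filename are all illustrative choices:

```shell
# One resource snapshot: timestamp, available memory, root-disk usage.
snapshot() {
  date
  free -h | awk '/^Mem:/ {print "mem available:", $7}'
  df -hP / | awk 'NR == 2 {print "disk used on /:", $5}'
}

# While the pipeline runs, append a snapshot every 30 seconds:
#   while sleep 30; do snapshot >> pipeline_resources.log; done
```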
Key Takeaways
- These tools are lightweight, fast, and available on most Linux systems
- They provide real-time insights into system health
- Regular monitoring helps prevent crashes, optimize performance, and keep production pipelines stable
Conclusion
Without proper monitoring:
- Pipelines may crash unexpectedly
- Systems can become unresponsive
- Other production services may be impacted
With these tools:
- You gain early warning signals
- You can debug performance issues faster
- You ensure stable, reliable data processing