Resource Monitoring for Data Pipelines

When running data pipelines—especially in production—resource monitoring is critical to prevent slowdowns, crashes, or system-wide failures. Simple Linux command-line tools like top, htop, df -h, and free -h provide real-time visibility into system health and help you catch issues before they escalate.

1. Monitoring CPU & Processes: top and htop

top (Built-in, lightweight)

The top command gives a live view of system processes and CPU usage.

Shows:

  • CPU utilization (user, system, idle time)
  • Running processes and their CPU/memory consumption

[Screenshot: `top` command output]

Why it matters for pipelines:

  • Identify CPU bottlenecks during heavy transformations (e.g., Spark jobs, ETL scripts)
  • Detect runaway processes consuming excessive CPU
  • Spot when multiple pipelines overload the system

Tip: Press P inside top to sort by CPU usage.
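Because top is interactive, it is also worth knowing its batch mode (`-b`), which prints a snapshot you can log or pipe into scripts. A minimal sketch (the `-o` sort flag assumes the procps-ng version of top common on modern Linux):

```shell
# Take a single non-interactive snapshot of top (batch mode, one iteration)
# and keep the header plus the busiest processes.
top -b -n 1 | head -n 12

# Sort by memory instead of CPU (-o is supported by procps-ng top;
# older implementations may differ):
top -b -n 1 -o %MEM | head -n 12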

htop (Enhanced, user-friendly)

htop is an improved version of top with a more intuitive interface.

Features:

  • Color-coded CPU, memory, and swap usage
  • Easy process management (kill, renice)
  • Tree view of processes (great for pipeline dependencies)

[Screenshot: `htop` command output]

Pipeline use cases:

  • Visualize parallel jobs in distributed pipelines
  • Quickly terminate stuck tasks and spot zombie processes
  • Monitor thread-level activity in real time
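Terminating a stuck task can also be done from a script when htop's interactive kill (F9) isn't available. A minimal sketch, using a background `sleep` to stand in for a stuck pipeline worker:

```shell
# Simulate a stuck task with a background 'sleep':
sleep 300 &
STUCK_PID=$!

# Confirm it is running (ps works even where htop isn't installed):
ps -o pid=,comm= -p "$STUCK_PID"

# Ask it to exit cleanly (SIGTERM); escalate to 'kill -9' (SIGKILL)
# only as a last resort, since SIGKILL skips cleanup handlers:
kill -TERM "$STUCK_PID"
wait "$STUCK_PID" 2>/dev/null
echo "task ${STUCK_PID} terminated"
```

Preferring SIGTERM over SIGKILL gives the process a chance to flush buffers and remove temp files before exiting.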

2. Monitoring Memory Usage: free -h

The free -h command shows memory usage in a human-readable format.

Key metrics:

  • Used
  • Free
  • Buffers/cache
  • Swap usage
  • Available

[Screenshot: `free -h` command output]

Example use:

  • If your data pipeline loads large datasets into memory (e.g., Pandas, Spark), watch the available memory
  • If it drops too low, the system may start swapping, drastically slowing performance
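The "available" column can be checked from a script before (or during) a memory-hungry run. A minimal sketch, assuming a modern procps `free` where "available" is the seventh field of the `Mem:` line; the 2048 MiB threshold is an arbitrary example value:

```shell
# Read "available" memory in MiB (7th field of the "Mem:" line in free -m)
# and warn when it drops below a threshold. 2048 is an example -- tune it.
THRESHOLD_MB=2048
avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
echo "Available memory: ${avail_mb} MiB"
if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
    echo "WARNING: low memory -- pipeline may start swapping"
fi
```

Dropping this check into a pipeline's startup script is a cheap guard against launching a large load on an already-strained host.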

Best practice:

  • Ensure pipelines don’t consume all RAM—leave headroom for the OS and other services

3. Monitoring Disk Space: df -h

The df -h command displays disk usage across mounted filesystems.

Shows:

  • Total, used, and available disk space
  • Usage percentage per filesystem

[Screenshot: `df -h` command output]

Data pipelines often generate:

  • Temporary files
  • Logs
  • Intermediate datasets

If disk fills up:

  • Jobs may fail unexpectedly
  • Databases or services can crash

Common risk:

A pipeline writing large intermediate files (e.g., CSV/Parquet) can silently fill the disk, causing job failures or system instability.
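To find out *which* directory is eating the disk, `du` complements `df`. A minimal sketch; the scratch path is a placeholder (it defaults to `/tmp` here), so substitute your pipeline's actual output or temp directory:

```shell
# Total space used by the pipeline's scratch directory
# (SCRATCH_DIR is a placeholder -- point it at your temp/output dir):
SCRATCH_DIR=${SCRATCH_DIR:-/tmp}
du -sh "$SCRATCH_DIR" 2>/dev/null

# Largest items one level down, biggest first -- useful for spotting
# the intermediate file that is silently filling the disk:
du -h --max-depth=1 "$SCRATCH_DIR" 2>/dev/null | sort -hr | head -n 5
```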

Tip:

  • Watch for partitions approaching 90–100% usage
  • Clean up temp directories or rotate logs regularly
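The 90% watch can be automated with `df` and `awk`. A minimal sketch using `df -P`, whose POSIX one-line-per-filesystem output is easy to parse:

```shell
# Flag any filesystem at or above 90% usage. With df -P the fields are:
# filesystem, blocks, used, available, capacity (e.g. "42%"), mount point.
df -P | awk 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= 90) printf "ALERT: %s at %s%% (%s)\n", $1, use, $6
}'
echo "disk check done"
```

Run it from cron and route the ALERT lines to your notification channel of choice.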

Preventing Production Failures

By combining these tools, you can proactively protect your system:

  • High CPU usage (top/htop)
    Indicates inefficient code or too many parallel jobs

  • Low available memory (free -h)
    Risk of crashes or heavy swapping

  • High disk usage (df -h)
    Risk of failed writes and system instability

Practical Workflow for Data Engineers

  • Start your pipeline
  • Open another terminal and run:

  • htop → monitor CPU + processes
  • watch free -h → track memory over time
  • watch df -h → monitor disk growth

  • Look for abnormal spikes or steady resource exhaustion
  • Adjust batch sizes, parallelism, or memory allocation accordingly
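The workflow above can be collapsed into a single snapshot script and refreshed with `watch`. A minimal sketch (save it as, say, `snapshot.sh` — the name is just an example):

```shell
# One-shot health snapshot combining all three tools.
echo "=== $(date) ==="
echo "-- Top CPU consumers --"
top -b -n 1 | head -n 12 | tail -n 5
echo "-- Memory --"
free -h | head -n 2
echo "-- Disk --"
df -h | head -n 5
```

Wrap it for continuous monitoring in a second terminal: `watch -n 5 sh snapshot.sh`.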

Key Takeaways

  • These tools are lightweight, fast, and available on most Linux systems
  • They provide real-time insights into system health
  • Regular monitoring helps prevent crashes, optimize performance, and ensure stable production pipelines

Conclusion

Without proper monitoring:

  • Pipelines may crash unexpectedly
  • Systems can become unresponsive
  • Other production services may be impacted

With these tools:

  • You gain early warning signals
  • You can debug performance issues faster
  • You ensure stable, reliable data processing
