[TroubleShooting] Jenkins Incident Response

Date: 2025/11/18
Category: Devops
Tags: TroubleShooting, CI/CD

Written: 2025.11.18

Related service: Jenkins CI / Docker Build Pipeline

Incident type: Jenkins hang caused by exhausted EBS I/O performance

Related incident report: [TroubleShooting] Jenkins Incident Analysis

1. Overview

This report records the actions taken after the Jenkins outage of 2025.11.17 23:50 ~ 00:10 and the resulting performance improvements.
The root cause was the following chain on the EBS (gp2) volume used by the Jenkins EC2 instance:
BurstBalance exhaustion → I/O stall → OS hang
As a result, the Jenkins UI, SSH, and Docker all stopped responding.
The document is organized as: incident analysis summary → actions taken → performance comparison.

2. Incident Cause Summary

Root Cause

• The Docker root (/var/lib/docker) on the Jenkins EC2 instance was located on the instance's root EBS volume (gp2).
• ReadOps/WriteOps surged during Docker build and image layer operations.
• Because gp2 relies on burst credits, the credits ran out and BurstBalance fell to 0%.
• VolumeTotalReadTime/WriteTime peaked at 119 seconds and VolumeQueueLength spiked.
  → The disk stopped responding, so processes blocked at the OS level.
• As a result:
  ◦ Jenkins UI: 504 Gateway Timeout
  ◦ SSH: connection not possible
  ◦ Docker: builds stalled
  ◦ EC2 Status Check: passed (not a hardware or network problem)

Main cause: storage I/O bottleneck (insufficient EBS performance)
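To make the burst-credit mechanism above concrete, here is a rough back-of-the-envelope sketch. It assumes the gp2 figures published in the AWS documentation (3 IOPS per GiB baseline with a 100 IOPS floor, bursts up to 3,000 IOPS, and a bucket of 5.4 million I/O credits). The function names are illustrative and are not part of the original setup.

```shell
#!/usr/bin/env bash
# Assumed gp2 parameters (taken from AWS documentation, not measured here).
GP2_BURST_IOPS=3000
GP2_CREDIT_BUCKET=5400000

# Baseline IOPS for a gp2 volume: 3 IOPS per GiB, never below 100.
gp2_baseline_iops() {
  local size_gib=$1
  local iops=$(( size_gib * 3 ))
  if (( iops < 100 )); then
    iops=100
  fi
  echo "$iops"
}

# Seconds a full credit bucket lasts under sustained burst-level I/O.
# Credits drain at (burst - baseline) per second, because the baseline refills them.
gp2_burst_seconds() {
  local size_gib=$1
  local baseline
  baseline=$(gp2_baseline_iops "$size_gib")
  if (( baseline >= GP2_BURST_IOPS )); then
    echo "n/a"   # volumes this large do not depend on burst credits
    return 0
  fi
  echo $(( GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline) ))
}

gp2_baseline_iops 20   # prints 100
gp2_burst_seconds 20   # prints 1862 (about 31 minutes)
```

Under these assumptions, the 20GB root volume has a baseline of only 100 IOPS, and a full credit bucket covers roughly half an hour of sustained burst I/O. Once the bucket is empty the volume is throttled to the baseline, which is consistent with BurstBalance reaching 0% shortly before the stall.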

[Figures: BurstBalance, VolumeTotalReadTime, VolumeQueueLength]
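The metrics named above are standard CloudWatch metrics in the AWS/EBS namespace, so they can also be pulled from the command line. The following is a minimal sketch that only assembles the arguments for `aws cloudwatch get-metric-statistics`. The helper name, the volume ID, and the time window are hypothetical placeholders, and the incident's time zone is not stated in this report, so the UTC timestamps would need adjusting.

```shell
#!/usr/bin/env bash
# Print the AWS CLI arguments for one EBS metric over a time window.
# Arguments: volume ID, metric name, start time, end time (ISO 8601, UTC),
# and an optional statistic (defaults to Minimum, which suits BurstBalance).
ebs_metric_args() {
  local vol_id=$1 metric=$2 start=$3 end=$4 stat=${5:-Minimum}
  printf '%s ' cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name "$metric" \
    --dimensions "Name=VolumeId,Value=$vol_id" \
    --start-time "$start" --end-time "$end" \
    --period 300 --statistics "$stat"
  echo
}

# Dry run: show the command that would be executed (placeholder volume ID and times).
echo "aws $(ebs_metric_args vol-0123456789abcdef0 BurstBalance \
  2025-11-17T14:00:00Z 2025-11-17T16:00:00Z)"
```

Printing the command instead of running it keeps the sketch free of side effects. Dropping the `echo` and running `aws $(ebs_metric_args ...)` would execute the query with whatever credentials are configured.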

3. Volume Layout: Before vs. After

Previously, Docker shared the OS volume (EBS, 20GB), so Docker build activity became a bottleneck for the entire file system. The layout was changed as follows.

Before

• Root EBS (gp2, 20GB)
  ◦ OS
  ◦ /var/lib/docker (Docker images/layers) → point of failure
• Jenkins Data EBS (gp2, 20GB)
  ◦ /var/jenkins_home (Jenkins state: job configuration, plugins, build history, etc.)
  ◦ Volume for preserving Jenkins data

After

• Root EBS (gp2, 20GB)
• Jenkins Data EBS (gp3, 50GB)
  ◦ /var/jenkins_home (unchanged)
• New Docker-dedicated EBS (gp3, 50GB, 6000 IOPS, throughput 125MB/s)
  ◦ Dedicated to /var/lib/docker
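A quick way to confirm that a host actually follows the "After" layout is to compare the device backing / with the one backing /var/lib/docker. This is a minimal sketch using `df -P`. The helper names are illustrative, and the check simply reports "not shared" when /var/lib/docker does not exist on the machine where it runs.

```shell
#!/usr/bin/env bash
# Print the block device (or filesystem source) that backs a path.
device_of() {
  df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Succeed when two paths are backed by the same device.
shares_device() {
  [ "$(device_of "$1")" = "$(device_of "$2")" ]
}

if shares_device / /var/lib/docker 2>/dev/null; then
  echo "WARNING: /var/lib/docker shares a device with /"
else
  echo "OK: /var/lib/docker is on its own device (or does not exist here)"
fi
```

On the "Before" layout this check would be expected to print the warning, since Docker's data directory lived on the root volume.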

4. Incident Response and Actions Taken

Right after the incident, the following actions were taken to resolve the Jenkins I/O bottleneck.

4.1 Adding and Attaching a Docker-Dedicated EBS Volume (gp3)

1) Terraform: create the Docker-dedicated EBS volume

    resource "aws_ebs_volume" "docker" {
      count             = length(local.docker_existing_ids) > 0 ? 0 : 1
      availability_zone = var.az
      size              = var.docker_ebs_size       # 50
      type              = var.docker_ebs_type       # gp3
      iops              = var.docker_ebs_iops       # 6000
      throughput        = var.docker_ebs_throughput # 125

      tags = {
        Name = "${var.prefix}-docker-data"
      }

      lifecycle {
        prevent_destroy = true
      }
    }
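Before applying values like these, it can help to sanity-check that the size, IOPS, and throughput fit together. The sketch below encodes the gp3 ratio rules as I understand them from the AWS documentation (3,000 IOPS and 125 MiB/s baseline, at most 500 IOPS per GiB, at most 0.25 MiB/s per provisioned IOPS). Absolute per-volume maximums are deliberately left out because they have changed over time, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 when a gp3 size/IOPS/throughput combination satisfies the
# ratio rules assumed in the text above; return 1 otherwise.
gp3_config_ok() {
  local size_gib=$1 iops=$2 throughput=$3
  (( iops >= 3000 )) || return 1              # below the included baseline
  (( throughput >= 125 )) || return 1         # below the included baseline
  (( iops <= size_gib * 500 )) || return 1    # max 500 IOPS per GiB
  (( throughput * 4 <= iops )) || return 1    # max 0.25 MiB/s per IOPS
  return 0
}

if gp3_config_ok 50 6000 125; then
  echo "50 GiB / 6000 IOPS / 125 MiB/s: ok"
fi
```

The configuration used in this report (50GB, 6000 IOPS, 125MB/s) passes these checks. The same IOPS on a 5 GiB volume would not, because it would exceed the per-GiB ratio.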

2) Terraform: attach the Docker volume to the EC2 instance

    resource "aws_volume_attachment" "docker" {
      device_name  = "/dev/sdg"
      volume_id    = local.docker_volume_id
      instance_id  = aws_instance.this.id
      force_detach = true

      lifecycle {
        ignore_changes = [volume_id]
      }
    }

4.2 Migrating /var/lib/docker via Userdata

At boot, the instance automatically:
1. Formats the Docker-dedicated EBS volume as ext4 (only if it has no filesystem yet)
2. Mounts it at /var/lib/docker
3. Registers it in /etc/fstab
4. Restarts the Docker service

1) Userdata (excerpt)

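    DOCKER_VOL_ID="${docker_volume_id}"
    DOCKER_MNT="/var/lib/docker"
    DOCKER_DEVICE=""

    # Wait up to about 4 minutes (120 tries x 2 s) for the volume to appear.
    # find_nvme_device is not shown in this excerpt.
    for i in $(seq 1 120); do
      if DOCKER_DEVICE=$(find_nvme_device "$DOCKER_VOL_ID"); then
        break
      fi
      echo "Retry $i/120: Docker volume not found..."
      sleep 2
    done

    # Abort rather than format or mount an empty device path.
    if [ -z "$DOCKER_DEVICE" ]; then
      echo "Docker volume $DOCKER_VOL_ID not found; aborting" >&2
      exit 1
    fi

    # 1) Format only if the device has no filesystem yet
    if ! blkid "$DOCKER_DEVICE" >/dev/null 2>&1; then
      mkfs.ext4 "$DOCKER_DEVICE"
    fi

    # 2) Stop docker
    systemctl stop docker || true

    # 3) Mount and register in fstab (mountpoint -q matches the exact path only)
    mkdir -p "$DOCKER_MNT"
    if ! mountpoint -q "$DOCKER_MNT"; then
      mount "$DOCKER_DEVICE" "$DOCKER_MNT"
      echo "$DOCKER_DEVICE $DOCKER_MNT ext4 defaults,nofail 0 2" >> /etc/fstab
    fi

    # 4) Restart docker
    systemctl start docker

With this, all Docker I/O goes to the Docker EBS volume (gp3) only, and the root disk and the Jenkins Data disk are almost unaffected.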

4.3 Keeping the Jenkins Data Volume

• The existing Jenkins Data volume (/mnt/jenkins_data → /var/jenkins_home inside the container) existed before the incident and was kept as-is in this work, with no formatting, deletion, or modification.
• In Userdata, the Jenkins volume is likewise mapped to its NVMe device and then only mounted at /mnt/jenkins_data.
Job configuration, plugins, and build history are therefore preserved, and no Jenkins data was lost as a result of these changes.
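The NVMe mapping mentioned above is needed because, on Nitro-based instances, a volume attached as /dev/sdg shows up as an NVMe device whose name does not follow the attachment name. The Userdata calls a `find_nvme_device` helper whose body is not included in this report. The sketch below is one hypothetical way to write it, assuming the udev symlinks under /dev/disk/by-id that embed the volume ID without its dash. `BY_ID_DIR` is an override added here only to make the function testable.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: resolve an EBS volume ID (vol-...) to its NVMe block device.
# Prints the device path and returns 0, or returns 1 if the volume is not visible yet.
find_nvme_device() {
  local vol_id=$1
  local dir=${BY_ID_DIR:-/dev/disk/by-id}
  local serial
  # The NVMe serial is the volume ID with the dash removed (vol-0abc -> vol0abc).
  serial=$(printf '%s' "$vol_id" | tr -d '-')
  local link="$dir/nvme-Amazon_Elastic_Block_Store_$serial"
  [ -e "$link" ] || return 1
  readlink -f "$link"
}

# Placeholder volume ID; on a machine without that volume this prints the fallback.
find_nvme_device vol-0123456789abcdef0 || echo "volume not attached (yet)"
```

Returning a non-zero status when the symlink is missing is what lets a retry loop like the one in the Userdata excerpt keep polling until the attachment completes. If the script is rendered with Terraform's `templatefile`, as the `${docker_volume_id}` placeholder suggests, a literal `${...}` in shell code such as `${BY_ID_DIR:-...}` would need to be escaped as `$${...}`.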

5. Performance and Stability Improvements

5.1 EBS Metrics

| Metric | During incident (gp2, shared with Docker) | After (Docker-dedicated gp3) | Improvement |
|---|---|---|---|
| VolumeTotalReadTime | up to 119 s | about 5~9 ms on average | about 13,000x improvement |
| VolumeTotalWriteTime | tens of seconds | a few ms | within normal range |
| VolumeQueueLength | 10~30 | 0 ~ 0.2 | queue cleared |
| BurstBalance | 0% | gp3 has no credit concept | structural fix |
| ReadOps/WriteOps | sharp spike + long tail | short spike, then straight back to 0 | I/O congestion resolved |
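The "about 13,000x" figure follows from dividing the 119-second peak by the upper end of the 5~9 ms range. The sketch below reproduces that arithmetic and also shows how an average per-operation latency can be derived from CloudWatch data as total time divided by operation count. The helper names are illustrative, and the sample values passed to `avg_latency_ms` are made up for the example rather than taken from the incident.

```shell
#!/usr/bin/env bash
# Ratio of two durations given in milliseconds (integer division).
improvement_factor() {
  local before_ms=$1 after_ms=$2
  echo $(( before_ms / after_ms ))
}

# Average latency per operation in ms: total time (s) / operation count * 1000.
avg_latency_ms() {
  awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t / n * 1000 }'
}

improvement_factor 119000 9   # prints 13222 (119 s vs. 9 ms)
improvement_factor 119000 5   # prints 23800 (119 s vs. 5 ms)
avg_latency_ms 4.5 900        # prints 5.0 (hypothetical: 4.5 s spent on 900 reads)
```

Note that the comparison sets a peak value against an average, so the factor is best read as an order-of-magnitude indication rather than a precise speedup.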

[Figures: VolumeReadBytes / WriteBytes, VolumeReadOps / WriteOps, VolumeTotalReadTime / WriteTime, VolumeQueueLength]

5.2 Jenkins Pipeline Performance Improvements

| Item | Before | After |
|---|---|---|
| Docker Build & Push stage | not measurable because of the outage | about 1 min 2 s |
| Total pipeline run time | not measurable because of the outage | completes reliably in around 1 min 35 s |
| Jenkins UI responsiveness | very slow / timeouts during builds | responds immediately even during builds |

[Figures: Before, After]

6. Conclusion

• The cause of the Jenkins outage was not the EC2 instance or the Jenkins Data volume, but a layout in which Docker I/O was concentrated on the shared root EBS volume (gp2).
• The existing Jenkins Data volume was kept as-is, and
• a Docker-dedicated gp3 EBS volume was added and /var/lib/docker was moved onto it, so that:
  ◦ the I/O bottleneck was eliminated, and
  ◦ both the stability and the performance of the Jenkins pipeline improved significantly.
This work can be summed up as a clear separation of roles:
"build traffic goes to the Docker volume, Jenkins state to the Data volume, and the OS to the root volume"
which raised the stability of Jenkins CI at the storage level.