Roshka Dev Team: December 2021

TL;DR: NGINX was not launching correctly. Since the process was not writing any logs, I had to use strace to find out what was going on.

There was a weird thing going on with one of our NGINX servers. The sequence of events was like this:

1. Our server rebooted (after many, many months of uptime).

2. After the reboot, NGINX was running but not responding to requests.

Even if I curled localhost like this, nothing happened:

root@amy:/tmp# curl -v http://localhost
* Rebuilt URL to: http://localhost/
* Hostname was NOT found in DNS cache
* Trying ::1...
* Connected to localhost (::1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: localhost
> Accept: */*
>

curl was stuck at that point.

3. Checked the error and access logs. Nothing had been written to them since the reboot.

That was weird...

4. Did a ps to check whether the process was running at all, and it was. But then I realized that no NGINX workers had been spawned after launch. How come?

This is what the output looked like:

* Connection #0 to host localhost left intact
root@amy:/tmp# ps aux | grep nginx
root 880 0.0 0.1 43424 5968 ? Ss 08:40 0:00 nginx: master process /usr/local/nginx/sbin/nginx -g daemon on; master_process on;
root@amy:/tmp#
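A quick way to confirm this symptom is to count the worker lines in the ps output. The sketch below assumes the default process title ("nginx: worker process"); the function name count_nginx_workers is made up for illustration.

```shell
# Hypothetical helper: count NGINX worker lines in ps-style output read from stdin.
# Assumes the default process title "nginx: worker process".
# The [w] bracket keeps the grep process itself from matching when fed by "ps aux".
count_nginx_workers() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true" hides that status.
    grep -c 'nginx: [w]orker process' || true
}

# Usage: ps aux | count_nginx_workers
```

With only the master process listed, as in the output above, this prints 0.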

5. Looked at journalctl's output to see if anything was going on with the service.

This was all I had:

Dec 31 08:40:11 amy systemd[1]: Stopping A high performance web server and a reverse proxy server...
Dec 31 08:40:11 amy systemd[1]: Stopped A high performance web server and a reverse proxy server.
Dec 31 08:40:15 amy systemd[1]: Starting A high performance web server and a reverse proxy server...
Dec 31 08:40:15 amy systemd[1]: Started A high performance web server and a reverse proxy server.

Nothing else.

6. So I had to resort to heavy machinery: strace

Launched strace, attaching it to the running NGINX process by its PID:

# strace -p 513 -s 10000 -v -f

In a different terminal, reloaded NGINX:

# systemctl reload nginx

Then strace's output showed me why no workers were being spawned:

[pid 844] prctl(PR_SET_DUMPABLE, 1) = 0
[pid 844] chdir("/tmp/cores") = -1 ENOENT (No such file or directory)
[pid 844] write(16, "2021/12/31 08:38:08 [alert] 844#0: chdir(\"/tmp/cores\") failed (2: No such file or directory)\n", 93) = 93
[pid 846] fstat(20, <unfinished ...>
[pid 844] exit_group(2) = ?
[pid 844] +++ exited with 2 +++
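strace on a busy process prints a lot, so it can help to keep only the calls that failed. This is a minimal sketch, assuming strace's usual "= -1 ERRNO (description)" line format; failed_syscalls is a hypothetical name. Many failures (ENOENT or EAGAIN on probing calls, for example) are normal, so the result still needs a human eye.

```shell
# Hypothetical helper: keep only strace lines whose syscall returned -1 with an errno name.
# Assumes the usual strace format: "... = -1 ENOENT (No such file or directory)".
failed_syscalls() {
    # "|| true" so an empty result does not count as an error for the caller.
    grep -E '= -1 E[A-Z0-9]+ \(' || true
}

# Usage (strace writes to stderr, hence the 2>&1):
#   strace -p <PID> -s 10000 -v -f 2>&1 | failed_syscalls
```

Run against the excerpt above, only the chdir("/tmp/cores") line survives.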

7. Turned out a directory I had configured a long time ago, to investigate a SIGSEGV I was having, was deleted on reboot, so the workers were failing to spawn. I created the directory again, and NGINX started responding to my requests once more.
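The fix itself is a one-liner, but a small guard makes it repeatable after the next reboot. This is only a sketch: ensure_dir is a hypothetical helper, the path is the one from the strace output, and the directory must also be accessible to whatever user the workers run as.

```shell
# Hypothetical helper: create a directory (and any missing parents) if it does not exist.
ensure_dir() {
    dir=$1
    [ -d "$dir" ] || mkdir -p "$dir"
}

# Usage (path taken from the strace output):
#   ensure_dir /tmp/cores && systemctl reload nginx
```

The chdir call presumably comes from a working_directory directive in the NGINX config, although I am not showing the config here. If so, pointing it at a location outside /tmp, or removing it once the debugging is done, would avoid the problem altogether.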

====

End of (sad) story. Half an hour I will never get back.

Happy New Year!

Roshka Dev Team

Friday, December 31, 2021

NGINX was not responding after restart