Friday, December 31, 2021

NGINX was not responding after restart

TL;DR: NGINX was not launching correctly. Since the process was not writing any logs, I had to use strace to debug what was going on.

There was a weird thing going on with one of our NGINX servers. The sequence of events was like this:

1. Our server rebooted (after many, many months of uptime). 

2. After reboot, NGINX was running but not responding to requests

Even if I curled localhost like this, nothing happened:

root@amy:/tmp# curl -v http://localhost
* Rebuilt URL to: http://localhost/
* Hostname was NOT found in DNS cache
*   Trying ::1...
* Connected to localhost (::1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: localhost
> Accept: */*

Curl was stuck at that point.

3. Checked the error and access logs. Nothing had been written to them since the reboot.

That was weird...

4. Did a ps to check if the process was running at all. And it was. But I realized that no NGINX workers had been spawned after launch. How come?

This is what the output looked like:

* Connection #0 to host localhost left intact
root@amy:/tmp# ps aux | grep nginx
root       880  0.0  0.1  43424  5968 ?        Ss   08:40   0:00 nginx: master process /usr/local/nginx/sbin/nginx -g daemon on; master_process on;
root@amy:/tmp# 
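
That check is easy to script by counting worker processes in the ps output. This is only a sketch: count_nginx_workers is a hypothetical helper name, and it assumes workers carry the usual "nginx: worker process" title.

```shell
#!/bin/sh
# Hypothetical helper: count NGINX worker processes in `ps` output read from stdin.
# Assumes workers use the standard "nginx: worker process" process title.
count_nginx_workers() {
    # The [w] bracket keeps this grep from matching its own command line
    # when it shows up in the ps listing.
    # grep -c prints 0 and exits non-zero when nothing matches, so the
    # exit status is masked with `|| true`.
    grep -c 'nginx: [w]orker process' || true
}
```

Running `ps aux | count_nginx_workers` on a box in the state shown above should print 0, since only the master process is listed.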

5. Looked at journalctl's output to see if anything was going on with the service.

This was all I had:

Dec 31 08:40:11 amy systemd[1]: Stopping A high performance web server and a reverse proxy server...
Dec 31 08:40:11 amy systemd[1]: Stopped A high performance web server and a reverse proxy server.
Dec 31 08:40:15 amy systemd[1]: Starting A high performance web server and a reverse proxy server...
Dec 31 08:40:15 amy systemd[1]: Started A high performance web server and a reverse proxy server.

Nothing else.

6. So I had to resort to heavy machinery: strace

Launched strace, attaching it to NGINX's master process using its PID.

# strace -p 513 -s 10000 -v -f

In a different terminal, I reloaded NGINX:

# systemctl reload nginx

Then strace's output showed me why no workers were being spawned:

[pid   844] prctl(PR_SET_DUMPABLE, 1)   = 0
[pid   844] chdir("/tmp/cores")         = -1 ENOENT (No such file or directory)
[pid   844] write(16, "2021/12/31 08:38:08 [alert] 844#0: chdir(\"/tmp/cores\") failed (2: No such file or directory)\n", 93) = 93
[pid   846] fstat(20,  <unfinished ...>
[pid   844] exit_group(2)               = ?
[pid   844] +++ exited with 2 +++
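
With -v, -f and -s 10000 the trace gets long, so it can help to filter it down to syscalls that failed. A minimal sketch, assuming the trace is piped in or read from a saved file. failed_syscalls is a hypothetical helper name, and the pattern relies on strace's usual "= -1 ERRNO" suffix for failed calls.

```shell
#!/bin/sh
# Hypothetical helper: keep only strace lines for syscalls that returned an error.
# Reads strace output on stdin and prints the lines ending in "= -1 EXXX (...)".
failed_syscalls() {
    # `|| true` masks grep's non-zero exit status when no line matches.
    grep -E '= -1 E[A-Z0-9]+' || true
}
```

strace writes to stderr by default, so something like `strace -p PID -f 2>&1 | failed_syscalls` is needed to pipe it. Expect some noise (EAGAIN and friends are normal on a healthy server), but an ENOENT right before an exit_group is hard to miss.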

7. It turned out that a directory I had configured a long time ago, to investigate a SIGSEGV I was having, had been deleted on reboot. Each worker tried to chdir into /tmp/cores, failed with ENOENT and exited, so no workers ever came up. After I created the directory again, NGINX responded to my requests once more.
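
A minimal sketch of a guard that would have avoided this: make sure the directory exists before NGINX starts. ensure_core_dir is a hypothetical helper name, and where it gets called from (a boot script, a service pre-start hook) is an assumption left to the reader.

```shell
#!/bin/sh
# Hypothetical helper: create the core-dump directory if it is missing.
# Returns non-zero if the directory cannot be created.
ensure_core_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir" || return 1
        echo "created $dir"
    fi
    return 0
}
```

Calling `ensure_core_dir /tmp/cores` before starting NGINX recreates the path seen in the strace output. Moving the directory out of /tmp altogether is probably the more durable option, since /tmp is exactly the kind of place that gets wiped on reboot.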

====

End of (sad) story. Half an hour I will never get back. 

Happy New Year!