Friday, December 31, 2021

NGINX was not responding after restart

TL;DR: NGINX was not launching correctly. Since the process was not writing any logs, I had to use strace to debug what was going on.

There was a weird thing going on with one of our NGINX servers. The sequence of events was like this:

1. Our server rebooted (after many, many months of uptime). 

2. After reboot, NGINX was running but not responding to requests.

Even when I curled localhost like this, nothing came back:

root@amy:/tmp# curl -v http://localhost
* Rebuilt URL to: http://localhost/
* Hostname was NOT found in DNS cache
*   Trying ::1...
* Connected to localhost (::1) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: localhost
> Accept: */*

curl was stuck at that point.

3. Checked the error and access logs. Nothing had been written to them since the reboot.

That was weird...

4. Did a ps to check whether the process was running at all. It was, but I realized that no NGINX workers had been spawned after launch. How come?

This is what the output looked like:

* Connection #0 to host localhost left intact
root@amy:/tmp# ps aux | grep nginx
root       880  0.0  0.1  43424  5968 ?        Ss   08:40   0:00 nginx: master process /usr/local/nginx/sbin/nginx -g daemon on; master_process on;

5. Looked at journalctl's output to see if anything was going on with the service.

This was all I had:

Dec 31 08:40:11 amy systemd[1]: Stopping A high performance web server and a reverse proxy server...
Dec 31 08:40:11 amy systemd[1]: Stopped A high performance web server and a reverse proxy server.
Dec 31 08:40:15 amy systemd[1]: Starting A high performance web server and a reverse proxy server...
Dec 31 08:40:15 amy systemd[1]: Started A high performance web server and a reverse proxy server.

Nothing else.

6. So I had to resort to heavy machinery: strace

Launched strace, attaching it to the NGINX master process using its PID.

# strace -p 513 -s 10000 -v -f

In a different terminal, I reloaded NGINX:

# systemctl reload nginx

Then strace's output showed me why no workers were being spawned:

[pid   844] prctl(PR_SET_DUMPABLE, 1)   = 0
[pid   844] chdir("/tmp/cores")         = -1 ENOENT (No such file or directory)
[pid   844] write(16, "2021/12/31 08:38:08 [alert] 844#0: chdir(\"/tmp/cores\") failed (2: No such file or directory)\n", 93) = 93
[pid   846] fstat(20,  <unfinished ...>
[pid   844] exit_group(2)               = ?
[pid   844] +++ exited with 2 +++

7. Turned out a directory I had configured a long time ago, to investigate a SIGSEGV I was having, was deleted on reboot, so the workers were failing to start. After I created the directory again, NGINX was responding to my requests once again.
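
To avoid being bitten by this again, something like the following could run before NGINX starts. This is only a sketch under assumptions: I am assuming the chdir() target comes from a `working_directory` directive in the main config file, and `ensure_workdir` is a made-up helper name, not anything NGINX ships. It does not follow `include` files, and it does not touch ownership, so the directory may still need a chown to whatever user the workers run as.

```shell
#!/bin/sh
# Hypothetical helper: make sure the directory named by the
# working_directory directive exists before NGINX is started.
# Assumes the directive is written on a single line in the given file.
ensure_workdir() {
    conf="$1"
    # Take the first working_directory directive and strip the trailing ';'.
    dir=$(awk '$1 == "working_directory" { sub(/;$/, "", $2); print $2; exit }' "$conf")
    # No directive found: nothing to do.
    [ -n "$dir" ] || return 0
    # Create the directory (and any missing parents) if it is not there.
    [ -d "$dir" ] || mkdir -p "$dir"
}
```

Calling it with the path to the config, for example `ensure_workdir /usr/local/nginx/conf/nginx.conf` (that path is a guess based on where my binary lives), would recreate /tmp/cores if a reboot wiped it.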


End of (sad) story. Half an hour I will never get back. 

Happy New Year!

