A record of troubleshooting and fixing a frequently crashing Java process

Preface

Recently, a Java service process in the business department kept dying suddenly for no apparent reason, leaving behind a pile of logs like hs_err_pid19287.log each time. The person in charge of the business sent me one of these hs_err_pidxxx logs and asked me to help look into the problem. This article reviews how I helped the business department troubleshoot it.

Troubleshooting process

First, the hs_err_pidxxx log showed the following:

Based on this, I asked the business department to configure ulimit. The specific steps are as follows:

vim /etc/security/limits.conf

# Append at the end
* soft nofile 327680
* hard nofile 327680
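
To verify the change actually took effect, a quick check like the one below can help (a minimal sketch; the PID 19287 is just the one from the crash log, standing in for the real process ID). Note that limits.conf is applied by PAM at login, so it only affects sessions opened after the change, and it does not apply to services started directly by systemd.

# New limits only apply to new login sessions:
ulimit -n

# For an already-running process, check the limits it actually inherited:
grep 'open files' /proc/19287/limits

# Services launched by systemd do not read limits.conf;
# they need LimitNOFILE= in their unit file instead.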

However, the person in charge of the business department told me that they had already added this, and it made no difference.

So I continued analyzing the contents of the hs_err_pidxxx log.

When I saw that the young generation was 100% full, I asked the business lead whether the JVM memory settings might be on the small side. The answer was that they had just recently expanded the machine's memory and adjusted the JVM settings accordingly, so memory should definitely be sufficient. Moreover, the application log contained no OOM-related messages.
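
To confirm on the spot whether the heap really is under pressure, watching GC behavior beats guessing. A minimal sketch using the JDK's own tools (again treating 19287 as the hypothetical PID):

# Sample GC utilization every second, 10 times; E/O columns stuck
# near 100 together with a climbing FGC count point to memory pressure:
jstat -gcutil 19287 1000 10

# On JDK 8, print the configured heap sizes and current usage:
jmap -heap 19287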

So I kept going through the hs_err_pidxxx log.

Seeing the large number of threads in _thread_blocked state, I felt the problem was within reach, so I told the business lead that their code might be blocking somewhere. He replied that this service had been running for years and the other machines were fine; if the code had this problem, it would normally have surfaced long ago.
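
A quick way to gauge how widespread the blocking is would have been to take a few thread dumps from a live instance and count thread states, for example (a sketch with a hypothetical PID):

# Take a thread dump of the running JVM:
jstack 19287 > /tmp/dump.txt

# Count threads in each state; many BLOCKED threads hint at lock contention:
grep 'java.lang.Thread.State' /tmp/dump.txt | sort | uniq -c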

That was all I could get out of the hs_err_pidxxx log. Watching the business lead's expression shift from hopeful to blank, I could tell he felt he had asked the wrong person. Eventually I suggested: why not upgrade the JDK to a newer minor version? Honestly, I was only floating the idea; a JDK upgrade can bring benefits, but it can also bring risks, especially in a project that has been running for years. To my surprise, the business lead replied that this was exactly what he had in mind.

The business lead then upgraded the JDK on the problematic machine, and the matter was put to rest for the time being.

Follow-up

Later, over a meal, an architect from the same department mentioned the business team's issue to me, and I learned that even after the JDK upgrade the process still kept dying, so the business lead had turned to the department's architect for help. Once I knew this, I went to the business lead myself and asked whether the problem had been solved, and got a negative answer.

Determined to see this through to the end, I first asked them for the messages log of the host machine, located at /var/log/messages. It contained the following information:

The log mentions abrt-server, so here is a bit of background.

What is abrt-server

abrt (Automatic Bug Reporting Tool) is an error reporting and tracking tool in the CentOS operating system. It automatically collects application and system error information and generates error reports. When an error occurs, abrt gathers relevant data such as the error message, stack trace, and core dump, and produces a report containing this and other useful debugging information.

Its records, including core dumps, are saved on disk, and over time these files keep growing and eating disk space. We can use abrt-cli list to see which process and trigger time each recorded crash corresponds to, and delete a record with abrt-cli rm [dump directory].

Example:

abrt-cli rm /var/spool/abrt/oops-2022-09-27-14:22:55-13596-0
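
Before deleting anything, it can be worth checking what is recorded and how much space it takes (a small sketch; the spool path matches the example above):

# List recorded crashes with their dump directories and times:
abrt-cli list

# See how much disk space the dumps occupy in total:
du -sh /var/spool/abrt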

Back to the topic. Looking through the contents of /var/log/messages:

After some research, I learned that this error was caused by the inability to create the ccpp dump file. But was this really why the Java process kept dying? To check, we compared the times at which the ccpp file could not be created against the timestamps of the generated hs_err_pidxxx logs.
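
One way to line the two up (a sketch; paths follow the defaults mentioned above, and the hs_err files live in the JVM's working directory):

# Times when the ccpp dump failed, according to the host log:
grep 'abrt-hook-ccpp' /var/log/messages

# Times when the JVM wrote its crash logs:
ls -lt --time-style=long-iso hs_err_pid*.log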

The time points basically matched, and /var/log/messages also contained this line:

Executable '/usr/local/tomcat/jdk1.8.0_291/bin/java' doesn't belong to any package and ProcessUnpackaged is set to 'no'

After confirming with the business lead that this was indeed the JDK currently used by the service, we could basically conclude that the inability to create the ccpp file was one of the reasons this business's Java process kept dying.

How to fix

Method 1: Change ProcessUnpackaged to yes

This parameter controls whether ABRT handles crashes of programs that were not installed from an RPM package (for example, software built from source). With yes, ABRT also tracks these unpackaged programs and generates the corresponding warning and error reports, catching more bugs. With no, ABRT ignores crashes in unpackaged applications and only tracks crashes in installed packages. Setting it to yes therefore widens ABRT's coverage and captures more crashing programs, though it may also produce some noise.

sed -i 's/ProcessUnpackaged = no/ProcessUnpackaged = yes/g' /etc/abrt/abrt-action-save-package-data.conf && systemctl restart abrtd.service

Or you can do it step by step

vim /etc/abrt/abrt-action-save-package-data.conf
ProcessUnpackaged = yes  
systemctl restart abrtd.service

However, there is another detail to pay attention to here: MaxCrashReportsSize, which caps the total size of stored crash data, defaults to 5000 (MB). We can adjust it to fit the situation, or set it to 0, which removes the size limit entirely; but setting it to 0 carries the risk of filling the disk, since core files are usually quite large.

You can modify the MaxCrashReportsSize parameter through the following configuration

vim /etc/abrt/abrt.conf
MaxCrashReportsSize = 0
systemctl restart abrtd.service

Or execute the following command

sed -i "s/MaxCrashReportsSize = 5000/MaxCrashReportsSize = 0/g"  /etc/abrt/abrt.conf && systemctl restart abrtd.service

Method 2: Disable the abrt-ccpp service

When abrt-hook-ccpp performs a crash dump, the memory it uses may exceed expectations or exceed what the system can provide, affecting other applications. So, alternatively, we can simply disable the core-dump hook with the following commands:

systemctl stop abrt-ccpp.service
systemctl disable abrt-ccpp.service
systemctl status abrt-ccpp.service
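
If you go this route, you can confirm the kernel is no longer piping core dumps into abrt; on these systems abrt-ccpp normally installs itself as the core_pattern handler and restores the default when stopped (a hedged check):

# After stopping the service, this should no longer show
# a pipe to /usr/libexec/abrt-hook-ccpp:
cat /proc/sys/kernel/core_pattern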

Summary

After the above changes were made, the business department observed the service for a period of time, and the frequent Java crashes did not recur.