Notes on a GBK encoding problem

Background

Distinguish between UTF-8 and GBK

GBK is an extension of the national standard GB2312 and stays backward compatible with it. It was designed for encoding Chinese text: Chinese characters take two bytes, while ASCII characters (including English letters) take one byte.

UTF-8 is an international encoding that covers the characters of most of the world's languages (Simplified Chinese, Traditional Chinese, English, Japanese, Korean, and others), and it is also compatible with ASCII.

Although GBK stores a Chinese character in two bytes where UTF-8 typically needs three, GBK covers only Chinese (plus ASCII), while UTF-8 covers the characters used all over the world, so many systems and frameworks now use UTF-8 by default.
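The size difference is easy to check with the JDK alone. The following is a minimal sketch; the class and method names (EncodingSizes, gbkLength, utf8Length) are made up for illustration, and it assumes the GBK charset is available in the JDK in use, which is the case for standard full JDK distributions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Assumption: the running JDK ships the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Number of bytes the string occupies when encoded as GBK.
    static int gbkLength(String s) {
        return s.getBytes(GBK).length;
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d"; // the character "中", written as an escape to avoid source-encoding issues
        System.out.println(gbkLength(chinese));  // 2
        System.out.println(utf8Length(chinese)); // 3
        System.out.println(gbkLength("A"));      // 1
        System.out.println(utf8Length("A"));     // 1
    }
}
```

ASCII characters cost one byte in both encodings; only the Chinese character differs.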

However, when integrating with other systems' interfaces, we often run into legacy systems that use GBK encoding, especially domestic payment systems. In that case, the project has to support both GBK and UTF-8.

How to make projects compatible with UTF-8 and GBK

We use Spring Boot 2.7. Our APIs use UTF-8 by default, and a few special APIs use GBK. In Spring Boot, when a request is received, CharacterEncodingFilter sets the content encoding before the request is parsed.

    @Override
    protected void doFilterInternal(
            HttpServletRequest request, HttpServletResponse response, FilterChain filterChain)
            throws ServletException, IOException {

        String encoding = getEncoding();
        if (encoding != null) {
            // Set the request encoding when it is forced or when the request does not declare one.
            if (isForceRequestEncoding() || request.getCharacterEncoding() == null) {
                request.setCharacterEncoding(encoding);
            }
            // Set the response encoding only when it is forced.
            if (isForceResponseEncoding()) {
                response.setCharacterEncoding(encoding);
            }
        }
        filterChain.doFilter(request, response);
    }
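The request branch of this filter reduces to a small decision rule. The sketch below restates that rule as a plain function so the outcomes can be checked without a servlet container; RequestEncodingDecision and effectiveEncoding are illustrative names, not part of Spring.

```java
public class RequestEncodingDecision {

    /**
     * Mirrors the request branch of the filter shown above.
     *
     * @param configured the encoding configured on the filter, or null
     * @param force      whether request encoding is forced
     * @param declared   the encoding declared by the request itself, or null
     * @return the encoding the request ends up with
     */
    static String effectiveEncoding(String configured, boolean force, String declared) {
        if (configured != null && (force || declared == null)) {
            return configured;
        }
        return declared;
    }

    public static void main(String[] args) {
        // Forced: the declared GBK is overridden.
        System.out.println(effectiveEncoding("UTF-8", true, "GBK"));   // UTF-8
        // Not forced: the declared GBK is respected.
        System.out.println(effectiveEncoding("UTF-8", false, "GBK"));  // GBK
        // Not forced, nothing declared: falls back to the configured encoding.
        System.out.println(effectiveEncoding("UTF-8", false, null));   // UTF-8
    }
}
```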

You can change the project's global encoding settings with the following configuration:

server:
  servlet:
    encoding:
      charset: UTF-8
      force-request: false
      force-response: false

Note that by default the charset is UTF-8 and the request encoding is forced on every request (the response encoding is not). What does that mean? Even if a request carries the header application/json;charset=GBK, it is not decoded according to the header; it is decoded as UTF-8. If the content is actually GBK, you get garbled text. The defaults are set where Spring Boot auto-configures the filter and resolves the force flags in shouldForce.
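The garbling itself can be reproduced in a few lines, independent of Spring. This is a minimal sketch with made-up names (GarbledDemo, decodeAs); it simply decodes GBK bytes with the wrong charset and then with the right one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbledDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Decode raw bytes with the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";        // "中文"
        byte[] gbkBytes = original.getBytes(GBK); // what a GBK client would send

        String wrong = decodeAs(gbkBytes, StandardCharsets.UTF_8);
        String right = decodeAs(gbkBytes, GBK);

        System.out.println(original.equals(wrong)); // false: the bytes are not valid UTF-8
        System.out.println(original.equals(right)); // true
    }
}
```

Once the bytes have been decoded with the wrong charset, the malformed sequences are replaced with U+FFFD, so the original text cannot be recovered afterwards. The charset has to be right at decoding time.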

    @Bean
    @ConditionalOnMissingBean
    public CharacterEncodingFilter characterEncodingFilter() {
        CharacterEncodingFilter filter = new OrderedCharacterEncodingFilter();
        filter.setEncoding(this.properties.getCharset().name());
        filter.setForceRequestEncoding(this.properties.shouldForce(Encoding.Type.REQUEST));
        filter.setForceResponseEncoding(this.properties.shouldForce(Encoding.Type.RESPONSE));
        return filter;
    }

    public boolean shouldForce(Type type) {
        Boolean force = (type != Type.REQUEST) ? this.forceResponse : this.forceRequest;
        if (force == null) {
            force = this.force;
        }
        if (force == null) {
            force = (type == Type.REQUEST);
        }
        return force;
    }
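To make the fallback order in shouldForce concrete, here is a standalone re-implementation of the same logic with the three nullable settings passed in as parameters. It is a simplified sketch, not Spring Boot's class; ForceDefaults and its parameter names are illustrative.

```java
public class ForceDefaults {
    enum Type { REQUEST, RESPONSE }

    // Same resolution order as shouldForce above:
    // the type-specific flag first, then the general flag, then the built-in default.
    static boolean shouldForce(Type type, Boolean force, Boolean forceRequest, Boolean forceResponse) {
        Boolean result = (type != Type.REQUEST) ? forceResponse : forceRequest;
        if (result == null) {
            result = force;
        }
        if (result == null) {
            result = (type == Type.REQUEST);
        }
        return result;
    }

    public static void main(String[] args) {
        // Nothing configured: requests are forced, responses are not.
        System.out.println(shouldForce(Type.REQUEST, null, null, null));   // true
        System.out.println(shouldForce(Type.RESPONSE, null, null, null));  // false
        // force=false switches off forcing for requests as well.
        System.out.println(shouldForce(Type.REQUEST, false, null, null));  // false
        // The type-specific flag wins over the general one.
        System.out.println(shouldForce(Type.REQUEST, true, false, null));  // false
    }
}
```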

We now have to integrate with a payment system that sends GBK-encoded content, using both GET and POST. Our interfaces default to UTF-8, so GBK can only be handled for the specific GBK interfaces. The following request scenarios need to be supported.

  1. The other party sends a POST request with GBK-encoded content, and the request header specifies the encoding: application/json;charset=GBK.
  2. The other party sends a POST request with GBK-encoded content, and the request header does not specify the encoding: application/json.
  3. The other party sends a GET request whose parameters are GBK encoded.

In the first case, we only need to turn off forced encoding. A request that declares charset=GBK is then parsed as GBK, and a request that declares no charset is still parsed as UTF-8 by default.

server:
  servlet:
    encoding:
      force: false

In the second and third cases, we need to turn off forced encoding first, and then add a high-priority filter that sets GBK on the affected requests (it must run before Spring parses the request). With the Tomcat container, this can be handled as follows.

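import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.charset.Charset;

import javax.servlet.DispatcherType;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

import lombok.extern.slf4j.Slf4j;
import org.apache.catalina.connector.Request;
import org.apache.catalina.connector.RequestFacade;
import org.springframework.boot.web.servlet.FilterRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Slf4j
@Configuration
public class GBKFilterConfig {

    private static final Charset GBK = Charset.forName("GBK");

    @Bean
    public FilterRegistrationBean<Filter> gbkFilter() {
        FilterRegistrationBean<Filter> registration = new FilterRegistrationBean<>();
        registration.setDispatcherTypes(DispatcherType.REQUEST);
        registration.setFilter(new Filter() {
            @Override
            public void init(javax.servlet.FilterConfig filterConfig) throws ServletException {
            }

            @Override
            public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                    throws IOException, ServletException {
                log.info("GBK Filter...");
                try {
                    // Tomcat-specific: reach the internal request wrapped by the facade.
                    RequestFacade req = (RequestFacade) request;
                    Field field = req.getClass().getDeclaredField("request");
                    field.setAccessible(true);
                    Request r = (Request) field.get(req);
                    org.apache.coyote.Request p = r.getCoyoteRequest();
                    // Force GBK for GET query-string parameters.
                    p.getParameters().setQueryStringCharset(GBK);
                    // Force GBK for POST requests whose header does not specify an encoding.
                    p.getParameters().setCharset(GBK);
                    p.setCharset(GBK);
                } catch (Exception e) {
                    log.error("Failed to force GBK encoding on the request", e);
                }
                // Continue the chain outside the try block so the request is never silently dropped.
                chain.doFilter(request, response);
            }

            @Override
            public void destroy() {
            }
        });
        // Servlet URL patterns use a trailing "/*" for prefix matching; Ant-style "/**" is not a servlet pattern.
        registration.addUrlPatterns("/api/gbk/*");
        registration.setName("gbkFilter");
        // Lowest order value = highest priority, so this runs before CharacterEncodingFilter.
        registration.setOrder(Integer.MIN_VALUE);
        return registration;
    }
}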
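For the GET case, it helps to see what a GBK client actually puts on the wire. The sketch below uses only the JDK's URLEncoder and URLDecoder to show that the same percent-encoded query value decodes correctly only with the charset it was encoded in. GbkQueryString and its helper methods are illustrative names, and the parameter value is a made-up example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class GbkQueryString {

    // Percent-encode a query value using the named charset.
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-decode a query value using the named charset.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // "中文"

        System.out.println(encode(original, "GBK"));   // %D6%D0%CE%C4
        System.out.println(encode(original, "UTF-8")); // %E4%B8%AD%E6%96%87

        String sent = encode(original, "GBK");
        System.out.println(original.equals(decode(sent, "GBK")));   // true
        System.out.println(original.equals(decode(sent, "UTF-8"))); // false
    }
}
```

A value such as %D6%D0%CE%C4 can be appended to a request under /api/gbk/ to check that the filter decodes the query string as GBK.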