Customizing a JavaScript runtime – Part 1: Prerequisite knowledge

The front-end guy compiled a tutorial on customizing a JavaScript runtime based on V8. In this part, I will first review the relevant prerequisite knowledge.

1 source file

// demo.c
#include<stdio.h>
int main() {
    printf("hello\n");
}
  • example:
    • C/C++ .c .cpp
    • Rust .rs
    • Go .go

2 object file

After the compilation step, the source file is compiled into an object file. The object file contains binary machine code and some meta-information:

gcc -c demo.c

-c tells gcc to only compile, not link. As a result, each source file produces a corresponding binary object file. The content of the object file is already machine code, but it cannot be executed yet.

On some operating systems, object files and executable files use the same format, such as ELF on Linux. On Windows, object files use the COFF format, while executable files use the PE format.
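Each of these formats can usually be recognized from the first few bytes of the file. The following is a minimal sketch rather than a complete detector: detect_format is a hypothetical helper name, it also checks for Mach-O (the macOS format that shows up in the objdump output later in this article) and for ar archives, and it skips COFF object files because they have no fixed signature at the start.

```c
#include <stddef.h>
#include <string.h>

/* Guess a binary's container format from its leading "magic" bytes.
 * Returns a static string, or "unknown" if nothing matches. */
const char *detect_format(const unsigned char *bytes, size_t length) {
    if (length >= 8 && memcmp(bytes, "!<arch>\n", 8) == 0)
        return "ar archive";        /* static library (.a) */
    if (length >= 4 && memcmp(bytes, "\x7f" "ELF", 4) == 0)
        return "ELF";               /* Linux object file, executable or .so */
    if (length >= 4 && memcmp(bytes, "\xcf\xfa\xed\xfe", 4) == 0)
        return "Mach-O 64-bit";     /* magic 0xfeedfacf stored little-endian */
    if (length >= 2 && memcmp(bytes, "MZ", 2) == 0)
        return "PE";                /* Windows executables begin with a DOS "MZ" header */
    return "unknown";
}
```

To try it on a real file, read the first 8 bytes with fread and pass them in.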

You can use objdump to parse the generated demo.o object file. Examples will be given later.

2.1 header file

A short detour into another C/C++ concept: the header file. For example, stdio.h in the #include<stdio.h> line of the previous demo.c is a header file.

When we compile C/C++ and want to use something external, such as printf in the demo, which is an interface of the C standard library, we need to include the corresponding header file. The job of the header file is to tell the compiler what the declaration of the printf function used in demo.c looks like, so that the compiler can generate the object file correctly.

However, in the symbol table of the object file at this point, printf is still undefined; resolving it is the job of the next step, linking. Using demo2.o from the static library chapter below, we can see this directly with the nm tool:

nm – list symbols from object files

$ nm demo2.o
                 U _add
0000000000000000 T _main
                 U _printf

The U in front of the two symbols _add and _printf means that they have not been defined yet. If you look at the demo2 executable file after the final link, the result is:

$ nm demo2
0000000100000000 T __mh_execute_header
0000000100003f80 T _add
0000000100003f50 T _main
                 U _printf

The _add symbol is now defined, but _printf is still undefined. demo2 runs normally anyway, because _printf is dynamically linked: when the program starts, the libSystem library is dynamically linked into our demo2.
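To get a feel for which source constructs lead to which nm entries, here is a small sketch. The names call_count, twice, doubled_and_counted and report are made up for illustration, and the letters in the comments are what nm typically prints; the exact letter for data symbols varies by platform.

```c
#include <stdio.h>

/* A global variable defined in this file: nm lists it with a data-type
 * letter (which one depends on the platform and on initialization). */
int call_count = 0;

/* static gives the function internal linkage: if it is emitted at all,
 * nm marks it with a lowercase t (local text symbol). */
static int twice(int value) {
    return value * 2;
}

/* An ordinary function defined here: uppercase T (global text symbol). */
int doubled_and_counted(int value) {
    call_count += 1;
    return twice(value);
}

/* Also T. printf is only declared (via stdio.h), so the object file
 * lists it as U until the linker or the dynamic loader resolves it. */
void report(void) {
    printf("calls so far: %d\n", call_count);
}
```

Compiling this with gcc -c and running nm on the resulting object file should show the same mix of defined and undefined entries as demo2.o above.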

2.2 stdio.h is not mysterious

As mentioned earlier, we include the stdio.h header file only to tell the compiler that we are going to use a function called printf and what its declaration looks like, so that it has enough information to complete the compilation. So if we know the declaration of printf, we don’t need to include stdio.h at all:

1 Write your own header file

// mystdio.h
int printf(const char *format, ...);

2 Change the code to include our own header file

// #include <stdio.h>
// #include <...> is used to include a header file from the system search path
// #include "..." is used to include a header file from this project
// https://gcc.gnu.org/onlinedocs/cpp/Include-Syntax.html
#include "mystdio.h"

3 Compile, link and execute

$ gcc -o demo demo.c
# generate demo
$ ./demo
hello
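The same trick works for any library function whose declaration we know. As a small sketch, the file below declares atoi and abs by hand instead of including stdlib.h; magnitude_of is a hypothetical helper name used only for this illustration.

```c
/* Hand-written declarations, standing in for #include <stdlib.h>.
 * They must match the real signatures exactly. */
int atoi(const char *text);
int abs(int value);

/* Parse a decimal string and return its absolute value. */
int magnitude_of(const char *text) {
    return abs(atoi(text));
}
```

The compiler is satisfied by the declarations alone; the actual machine code for atoi and abs is supplied later, when the C library is linked in.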

2.3 object file has nothing to do with source code language

There is an interesting point that some people may not realize: the object file compiled from the previous C demo no longer has anything to do with C, because the C code has already become machine code. This means that we can link together object files compiled from different languages, for example C, Rust and Go.

Of course, it is not that simple in practice. Binaries compiled from different high-level languages will most likely not understand each other (further reading: "Application binary interface"). So to actually do this, you may need some additional glue work, such as https://www.swig.org/

3 executable file

The job of the linker is to combine a set of object files or archive files, relocate their data, bind symbol references, and generate an executable file.

In the linking step, the object file is linked into an executable file:

gcc -o demo demo.o

Or one command to complete both compilation and linking:

gcc -o demo demo.c

The linking shown here uses gcc rather than the ld tool directly, because gcc supplies some default configuration that simplifies our work. For example, you can try to use ld directly, but there is a high probability that something will go wrong:

$ ld demo.o
ld: Undefined symbols:
  _printf, referenced from:
      _main in demo.o

This is because demo.c uses the printf function declared in stdio.h, but when ld tries to generate the executable file, the library that implements printf has not been linked with our demo.o object file. As a result, the symbol _printf is still undefined when linking finishes.

Adding the -lSystem parameter asks ld to link the System library, which provides printf on macOS, into the generated executable file:

$ ld demo.o -lSystem
ld: library 'System' not found

Wrong again: ld doesn’t know where to find the System library (on macOS, the stdio functions are provided by the libSystem library), so we have to provide the search path:

$ ld demo.o -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem

4 static library

static link library

  • suffix
    • Under linux/macos
      • .a archive
    • under windows
      • .lib library

Here we explain the .a archive. As the name indicates, a .a static link library file is really just a group of object files packaged together. You can try it with the following demo:

1 Write library source code:

// mylib.h
int add(int x, int y);

// mylib.c
int add(int x, int y) {
  return x + y;
}

2 Build a .a static link library

gcc -c mylib.c
# Compile an object file called mylib.o
ar cr libmylib.a mylib.o
# Use the ar (archive) tool with the options c (create the archive) and r (insert members)
# to create an archive file named libmylib.a
# which contains the object file mylib.o

3 Write an application that uses the mylib library

// demo2.c
#include<stdio.h>
#include "mylib.h"

int main() {
  printf("%d\n", add(1,2));
}

4 Compile and link

gcc -c demo2.c
# Generate demo2.o
gcc -o demo2 demo2.o -L. -lmylib
# -L<xxx> tells the linker where to search for library files. Because our libmylib.a is not in the standard search path, we need to specify it explicitly.
# -lmylib tells the linker to link the library mylib into the product.

5 Run the demo2 executable file generated by linking

$ ./demo2
3

5 shared library

dynamic link library

  • suffix
    • Under linux
      • .so shared object
    • Under macos
      • .dylib dynamic library
    • under windows
      • .dll dynamic-link library

Unlike a static library, depending on a shared library does not make the linker merge the contents of the library into the final executable file. The linker only records some meta-information, which the operating system uses at run time to find the shared library and link it with our executable file.

Here we explain .so. Unlike .a, a shared object is not a simple collection of object files; it is itself the product of a linker combining multiple object files. We continue to demonstrate the use of a shared library based on the static library chapter:

1 The library source code and demo source code are the same, no adjustment is required.

2 Build a .so shared library

gcc -shared -o libmylib.so mylib.o
# Build a shared library called libmylib.so from mylib.o

3 Compile and link

gcc -o demo3 demo2.o -L. -lmylib
# Generate an executable file named demo3
# Note that except for the name of the generated executable file changing from demo2 to demo3, the command is exactly the same as for the static library!

4 Run the demo3 executable file generated by linking

$ ./demo3
3

The third step may be confusing: why does the linker use the shared library instead of the static library this time, when the command is exactly the same?

Because the linker prefers shared libraries by default: if it finds a .so shared library, it uses that in preference to the .a static library.

How can we confirm that demo2 and demo3 are indeed linked to the mylib library in different ways? One way is to compare file sizes: generally speaking, a dynamically linked product is smaller than a statically linked one. But mylib in our example is too small for the difference to be obvious. Instead, you can use tools to list the libraries that an executable file depends on. On macOS you can use otool:

$ otool -L demo2
demo2:
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.0.0)

$ otool -L demo3
demo3:
        libmylib.so (compatibility version 0.0.0, current version 0.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.0.0)

You can see that demo2 depends on libSystem.B.dylib. As mentioned before, this library provides printf, and here you can see that it is dynamically linked to our demo2.

demo3 additionally depends on the shared library libmylib.so. There is no such entry for demo2, because the content of libmylib.a was merged directly into the demo2 file.
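The lookup that the dynamic loader performs automatically at startup can also be done by hand, which makes the idea of "finding a symbol at run time" more concrete. The sketch below uses the POSIX dlopen/dlsym interface, a manual mechanism that is related to, but not the same as, the automatic linking shown above. abs_via_runtime_lookup is a hypothetical name, and the sketch assumes the C library is already loaded into the process (on older glibc versions you may also need to link with -ldl).

```c
#include <dlfcn.h>
#include <stddef.h>

typedef int (*int_unary_fn)(int);

/* Look up the C library's abs() by name at run time and call it.
 * Returns -1 if the handle or the symbol cannot be obtained. */
int abs_via_runtime_lookup(int value) {
    /* NULL asks for a handle covering the running program and the
     * shared libraries that were loaded with it. */
    void *handle = dlopen(NULL, RTLD_LAZY);
    if (handle == NULL)
        return -1;

    int_unary_fn found = NULL;
    /* The POSIX-suggested idiom for storing dlsym's void* result
     * into a function pointer. */
    *(void **)(&found) = dlsym(handle, "abs");

    int result = (found != NULL) ? found(value) : -1;
    dlclose(handle);
    return result;
}
```

Nothing in this file references abs as a symbol; only the string "abs" appears, and the address is resolved while the program is running.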

6 What’s in object file/shared object/executable file

According to my own experience, if you can understand these products at the bottom level, you will have a more intuitive and in-depth understanding of the subject. It doesn’t need to be very in-depth, just an intuitive impression. Using the objdump tool we can parse these products.

First, let’s look at the contents of a relatively simple mylib.o object file:

1 Compile mylib.o

gcc -c -fno-asynchronous-unwind-tables mylib.c
# -fno-asynchronous-unwind-tables keeps things we don’t care about out of the object file, making our example as simple as possible

2 Use objdump to parse mylib.o

$ objdump -s -d mylib.o
# -s displays the full contents of all non-empty sections
# -d disassembles the machine code in the file

mylib.o:        file format mach-o 64-bit x86-64
# mach-o is the object file format used by macOS; it is also used for other products such as executable files and dynamic link libraries.
# 64-bit x86-64: explanation omitted
# This line is derived from meta-information stored in our object file
Contents of section __TEXT,__text:
# Compiled machine code usually lives in the text segment
 0000 554889e5 897dfc89 75f88b45 fc0345f8  UH...}..u..E..E.
 0010 5dc3                                 ].

Disassembly of section __TEXT,__text:
# As mentioned earlier, compiled machine code usually exists in a text segment. Here is the disassembly result of that piece of machine code.

0000000000000000 <_add>:
# Compiled machine code of the add function in our mylib
# Format: address: machine code bytes, then the corresponding assembly code
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: 89 7d fc                      movl    %edi, -4(%rbp)
       7: 89 75 f8                      movl    %esi, -8(%rbp)
       a: 8b 45 fc                      movl    -4(%rbp), %eax
       d: 03 45 f8                      addl    -8(%rbp), %eax
      10: 5d                            popq    %rbp
      11: c3                            retq

3 Compile demo2.o

$ gcc -c -fno-asynchronous-unwind-tables demo2.c

4 Use objdump to parse demo2.o

$ objdump -s -d demo2.o

demo2.o:        file format mach-o 64-bit x86-64
Contents of section __TEXT,__text:
 0000 554889e5 bf010000 00be0200 0000e800  UH..............
 0010 00000089 c6488d3d 0b000000 b000e800  .....H.=........
 0020 00000031 c05dc3                      ...1.].
Contents of section __TEXT,__cstring:
# String literals from the code
 0027 25640a00                             %d..
# 25:%
# 64:d
# 0a:\n
# 00:\0 Do you still remember this from C language class? A string in C ends with \0; when you read \0, you know the string has ended.

Disassembly of section __TEXT,__text:

0000000000000000 <_main>:
# The machine code compiled by our main function
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: bf 01 00 00 00                movl    $1, %edi
       9: be 02 00 00 00                movl    $2, %esi
       e: e8 00 00 00 00                callq   0x13 <_main+0x13>
      13: 89 c6                         movl    %eax, %esi
      15: 48 8d 3d 0b 00 00 00          leaq    11(%rip), %rdi          ## 0x27 <_main+0x27>
      1c: b0 00                         movb    $0, %al
      1e: e8 00 00 00 00                callq   0x23 <_main+0x23>
      23: 31 c0                         xorl    %eax, %eax
      25: 5d                            popq    %rbp
      26: c3                            retq
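The __cstring bytes from the dump above can be checked directly in C. This is a small sketch; cstring_section and c_string_length are illustrative names, and the byte values are copied from the objdump output of demo2.o (an ASCII-compatible character set is assumed).

```c
#include <stddef.h>

/* The four bytes objdump showed in __TEXT,__cstring of demo2.o:
 * '%' 'd' '\n' and the terminating \0. */
static const unsigned char cstring_section[] = { 0x25, 0x64, 0x0a, 0x00 };

/* Count the bytes before the terminating \0, the way strlen does. */
size_t c_string_length(const unsigned char *bytes) {
    size_t length = 0;
    while (bytes[length] != 0x00)
        length++;
    return length;
}
```

Passing cstring_section to printf as a format string, together with an int argument, prints that number followed by a newline, exactly as the "%d\n" literal in demo2.c does.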

5 Finally, take a look at the demo2 executable file after linking

$ objdump -s -d demo2

demo2:  file format mach-o 64-bit x86-64
Contents of section __TEXT,__text:
# add function
 100003f40 554889e5 897dfc89 75f88b45 fc0345f8  UH...}..u..E..E.
 100003f50 5dc30000 00000000 00000000 00000000  ]...............
# main function
 100003f60 554889e5 bf010000 00be0200 0000e8cd  UH..............
 100003f70 ffffff89 c6488d3d 11000000 b000e804  .....H.=........
 100003f80 00000031 c05dc3                      ...1.].
Contents of section __TEXT,__stubs:
 100003f87 ff257300 0000                        .%s...
Contents of section __TEXT,__cstring:
 100003f8d 25640a00                             %d..
Contents of section __TEXT,__unwind_info:
# Irrelevant, omitted
 100003f94 01000000 1c000000 00000000 1c000000  ................
 100003fa4 00000000 1c000000 02000000 403f0000  ............@?..
 100003fb4 40000000 40000000 873f0000 00000000  @...@....?......
 100003fc4 40000000 00000000 00000000 00000000  @...............
 100003fd4 03000000 0c000200 14000200 00000000  ................
 100003fe4 20000001 00000000 00000001 00000000   ...............
Contents of section __DATA_CONST,__got:
# Irrelevant, omitted
 100004000 00000000 00000080                    ........

Disassembly of section __TEXT,__text:

# Remember the address of the add function 0x100003f40
0000000100003f40 <_add>:
100003f40: 55                           pushq   %rbp
100003f41: 48 89 e5                     movq    %rsp, %rbp
100003f44: 89 7d fc                     movl    %edi, -4(%rbp)
100003f47: 89 75 f8                     movl    %esi, -8(%rbp)
100003f4a: 8b 45 fc                     movl    -4(%rbp), %eax
100003f4d: 03 45 f8                     addl    -8(%rbp), %eax
100003f50: 5d                           popq    %rbp
100003f51: c3                           retq
                ...
100003f5e: 00 00                        addb    %al, (%rax)

0000000100003f60 <_main>:
100003f60: 55                           pushq   %rbp
100003f61: 48 89 e5                     movq    %rsp, %rbp
100003f64: bf 01 00 00 00               movl    $1, %edi
100003f69: be 02 00 00 00               movl    $2, %esi
# Call the add function; as noted earlier, its address is 0x100003f40
100003f6e: e8 cd ff ff ff               callq   0x100003f40 <_add>
100003f73: 89 c6                        movl    %eax, %esi
100003f75: 48 8d 3d 11 00 00 00         leaq    17(%rip), %rdi          ## 0x100003f8d <_printf+0x100003f8d>
100003f7c: b0 00                        movb    $0, %al
# Call printf; the target address 0x100003f87 is in the __stubs section below
100003f7e: e8 04 00 00 00               callq   0x100003f87 <_printf+0x100003f87>
100003f83: 31 c0                        xorl    %eax, %eax
100003f85: 5d                           popq    %rbp
100003f86: c3                           retq

Disassembly of section __TEXT,__stubs:

0000000100003f87 <__stubs>:
# printf lives in libSystem, which is dynamically linked, so things are different here: the stub jumps through the pointer stored at 0x100004000, in the __got section shown above
100003f87: ff 25 73 00 00 00            jmpq    *115(%rip)              ## 0x100004000 <_printf+0x100004000>
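The addresses objdump prints next to callq, leaq and jmpq are not stored in the file as such. Each of these instructions ends in a 32-bit little-endian displacement, and the target is the address of the next instruction plus that signed displacement. In demo2.o the displacement is still a placeholder of 00 00 00 00 (which is why the calls appear to point at the following instruction); the linker fills in the real values. The sketch below redoes that arithmetic; rip_relative_target is a hypothetical helper name, and it only covers instructions whose last four bytes are the displacement.

```c
#include <stdint.h>

/* Target of an x86-64 instruction that ends in a 32-bit RIP-relative
 * displacement: address of the NEXT instruction + signed displacement. */
uint64_t rip_relative_target(uint64_t instruction_address,
                             unsigned instruction_length,
                             const unsigned char displacement_bytes[4]) {
    /* The displacement is stored least significant byte first. */
    uint32_t raw = (uint32_t)displacement_bytes[0]
                 | (uint32_t)displacement_bytes[1] << 8
                 | (uint32_t)displacement_bytes[2] << 16
                 | (uint32_t)displacement_bytes[3] << 24;
    int32_t displacement = (int32_t)raw;   /* reinterpret as signed */
    return instruction_address + instruction_length
         + (uint64_t)(int64_t)displacement;
}
```

For the call at 0x100003f6e (5 bytes long, displacement cd ff ff ff, that is -51) this gives 0x100003f73 - 0x33 = 0x100003f40, the address of _add.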

6 If the goal is understanding, can we do without tools like objdump?

If you open demo2 with a hex editor, you can see its bytes and their ASCII rendering directly.
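What a hex editor (or objdump -s) shows is simple enough to reproduce. Here is a minimal sketch of a formatter for one dump line, loosely modeled on the objdump -s layout seen earlier; format_dump_line is a hypothetical name, the offset is limited to four hex digits, and short lines are not padded to align the ASCII column.

```c
#include <stddef.h>

static const char hex_digits[] = "0123456789abcdef";

/* Writes one dump line into out, which must hold at least 64 bytes:
 * a 4-digit hex offset, the bytes in hex grouped by four, two spaces,
 * then the same bytes as ASCII with '.' for non-printable values.
 * count must be between 1 and 16. */
void format_dump_line(char *out, unsigned offset,
                      const unsigned char *bytes, size_t count) {
    size_t pos = 0;
    out[pos++] = ' ';
    for (int shift = 12; shift >= 0; shift -= 4)
        out[pos++] = hex_digits[(offset >> shift) & 0xf];
    for (size_t i = 0; i < count; i++) {
        if (i % 4 == 0)
            out[pos++] = ' ';               /* separator before each group */
        out[pos++] = hex_digits[bytes[i] >> 4];
        out[pos++] = hex_digits[bytes[i] & 0xf];
    }
    out[pos++] = ' ';
    out[pos++] = ' ';
    for (size_t i = 0; i < count; i++)
        out[pos++] = (bytes[i] >= 0x20 && bytes[i] < 0x7f) ? (char)bytes[i] : '.';
    out[pos] = '\0';
}
```

Feeding it the first 16 bytes of the add function reproduces the first line of the __text dump of mylib.o shown earlier.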

objdump is not magic, it just understands the mach-o file format and interprets it according to the specifications:

https://github.com/aidansteele/osx-abi-macho-file-format-reference

If you want to know more, you can refer to this article: https://yurylapitsky.com/exploring_mach-o_binaries