This series is a front-end developer's tutorial on building a custom JavaScript runtime on top of V8. This part first reviews the prerequisite knowledge.
1 source file
// demo.c
#include <stdio.h>

int main() {
    printf("hello\n");
}
- example:
  - C/C++: .c, .cpp
  - Rust: .rs
  - Go: .go
2 object file
After the compilation step, the source file is compiled into an object file. The object file contains binary machine code and some meta-information:
gcc -c demo.c
-c tells gcc to compile only, without linking. Each source file then produces a corresponding binary object file. The content of the object file is already machine code, but it is not necessarily executable yet.
On some operating systems, object files and executable files use the same format, such as ELF on Linux. On Windows, object files use the COFF format, while executable files use the PE format.
You can use objdump to parse the generated demo.o object file. Examples will be given later.
2.1 header file
A short aside on one more C/C++ concept: the header file. For example, the stdio.h in the #include <stdio.h> of the earlier demo.c is a header file.
When we compile C/C++ code that uses something external, such as printf in the demo, which is an interface of the C standard library, we need to include the corresponding header file. The header file tells the compiler what the declaration of the printf function used in demo.c looks like, so that the compiler can generate the object file correctly.
However, in the symbol table of the compiled object file, printf is still undefined at this point; resolving it is the job of the later link step. Using the demo2.o example from the static library chapter below, we can see this directly with the nm tool:
nm: list the symbols (the symbol table) of object files
$ nm demo2.o
                 U _add
0000000000000000 T _main
                 U _printf
The U in front of the two symbols _add and _printf means that they are not defined yet. If you look at the demo2 executable file produced by the final link step, the result is:
$ nm demo2
0000000100000000 T __mh_execute_header
0000000100003f80 T _add
0000000100003f50 T _main
                 U _printf
The _add symbol is now defined, but _printf is still undefined. demo2 nevertheless runs normally, because _printf is dynamically linked: at run time, the libSystem library is dynamically linked to our demo2.
2.2 stdio.h is not mysterious
As mentioned earlier, we include the stdio.h header file only to tell the compiler that we are going to use a function called printf and what its declaration looks like, so that it has enough information to compile our code. So if we know how printf is declared, we don't need to include stdio.h at all:
1 Write your own header file
// mystdio.h
int printf(const char *format, ...);
2 Change the code to include our own header file
// #include <stdio.h>
// #include <...> is used to include a header file from the system search path
// #include "..." is used to include a header file from this project
// https://gcc.gnu.org/onlinedocs/cpp/Include-Syntax.html
#include "mystdio.h"
3 Compile, link and execute
$ gcc -o demo demo.c # generate demo
$ ./demo
hello
2.3 object file has nothing to do with source code language
There is an interesting point that some people may not realize: the object file compiled from the earlier C demo no longer has anything to do with C. The C code has already become machine code in the object file. This means that we can link together object files compiled from different languages, such as C, Rust and Go. For example:
- Call the object file generated by c in rust
- Call rust compiled object file in c
Of course, in practice it is not that simple. Binaries compiled from different high-level languages will most likely not understand each other (further reading: "Application binary interface"). So to actually do this, you may need some additional glue work, for example with https://www.swig.org/
3 executable file
The job of the linker is to combine a set of object files or archive files, relocate their data, bind symbol references, and generate an executable file.
The link step links the object file into an executable file:
gcc -o demo demo.o
Or use a single command to complete both compilation and linking:
gcc -o demo demo.c
In the link step demonstrated here, gcc is used rather than the ld tool directly, because gcc supplies some default configuration that simplifies our work. You can try using ld directly, but there is a high probability that something will go wrong:
$ ld demo.o
ld: Undefined symbols:
  _printf, referenced from:
      _main in demo2.o
This is because demo.c uses the printf function from stdio, but when ld tries to generate the executable file, the library that provides stdio is not linked together with our demo.o object file. As a result, the symbol _printf is still undefined when linking finishes.
Adding the -lSystem parameter asks ld to link that library into the generated executable file (on macOS, stdio is provided by the libSystem library):
$ ld demo.o -lSystem
ld: library 'System' not found
Wrong again: ld doesn't know where to find the System library, so we have to provide the search path:
$ ld demo.o -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem
4 static library
static link library
- suffix
  - under Linux/macOS: .a (archive)
  - under Windows: .lib (library)
Here we explain the archive. As the name indicates, a .a static link library file is actually a group of object files packaged together. You can try it out with the following demo:
1 Write library source code:
// mylib.h
int add(int x, int y);

// mylib.c
int add(int x, int y) {
    return x + y;
}
2 Build a .a static link library
gcc -c mylib.c # Compile an object file called mylib.o
ar cr libmylib.a mylib.o
# Use the ar (archive) tool with the options c (create the archive) and r (insert or replace members)
# to create an archive file named libmylib.a
# which contains the object file mylib.o
3 Write an application that uses the mylib library
// demo2.c
#include <stdio.h>
#include "mylib.h"

int main() {
    printf("%d\n", add(1, 2));
}
4 Compile and link
gcc -c demo2.c # Generate demo2.o
gcc -o demo2 demo2.o -L. -lmylib
# -L<xxx> tells the linker where to search for library files. Because our libmylib.a is not in the standard search path, we need to specify it explicitly.
# -lmylib tells the linker to link the library mylib into the output.
5 Run the demo2 executable file generated by linking
$ ./demo2
3
dynamic link library
- suffix
  - under Linux: .so (shared object)
  - under macOS: .dylib (dynamic library)
  - under Windows: .dll (dynamic-link library)
Unlike with a static library, depending on a shared library does not cause the linker to merge the contents of the library into the final executable file. The linker only records some meta-information, which the operating system uses at run time to find the shared libraries we depend on and link them with our executable file.
Here we explain .so. Unlike .a, a shared object is not a simple collection of object files; it is itself the output of the linker linking multiple object files. We continue from the static library chapter to demonstrate the use of a shared library:
1 The library source code and demo source code are the same, no adjustment is required.
2 Build a .so shared library
gcc -shared -o libmylib.so mylib.o # Build a shared library called libmylib.so from mylib.o
3 Compile and link
gcc -o demo3 demo2.o -L. -lmylib # Generate an executable file named demo3
# Note that apart from the name of the generated executable file changing from demo2 to demo3, the command is exactly the same as for the static library!
4 Run the demo3 executable file generated by linking
$ ./demo3
3
The third step may be confusing. Why does the linker use the shared library instead of the static library this time, given exactly the same command?
Because the linker prefers shared libraries by default: if it finds the .so shared library, it uses that instead of the .a static library.
How can we confirm that demo2 and demo3 are indeed linked to the mylib library in different ways? One way is to compare file sizes. Generally speaking, a dynamically linked output is smaller than a statically linked one. But mylib in our example is too small for the difference to be obvious. Instead, you can use tools to list the libraries that an executable file depends on. On macOS you can use otool:
$ otool -L demo2
demo2:
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.0.0)
$ otool -L demo3
demo3:
        libmylib.so (compatibility version 0.0.0, current version 0.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1336.0.0)
You can see that demo2 depends on libSystem.B.dylib. As mentioned before, this library provides printf, and here you can see that it is dynamically linked to our demo2 as well.
demo3 additionally depends on the shared library libmylib.so. demo2 has no such entry, because the content of libmylib.a was merged directly into the demo2 file.
In my own experience, if you can understand these build outputs at the lowest level, you will gain a more intuitive and deeper understanding of the subject. It doesn't need to go very deep; an intuitive impression is enough. We can use the objdump tool to inspect these files.
First, let’s look at the contents of a relatively simple mylib.o object file:
1 Compile mylib.o
gcc -c -fno-asynchronous-unwind-tables mylib.c
# -fno-asynchronous-unwind-tables keeps things we don't care about out of the object file, making our example as simple as possible
2 Use objdump to parse mylib.o
$ objdump -s -d mylib.o
# -s displays the contents of all non-empty sections
# -d disassembles the machine code in the file

mylib.o: file format mach-o 64-bit x86-64
# mach-o is the object file format used by macOS; it is also used for executable files and dynamic link libraries.
# 64-bit x86-64: omitted
# This information is obtained from metadata in our object file

Contents of section __TEXT,__text:
# Compiled machine code usually lives in the text segment
 0000 554889e5 897dfc89 75f88b45 fc0345f8  UH...}..u..E..E.
 0010 5dc3                                 ].

Disassembly of section __TEXT,__text:
# As mentioned earlier, compiled machine code usually lives in a text segment. Here is the disassembly of that piece of machine code.

0000000000000000 <_add>:
# Compiled machine code of the add function in our mylib
# Address: machine code, followed by the corresponding assembly code
       0: 55                    pushq   %rbp
       1: 48 89 e5              movq    %rsp, %rbp
       4: 89 7d fc              movl    %edi, -4(%rbp)
       7: 89 75 f8              movl    %esi, -8(%rbp)
       a: 8b 45 fc              movl    -4(%rbp), %eax
       d: 03 45 f8              addl    -8(%rbp), %eax
      10: 5d                    popq    %rbp
      11: c3                    retq
3 Compile demo2.o
$ gcc -c -fno-asynchronous-unwind-tables demo2.c
4 Use objdump to parse demo2.o
$ objdump -s -d demo2.o

demo2.o: file format mach-o 64-bit x86-64

Contents of section __TEXT,__text:
 0000 554889e5 bf010000 00be0200 0000e800  UH..............
 0010 00000089 c6488d3d 0b000000 b000e800  .....H.=........
 0020 00000031 c05dc3                      ...1.].
Contents of section __TEXT,__cstring:
# String literals in the code
 0027 25640a00                             %d..
# 25: %
# 64: d
# 0a: \n
# 00: \0
# Remember your C language class? A string in C ends with \0; reading \0 tells you the string has ended.

Disassembly of section __TEXT,__text:

0000000000000000 <_main>:
# The machine code compiled from our main function
       0: 55                    pushq   %rbp
       1: 48 89 e5              movq    %rsp, %rbp
       4: bf 01 00 00 00        movl    $1, %edi
       9: be 02 00 00 00        movl    $2, %esi
       e: e8 00 00 00 00        callq   0x13 <_main+0x13>
      13: 89 c6                 movl    %eax, %esi
      15: 48 8d 3d 0b 00 00 00  leaq    11(%rip), %rdi  ## 0x27 <_main+0x27>
      1c: b0 00                 movb    $0, %al
      1e: e8 00 00 00 00        callq   0x23 <_main+0x23>
      23: 31 c0                 xorl    %eax, %eax
      25: 5d                    popq    %rbp
      26: c3                    retq
5 Finally, take a look at the demo2 executable file after connection
$ objdump -s -d demo2

demo2: file format mach-o 64-bit x86-64

Contents of section __TEXT,__text:
# add function
 100003f40 554889e5 897dfc89 75f88b45 fc0345f8  UH...}..u..E..E.
 100003f50 5dc30000 00000000 00000000 00000000  ]...............
# main function
 100003f60 554889e5 bf010000 00be0200 0000e8cd  UH..............
 100003f70 ffffff89 c6488d3d 11000000 b000e804  .....H.=........
 100003f80 00000031 c05dc3                      ...1.].
Contents of section __TEXT,__stubs:
 100003f87 ff257300 0000                        .%s...
Contents of section __TEXT,__cstring:
 100003f8d 25640a00                             %d..
Contents of section __TEXT,__unwind_info:
# Irrelevant, omitted
 100003f94 01000000 1c000000 00000000 1c000000  ................
 100003fa4 00000000 1c000000 02000000 403f0000  ............@?..
 100003fb4 40000000 40000000 873f0000 00000000  @...@....?......
 100003fc4 40000000 00000000 00000000 00000000  @...............
 100003fd4 03000000 0c000200 14000200 00000000  ................
 100003fe4 20000001 00000000 00000001 00000000   ...............
Contents of section __DATA_CONST,__got:
# Irrelevant, omitted
 100004000 00000000 00000080                    ........

Disassembly of section __TEXT,__text:

# Remember the address of the add function: 0x100003f40
0000000100003f40 <_add>:
100003f40: 55                    pushq   %rbp
100003f41: 48 89 e5              movq    %rsp, %rbp
100003f44: 89 7d fc              movl    %edi, -4(%rbp)
100003f47: 89 75 f8              movl    %esi, -8(%rbp)
100003f4a: 8b 45 fc              movl    -4(%rbp), %eax
100003f4d: 03 45 f8              addl    -8(%rbp), %eax
100003f50: 5d                    popq    %rbp
100003f51: c3                    retq
                ...
100003f5e: 00 00                 addb    %al, (%rax)

0000000100003f60 <_main>:
100003f60: 55                    pushq   %rbp
100003f61: 48 89 e5              movq    %rsp, %rbp
100003f64: bf 01 00 00 00        movl    $1, %edi
100003f69: be 02 00 00 00        movl    $2, %esi
# Call the add function; the address of the add function noted earlier is 0x100003f40
100003f6e: e8 cd ff ff ff        callq   0x100003f40 <_add>
100003f73: 89 c6                 movl    %eax, %esi
100003f75: 48 8d 3d 11 00 00 00  leaq    17(%rip), %rdi  ## 0x100003f8d <_printf+0x100003f8d>
100003f7c: b0 00                 movb    $0, %al
# Call printf; the address 0x100003f87 appears further below
100003f7e: e8 04 00 00 00        callq   0x100003f87 <_printf+0x100003f87>
100003f83: 31 c0                 xorl    %eax, %eax
100003f85: 5d                    popq    %rbp
100003f86: c3                    retq

Disassembly of section __TEXT,__stubs:

0000000100003f87 <__stubs>:
# Here the situation is different, because libSystem, where printf lives, is dynamically linked.
100003f87: ff 25 73 00 00 00     jmpq    *115(%rip)  ## 0x100004000 <_printf+0x100004000>
6 If the goal is understanding, can we do without tools like objdump?
If you open demo2 with a hex editor, you can directly see its contents alongside their ASCII rendering:
objdump is not magic, it just understands the mach-o file format and interprets it according to the specifications:
https://github.com/aidansteele/osx-abi-macho-file-format-reference
If you want to know more, you can refer to this article: https://yurylapitsky.com/exploring_mach-o_binaries