scc

simple c99 compiler
git clone git://git.simple-cc.org/scc

commit 36c01fefa0272b5667c2177db464183f342b5f86
parent 248f4c963157080d7b63ef181d66b89b51f0dde3
Author: Roberto E. Vargas Caballero <k0ga@shike2.net>
Date:   Thu, 19 Mar 2026 10:54:37 +0100

doc: Update documentation about IR

Diffstat:
M doc/Makefile | 1 +
A doc/man7/Makefile | 24 ++++++++++++++++++++++++
A doc/man7/scc-ir.man | 676 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M scripts/dirs | 1 +
M scripts/proto.all | 4 ++++
D src/cmd/scc-cc/cc1/ir.md | 443 -------------------------------------------------------------------------------
6 files changed, 706 insertions(+), 443 deletions(-)

diff --git a/doc/Makefile b/doc/Makefile
@@ -3,6 +3,7 @@ DIRS =\
 man1\
 man3\
+ man7\
 PROJECTDIR = ..
 include $(PROJECTDIR)/scripts/rules.mk
diff --git a/doc/man7/Makefile b/doc/man7/Makefile
@@ -0,0 +1,24 @@
+.POSIX:
+
+PROJECTDIR = ../..
+include $(PROJECTDIR)/scripts/rules.mk
+
+PAGES =\
+ scc-ir.7\
+
+.SUFFIXES: .man
+
+NODEP = 1
+
+all: $(PAGES)
+
+.man.7:
+ trap 'rm -f $tmp' EXIT;\
+ trap 'exit 1' HUP INT TERM;\
+ tmp=$$$$.tmp;\
+ sed '/\.TH/ s/VERSION/$(VERSION)/' $< > $$tmp &&\
+ mv $$tmp $@
+ cp $@ $(MANDIR)/man7/
+
+clean:
+ rm -f *.7
diff --git a/doc/man7/scc-ir.man b/doc/man7/scc-ir.man
@@ -0,0 +1,676 @@
+.TH SCC-IR 7 scc\-VERSION
+.SH NAME
+scc-ir \- scc intermediate representation
+.SH DESCRIPTION
+The scc intermediate representation (IR) is a text-based format
+used to communicate between the compiler frontend
+.RB ( cc1 )
+and the compiler backend
+.RB ( cc2 ).
+It is designed to be simple and easily parseable:
+all types and operators are represented by one or two characters,
+so parsing tables can be used to process it.
+.PP
+The language is composed of lines representing statements.
+Each line is composed of tab-separated fields.
+Declarations begin in column 0;
+expressions and control flow statements begin with a tab character.
+When the frontend detects an error,
+it closes the output stream.
+.SH TYPES
+Types are represented with single characters:
+.PP
+.TS
+l l.
+B bool
+C signed 8-bit integer
+K unsigned 8-bit integer
+I signed 16-bit integer
+N unsigned 16-bit integer
+W signed 32-bit integer
+Z unsigned 32-bit integer
+Q signed 64-bit integer
+O unsigned 64-bit integer
+J float
+D double
+H long double
+0 void
+P pointer
+F function
+E function with ellipsis
+V array (vector)
+U union
+S struct
+1 \fI__builtin_va_arg\fR
+.TE
+.PP
+Aggregate and composed types
+.RB ( S ,
+.BR U ,
+.BR V )
+are followed by a numeric identifier
+to distinguish between multiple types of the same kind:
+.BR S3 ,
+.BR V5 ,
+.BR U2 .
+.PP
+The sizes in the table above are nominal.
+Actual sizes depend on the target architecture.
+For example, on amd64-sysv,
+.B int
+is 32-bit and uses
+.BR W ,
+while on z80-scc it is 16-bit and uses
+.BR I .
+.SH STORAGE CLASSES
+Storage classes are represented with uppercase letters:
+.PP
+.TS
+l l.
+A automatic (local variable)
+R register
+G global (public, defined in this module)
+X extern (declared in another module)
+Y private (file-scope static)
+T local (function-scope static)
+M struct/union member
+L label
+.TE
+.PP
+A variable name in the IR is composed of a storage class letter
+followed by a numeric identifier, for example:
+.BR A1 ,
+.BR G2 ,
+.BR T3 ,
+.BR L4 .
+.SH DECLARATIONS
+.SS Variable declarations
+A variable declaration consists of a variable name,
+its type, and a quoted source name:
+.PP
+.RS
+.I var
+.B \et
+.I type
+.B \et "
+.I name
+.RE
+.PP
+For example:
+.PP
+.RS
+.nf
+A4 W "i
+G2 W "g
+X3 P "ptr
+.fi
+.RE
+.SS Function declarations
+Function declarations include the return type
+and use
+.B F
+for the function type
+.RB ( E
+if the function has an ellipsis parameter):
+.PP
+.RS
+.I var
+.B \et
+.I return-type
+.B \et F \et "
+.I name
+.RE
+.PP
+For example:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+G2 W F "main
+X3 W E "printf
+T4 0 F "helper
+.fi
+.RE
+.PP
+.B G
+marks a public function,
+.B T
+a file-scope static function, and
+.B X
+an extern declaration.
+.SS Function definitions
+A function definition starts with the function declaration,
+followed by
+.B {
+on its own line.
+Function parameters are declared inside the body.
+A
+.B \e
+(backslash) on its own line separates
+parameters from local variable declarations.
+The body ends with
+.BR } .
+.PP
+For example, the C source:
+.PP
+.RS
+.nf
+int func(int a, int b) {
+ int c;
+ return a + b;
+}
+.fi
+.RE
+.PP
+generates:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+G2 W F "func
+{
+A3 W "a
+A4 W "b
+\e
+A6 W "c
+ h A3 A4 +W
+}
+.fi
+.RE
+.SS Struct and union declarations
+A struct or union type declaration starts with a header line
+containing the type letter and identifier,
+a quoted tag name,
+a hex-encoded size and a hex-encoded alignment:
+.PP
+.RS
+.I type-id
+.B \et "
+.I tag
+.B \et #
+.IR size-letter size
+.B \et #
+.IR size-letter align
+.RE
+.PP
+Member declarations follow, each including an offset field:
+.PP
+.RS
+.I member-var
+.B \et
+.I type
+.B \et "
+.I name
+.B \et #
+.IR size-letter offset
+.RE
+.PP
+For example, the C source:
+.PP
+.RS
+.nf
+struct point {
+ int x;
+ int y;
+};
+struct point p;
+.fi
+.RE
+.PP
+generates (on amd64-sysv):
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+S3 "point #O8 #O4
+M4 W "x #O0
+M5 W "y #O4
+G6 S3 "p
+.fi
+.RE
+.PP
+Unions use
+.B U
+instead of
+.BR S .
+Members of a union typically share offset 0.
+.SS Array type declarations
+Array types are declared with
+.BR V ,
+the element type,
+and the number of elements in hexadecimal:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+V5 W #OA
+.fi
+.RE
+.PP
+This declares array type V5 with element type
+.B W
+(signed 32-bit integer) and 0xA (10) elements.
+Array variable declarations reference the array type:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+A4 V5 "a
+.fi
+.RE
+.SS Enum declarations
+Enumerations are not emitted as types.
+Enum variables are emitted with their underlying integer type
+(typically
+.BR W ):
+.PP
+.RS
+.nf
+G7 W "c
+.fi
+.RE
+.SH INITIALIZERS
+When a variable has an initializer,
+the declaration line ends without a newline and is followed by
+.B (
+on the same line.
+The initializer expressions follow,
+one per line,
+and the initializer is closed with
+.B )
+on its own line.
+.PP
+For example:
+.PP
+.RS
+.nf
+int g = 42;
+.fi
+.RE
+.PP
+generates:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+G2 W "g (
+ #W2A
+)
+.fi
+.RE
+.PP
+Array and struct initializers list each element:
+.PP
+.RS
+.nf
+int a[3] = {1, 2, 3};
+.fi
+.RE
+.PP
+generates:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+V3 W #O3
+G2 V3 "a (
+ #W1
+ #W2
+ #W3
+)
+.fi
+.RE
+.PP
+String initializers use a quoted form for printable runs
+and individual byte constants for non-printable characters:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n
+ #"hello
+ #C0
+.fi
+.RE
+.SH EXPRESSIONS
+Expressions are emitted in reverse Polish notation (RPN),
+with tab-separated tokens on a single line.
+Every operator is followed by a type letter.
+.SS Constants
+Constants are introduced with
+.BR # ,
+followed by a type letter and a hexadecimal value:
+.PP
+.RS
+.nf
+#W2A
+.fi
+.RE
+.PP
+This represents the integer constant 42 (0x2A) of type
+.BR W .
+.PP
+Floating-point constants are emitted as the hexadecimal encoding
+of their IEEE 754 representation:
+.PP
+.RS
+.nf
+#J3FC00000
+#D4004000000000000
+.fi
+.RE
+.PP
+These represent float 1.5 and double 2.5, respectively.
+.PP
+String constants are emitted using
+.B #"
+for printable character runs:
+.PP
+.RS
+.nf
+#"hello
+.fi
+.RE
+.SS Arithmetic operators
+.TS
+l l.
++ addition
+\- subtraction
+* multiplication
+/ division
+% modulo
+l left shift
+r right shift
+.TE
+.SS Comparison operators
+.TS
+l l.
+< less than
+> greater than
+[ less or equal
+] greater or equal
+\&= equal
+! not equal
+.TE
+.SS Bitwise operators
+.TS
+l l.
+& bitwise and
+| bitwise or
+^ bitwise xor
+~ bitwise complement (unary)
+.TE
+.SS Logical operators
+.TS
+l l.
+a logical and (short-circuit)
+o logical or (short-circuit)
+n logical negation
+.TE
+.SS Unary operators
+.TS
+l l.
+\&_ arithmetic negation
+~ bitwise complement
+n logical negation
+\&' address-of
+@ pointer dereference
+.TE
+.SS Assignment
+.TS
+l l.
+: assignment
+:* multiply and assign
+:/ divide and assign
+:% modulo and assign
+:+ add and assign
+:\- subtract and assign
+:l left shift and assign
+:r right shift and assign
+:& bitwise and and assign
+:^ bitwise xor and assign
+:| bitwise or and assign
+:i post-increment
+:d post-decrement
+.TE
+.SS Other operators
+.TS
+l l.
+, comma
+? ternary (conditional)
+\&. struct/union field access
+g type cast (followed by target type letter)
+.TE
+.SS Function calls
+Function calls use
+.B p
+to push each argument,
+.B c
+for the call itself,
+and
+.B z
+for calls to variadic functions.
+Each is followed by the type of the result:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n 32n 40n 48n
+ X2 Y9 'P pP #W2A pW zW
+.fi
+.RE
+.PP
+This pushes a pointer argument
+.RB ( pP ),
+pushes an integer argument
+.RB ( pW ),
+and calls a variadic function returning
+.BR W
+.RB ( zW ).
+.SS Builtin functions
+Builtin function calls use
+.B m
+as the operator,
+preceded by a quoted builtin name:
+.PP
+.RS
+.nf
+"__builtin_va_arg m
+.fi
+.RE
+.SS Expression example
+The C expression:
+.PP
+.RS
+.nf
+i = j + 2 * 3;
+.fi
+.RE
+.PP
+generates (on amd64-sysv):
+.PP
+.RS
+.nf
+.ta 8n 16n 24n 32n 40n
+ A4 A5 #W6 +W :W
+.fi
+.RE
+.PP
+Note that constant folding has reduced
+.I 2*3
+to
+.IR 6 .
+The expression is in RPN:
+push A4, push A5, push #W6, add (yielding W), assign (yielding W).
+.SS Type casts
+Casts are emitted as the operator
+.B g
+followed by the target type letter.
+A cast to
+.B void
+is not emitted.
+For example:
+.PP
+.RS
+.nf
+j = (long)i;
+.fi
+.RE
+.PP
+generates (on amd64-sysv):
+.PP
+.RS
+.nf
+.ta 8n 16n 24n 32n
+ A5 A4 gQ :Q
+.fi
+.RE
+.SH STATEMENTS
+.SS Labels
+Labels begin in column 0 and consist of
+.B L
+followed by a numeric identifier:
+.PP
+.RS
+.nf
+L3
+.fi
+.RE
+.SS Unconditional jumps
+An unconditional jump uses
+.B j
+followed by a label:
+.PP
+.RS
+.nf
+.ta 8n 16n
+ j L3
+.fi
+.RE
+.SS Conditional branches
+A conditional branch uses
+.BR y ,
+followed by a label.
+The expression to evaluate follows on the next line.
+If the expression evaluates to true (non-zero), the branch is taken:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n 32n
+ y L5 A4 #W5 <W
+.fi
+.RE
+.PP
+Note that the frontend negates the condition:
+the C code
+.I "if (i > 5)"
+is emitted as a branch on
+.IR "i <= 5" ,
+jumping past the then-block when the original condition is false.
+.SS Return
+The return statement uses
+.BR h .
+If the function returns a value,
+the expression follows on the same line:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n 32n
+ h A3 A4 +W
+.fi
+.RE
+.PP
+A void return is emitted as
+.B h
+alone, followed by a blank expression line.
+.SS Loops
+Two markers indicate loop boundaries to the backend:
+.PP
+.TS
+l l.
+b beginning of loop body
+e end of loop body
+.TE
+.PP
+For example, a
+.B while
+loop:
+.PP
+.RS
+.nf
+while (i < 10) { ++i; }
+.fi
+.RE
+.PP
+generates:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n 32n
+ j L5
+L4
+ b
+ A4 #W1 :+W
+L5
+ e
+ y L4 A4 #WA <W
+L6
+.fi
+.RE
+.SS Switch statements
+A switch statement is bracketed by
+.B s
+(begin) and
+.B t
+(end).
+The
+.B s
+marker is followed by the switch expression.
+Case entries are emitted with
+.BR v ,
+and the default entry with
+.BR f .
+The
+.B t
+marker takes the label where execution continues after the switch.
+.PP
+For example:
+.PP
+.RS
+.nf
+switch (n+1) {
+case 1:
+case 2:
+case 3:
+default:
+ ++n;
+}
+.fi
+.RE
+.PP
+generates:
+.PP
+.RS
+.nf
+.ta 8n 16n 24n 32n
+ s A3 #W1 +W
+ v L6 #W1
+L6
+ v L7 #W2
+L7
+ v L8 #W3
+L8
+ f L9
+L9
+ A3 #W1 :+W
+ t L5
+L5
+.fi
+.RE
+.PP
+Each
+.B v
+entry is followed by a label and a constant value.
+The
+.B f
+(default) entry is followed by a label only.
+.SH SEE ALSO
+.BR scc-cc (1)
diff --git a/scripts/dirs b/scripts/dirs
@@ -16,3 +16,4 @@ lib/scc/amd64-dragonfly
 lib/scc/amd64-darwin
 share/man/man1
 share/man/man3
+share/man/man7
diff --git a/scripts/proto.all b/scripts/proto.all
@@ -211,11 +211,13 @@ d 755 share/man/man1
 f 644 share/man/man1/scc-addr2line.1
 f 644 share/man/man1/scc-ar.1
 f 644 share/man/man1/scc-cc.1
+f 644 share/man/man1/scc-cpp.1
 f 644 share/man/man1/scc-nm.1
 f 644 share/man/man1/scc-objdump.1
 f 644 share/man/man1/scc-ranlib.1
 f 644 share/man/man1/scc-size.1
 f 644 share/man/man1/scc-strip.1
+f 644 share/man/man1/scc.1
 d 755 share/man/man3
 f 644 share/man/man3/asctime.3
 f 644 share/man/man3/clock.3
@@ -258,3 +260,5 @@ f 644 share/man/man3/strxfrm.3
 f 644 share/man/man3/time.3
 f 644 share/man/man3/time.h.3
 f 644 share/man/man3/wchar.h.3
+d 755 share/man/man7
+f 644 share/man/man7/scc-ir.7
diff --git a/src/cmd/scc-cc/cc1/ir.md b/src/cmd/scc-cc/cc1/ir.md
@@ -1,443 +0,0 @@
-# scc intermediate representation #
-
-The scc IR tries to be be a simple and easily parseable intermediate
-representation, and it makes it a bit terse and cryptic. The main
-characteristic of the IR is that all the types and operations are
-represented with only one letter, so parsing tables can be used
-to parse it.
-
-The language is composed of lines, representing statements.
-Each statement is composed of tab-separated fields.
-Declaration statements begin in column 0, expressions and
-control flow begin with a tabulator.
-When the frontend detects an error, it closes the output stream.
-
-## Types ##
-
-Types are represented with uppercase letters:
-
-* C -- signed 8-Bit integer
-* I -- signed 16-Bit integer
-* W -- signed 32-Bit integer
-* Q -- signed 64-Bit integer
-* K -- unsigned 8-Bit integer
-* N -- unsigned 16-Bit integer
-* Z -- unsigned 32-Bit integer
-* O -- unsigned 64-Bit integer
-* 0 -- void
-* P -- pointer
-* F -- function
-* V -- vector
-* U -- union
-* S -- struct
-* B -- bool
-* J -- float
-* D -- double
-* H -- long double
-
-This list has been built for the original Z80 backend, where 'int'
-has the same size as 'short'. Several types (S, F, V, U and others) need
-an identifier after the type letter for better differentiation
-between multiple structs, functions, vectors and unions (S1, V12 ...)
-naturally occuring in a C-program.
-
-## Storage classes ##
-
-The storage classes are represented using uppercase letters:
-
-* A -- automatic
-* R -- register
-* G -- public (global variable declared in the module)
-* X -- extern (global variable declared in another module)
-* Y -- private (variable in file-scope)
-* T -- local (static variable in function-scope)
-* M -- member (struct/union member)
-* L -- label
-
-## Declarations/definitions ##
-
-Variable names are composed of a storage class and an identifier
-(e.g. A1, R2, T3).
-Declarations and definitions are composed of a variable
-name, a type and the name of the variable:
-
- A1 I maxweight
- R2 C flag
- A3 S4 statstruct
-
-### Type declarations ###
-
-Some declarations (e.g. structs) involve the declaration of member
-variables.
-Struct members are declared normally after the type declaration in
-parentheses.
-
-For example the struct declaration
-
- struct foo {
- int i;
- long c;
- } var1;
-
-generates
-
- S2 foo (
- M3 I i
- M4 W c
- )
- G5 S2 var1
-
-## Functions ##
-
-A function prototype
-
- int printf(char *cmd, int flag, void *data);
-
-will generate a type declaration and a variable declaration
-
- F5 P I P
- X1 F5 printf
-
-The first line gives the function-type specification 'F' with
-an identifier '5' and subsequently lists the types of the
-function parameters.
-The second line declares the 'printf' function as a publicly
-scoped variable.
-
-Analogously, a statically declared function in file scope
-
- static int printf(char *cmd, int flag, void *data);
-
-generates
-
- F5 P I P
- T1 F5 printf
-
-Thus, the 'printf' variable went into local scope ('T').
-
-A '{' in the first column starts the body of the previously
-declared function:
-
- int printf(char *cmd, int flag, void *data) {}
-
-generates
-
- F5 P I P
- G1 F5 printf
- {
- A2 P cmd
- A3 I flag
- A4 P data
- -
- }
-
-Again, the frontend must ensure that '{' appears only after the
-declaration of a function. The character '-' marks the separation
-between parameters and local variables:
-
- int printf(register char *cmd, int flag, void *data) {int i;};
-
-generates
-
- F5 P I P
- G1 F5 printf
- {
- R2 P cmd
- A3 I flag
- A4 P data
- -
- A6 I i
- }
-
-### Expressions ###
-
-Expressions are emitted in reverse polish notation, simplifying
-parsing and converting into a tree representation.
-
-#### Operators ####
-
-Operators allowed in expressions are:
-
-* \+ -- addition
-* \- -- substraction
-* \* -- multiplication
-* % -- modulo
-* / -- division
-* l -- left shift
-* r -- right shift
-* < -- less than
-* > -- greather than
-* ] -- greather or equal than
-* [ -- less or equal than
-* = -- equal than
-* ! -- different than
-* & -- bitwise and
-* | -- bitwise or
-* ^ -- bitwise xor
-* ~ -- bitwise complement
-* : -- asignation
-* _ -- unary negation
-* c -- function call
-* p -- parameter
-* . -- field
-* , -- comma operator
-* ? -- ternary operator
-* ' -- take address
-* a -- logical shortcut and
-* o -- logical shortcut or
-* @ -- content of pointer
-
-Assignation has some suboperators:
-
-* :/ -- divide and assign
-* :% -- modulo and assign
-* :+ -- addition and assign
-* :- -- substraction and assign
-* :l -- left shift and assign
-* :r -- right shift and assign
-* :& -- bitwise and and assign
-* :^ -- bitwise xor and assign
-* :| -- bitwise or and assign
-* :i -- post increment
-* :d -- post decrement
-
-Every operator in an expression has a type descriptor.
-
-#### Constants ####
-
-Constants are introduced with the character '#'. For instance, 10 is
-translated to #IA (all constants are emitted in hexadecimal),
-where I indicates that it is an integer constant.
-Strings are a special case because they are represented with
-the " character.
-The constant "hello" is emitted as "68656C6C6F. For example
-
- int
- main(void)
- {
- int i, j;
-
- i = j+2*3;
- }
-
-generates
-
- F1
- G1 F1 main
- {
- -
- A2 I i
- A3 I j
- A2 A3 #I6 +I :I
- }
-
-Type casts are expressed with a tuple denoting the
-type conversion
-
- int
- main(void)
- {
- int i;
- long j;
-
- j = (long)i;
- }
-
-generates
-
- F1
- G1 F1 main
- {
- -
- A2 I i
- A3 W j
- A2 A3 WI :I
- }
-
-### Statements ###
-#### Jumps #####
-
-Jumps have the following form:
-
- j L# [expression]
-
-the optional expression field indicates some condition which
-must be satisfied to jump. Example:
-
- int
- main(void)
- {
- int i;
-
- goto label;
- label:
- i -= i;
- }
-
-generates
-
- F1
- G1 F1 main
- {
- -
- A2 I i
- j L3
- L3
- A2 A2 :-I
- }
-
-Another form of jump is the return statement, which uses the
-letter 'y' followed by a type identifier.
-Depending on the type, an optional expression follows.
-
- int
- main(void)
- {
- return 16;
- }
-
-generates
-
- F1
- G1 F1 main
- {
- -
- yI #I10
- }
-
-
-#### Loops ####
-
-There are two special characters that are used to indicate
-to the backend that the following statements are part of
-a loop body.
-
-* b -- beginning of loop
-* e -- end of loop
-
-#### Switch statement ####
-
-Switches are represented using a table, in which the labels
-where to jump for each case are indicated. Common cases are
-represented with 'v' and default with 'f'.
-The switch statement itself is represented with 's' followed
-by the label where the jump table is located, and the
-expression of the switch:
-
- int
- func(int n)
- {
- switch (n+1) {
- case 1:
- case 2:
- case 3:
- default:
- ++n;
- }
- }
-
-generates
-
- F2 I
- G1 F2 func
- {
- A1 I n
- -
- s L4 A1 #I1 +I
- L5
- L6
- L7
- L8
- A1 #I1 :+I
- j L3
- L4
- t #4
- v L7 #I3
- v L6 #I2
- v L5 #I1
- f L8
- L3
- }
-
-The beginning of the jump table is indicated by the the letter 't',
-followed by the number of cases (including default case) of the
-switch.
-
-## Resumen ##
-
-* C -- signed 8-Bit integer
-* I -- signed 16-Bit integer
-* W -- signed 32-Bit integer
-* O -- signed 64-Bit integer
-* M -- unsigned 8-Bit integer
-* N -- unsigned 16-Bit integer
-* Z -- unsigned 32-Bit integer
-* Q -- unsigned 64-Bit integer
-* 0 -- void
-* P -- pointer
-* F -- function
-* V -- vector
-* U -- union
-* S -- struct
-* B -- bool
-* J -- float
-* D -- double
-* H -- long double
-* A -- automatic
-* R -- register
-* G -- public (global variable declared in the module)
-* X -- extern (global variable declared in another module)
-* Y -- private (variable in file-scope)
-* T -- local (static variable in function-scope)
-* M -- member (struct/union member)
-* L -- label
-* { -- beginning of function body
-* } -- end of function body
-* \\ -- end of function parameters
-* \+ -- addition
-* \- -- substraction
-* \* -- multiplication
-* % -- modulo
-* / -- division
-* l -- left shift
-* r -- right shift
-* < -- less than
-* > -- greather than
-* ] -- greather or equal than
-* [ -- less or equal than
-* = -- equal than
-* ! -- different than
-* & -- bitwise and
-* | -- bitwise or
-* ^ -- bitwise xor
-* ~ -- bitwise complement
-* : -- asignation
-* _ -- unary negation
-* c -- function call
-* p -- parameter
-* . -- field
-* , -- comma operator
-* ? -- ternary operator
-* ' -- take address
-* a -- logical shortcut and
-* o -- logical shortcut or
-* @ -- content of pointer
-* :/ -- divide and assign
-* :% -- modulo and assign
-* :+ -- addition and assign
-* :- -- substraction and assign
-* :l -- left shift and assign
-* :r -- right shift and assign
-* :& -- bitwise and and assign
-* :^ -- bitwise xor and assign
-* :| -- bitwise or and assign
-* ;+ -- post increment
-* ;- -- post decrement
-* j -- jump
-* y -- return
-* b -- begin of loop
-* d -- end of loop
-* s -- switch statement
-* t -- switch table
-* v -- case entry in switch table
-* f -- default entry in switch table
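
Reader's note on the new scc-ir(7) page: its "Constants" section describes an integer constant as `#`, a type letter, then a hexadecimal value (for example `#W2A` for 42 of type W). The following is a minimal sketch of decoding such a token in C. It is not code from scc; the names `struct ir_const` and `parse_ir_const` are hypothetical, and the set of accepted type letters is an assumption taken from the integer rows of the TYPES table.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for a decoded IR integer constant. */
struct ir_const {
	char type;                 /* type letter, e.g. 'W' */
	unsigned long long value;  /* value decoded from the hex digits */
};

/*
 * Decode a token such as "#W2A".
 * Returns 0 on success and -1 on malformed input.
 * Only the integer type letters are accepted (assumption);
 * string (#") and floating-point constants are rejected.
 */
int
parse_ir_const(const char *tok, struct ir_const *out)
{
	static const char inttypes[] = "BCKINWZQO";
	const char *p;

	if (tok[0] != '#' || tok[1] == '\0' || !strchr(inttypes, tok[1]))
		return -1;

	/* Require at least one digit, and nothing but hex digits. */
	for (p = tok + 2; *p != '\0'; p++) {
		if (!isxdigit((unsigned char)*p))
			return -1;
	}
	if (p == tok + 2)
		return -1;

	errno = 0;
	out->value = strtoull(tok + 2, NULL, 16);
	if (errno != 0)
		return -1;	/* value does not fit in unsigned long long */
	out->type = tok[1];
	return 0;
}
```

Calling `parse_ir_const("#W2A", &c)` under these assumptions fills `c.type` with `'W'` and `c.value` with 42. A real backend would presumably dispatch on the type letter through a table, as the DESCRIPTION section suggests.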