shnell – source-to-source compiler enhancement

Jens Gustedt

0.0.1 Takeaway

We provide a tool to easily develop and prototype compiler and language enhancements that can be expressed by source-to-source transformation of C code. It uses #pragma directives to

  • mark parts or all of a code,
  • cut out that code,
  • pipe it through a transformation and
  • splice it back into the same place.

0.0.2 what do we get

Many small but convenient directives are already available, such as

0.0.3 how it is done

It is based on two major tools of POSIX systems:

  • shell programming (sh)
  • regular expression streaming (sed)

0.0.4 how it is used

  • The existing features can be used in daily programming without knowledge of these tools.
  • Filter programs that implement the directives can be written in any other programming language that suits the task:

perl, python, java, C itself …

  • Shnell features are easily applied by either

1 Introduction

1.0.1 a simple example

  • shnell performs source-to-source transformations
  • it identifies the code ranges to transform with the help of directives.
Example:
a declaration of an array A
double A[] = {
#pragma CMOD amend foreach ANIMAL:N = goose dog cat
   [${ANIMAL}] = 2*${N},
#pragma CMOD done
};

1.0.2 and its replacement

  • the #pragma directives ensure that the inner line is copied three times
  • "meta-variable" ${ANIMAL} iterates over "goose", "dog" and "cat"
  • ${N} holds the number of the current copy, starting at 0
double A[] = {
   [goose] = 2*0,
   [dog] = 2*1,
   [cat] = 2*2,
};
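
The expanded initializer only compiles if goose, dog and cat are integer constants, and the slides do not show how they are declared. A minimal sketch, assuming a hypothetical enumeration animal that declares them in the order of the directive's argument list:

```c
#include <assert.h>

/* Assumption: this enumeration is hypothetical; the original text does
   not show the declaration of goose, dog and cat. */
enum animal { goose, dog, cat, animal_count };

/* The array as it reads after the foreach directive has been applied. */
double A[] = {
   [goose] = 2*0,
   [dog] = 2*1,
   [cat] = 2*2,
};

/* An array with designated initializers is sized by its largest index. */
static_assert(sizeof A / sizeof A[0] == animal_count,
              "one element per animal");
```

With this declaration order, ${N} and the value of the enumeration constant coincide, so A[cat] holds 4.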

1.0.3 stringification

Similar code can use "stringified" parameters:

char const* names[] = {
#pragma CMOD amend foreach ANIMAL = goose dog cat
   [${ANIMAL}] = #${ANIMAL},
#pragma CMOD done
};

results in

char const* names[] = {
   [goose] = "goose",
   [dog] = "dog",
   [cat] = "cat",
};
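
A table generated this way can serve for lookups from a name back to its index. The following is a sketch only: the enumeration animal and the function animal_from_name are hypothetical names that do not appear in the original text.

```c
#include <assert.h>
#include <string.h>

/* Assumption: hypothetical enumeration providing the array indices. */
enum animal { goose, dog, cat, animal_count };

/* The array as it reads after the foreach directive has been applied. */
char const* names[] = {
   [goose] = "goose",
   [dog] = "dog",
   [cat] = "cat",
};

static_assert(sizeof names / sizeof names[0] == animal_count,
              "one name per animal");

/* Hypothetical helper: returns the index of the animal called name,
   or -1 if there is no such animal. */
int animal_from_name(char const* name) {
    for (int i = 0; i < animal_count; ++i)
        if (strcmp(names[i], name) == 0) return i;
    return -1;
}
```

Because names and indices are generated from the same argument list, they cannot get out of sync when an animal is added.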

1.0.4 general approach

In the general case such a directive is identified

  • with a tag, here CMOD (but that can be modified),
  • with a rule, here amend to say that the code up to the next done is changed,
  • with a directive, here foreach, that names the script that is to be run,
  • and with a list of arguments to the directive, here the five tokens "ANIMAL = goose dog cat"

2 Command line tools

2.0.1 shnell

The central tool is shnell, which reads a C file, processes it, and dumps the result to stdout. If you want to keep track of (or inspect) the intermediate code, this is your tool of choice. You can easily integrate it into a compilation chain by providing make rules that take your C sources and store the resulting code in some other file for further processing.

2.0.2 executable dialects

It would be tedious to keep track of the modified sources and to apply the same set of directives to each source file of a project. Therefore there are

shnl files
that group directives together and create something like dialects of the C language, see load
compiler prefixes
that can be used to apply such a dialect and all included directives to a source and to run your favorite compiler directly on the result. Currently, these prefixes are shneller, trade and posix.

2.0.3 example: using trade

For example, applying the TRADE policy to a source file toto.c during compilation is as simple as

trade gcc -Wall -c -O3 -march=native toto.c

That is, we prefix the compiler command line with the additional command "trade". This extracts the file name and the task to perform from the command line, performs the source-to-source rewriting, and then compiles the result to an object file toto.o.

2.0.4 example: using trade

Similarly, without -c,

trade gcc -Wall -O3 -march=native toto.c mind.o mund.o

would take the first source (toto.c), do all of the above, and link all the objects into an executable toto if possible.

2.0.5 example: using trade

If there are only .o files

trade gcc -Wall -O3 -march=native toto.o mind.o mund.o

only the linker phase is performed.

2.0.6 example: using trade

In addition to the -c command line flag, these tools also understand -E to perform all rewriting and preprocessing, -S to produce an assembler file, and -M to produce nothing but the side effects of compilation such as the header file (see export).

3 Directives, how the C programmer sees them

Directives are found by one or several scans that shnell performs on a C source.

3.1 amend, insert and load directives

Directives come in three different flavors:

3.1.1 amend

  • The scope is up to
    • the next (nesting) "done" directive or
    • the end of the source file.
  • The code is piped into the command.
  • The result is inserted in place.
  • The command also receives the argument list of the directive over some side channel.
  • amend directives typically modify the code
Example:
foreach as above
  • the code is repeated several times
  • each copy is modified by resolving the meta-variables

3.1.2 insert

  • Has no scope.
  • The command does not receive input.
  • It only receives the arguments.
  • The result is inserted in place.
  • Scan of the source file then continues directly after.
  • The inserted code has no influence on shnell for this scan.
  • An insert directive typically just puts some declarations or definitions in place.
Example:
enum directive
  • Defines an enumeration type and some functions that depend on it.
#pragma CMOD insert enum animal = goose dog cat

3.1.3 load

  • Inserts a set of directives that are found in shnl files.
    • shnl files condense complicated patterns
  • Scanning continues from the top of the inserted lines.
  • They may contain several amend directives that are not terminated.
  • Possibly multiple scans of the whole source file.
Example:
CONSTEXPR directive
  • Allows several nested "evaluations" of variables.
  • The results are expressions that are evaluated at compile time.

3.1.4 recursion

  • Nested occurrence of amend directives leads to finite recursion.
Example:
Two nested do directives:
double A[3][2] = {
#pragma CMOD amend do I = 3
    [${I}] = {
#pragma CMOD amend do J = 2
       [${J}] = 2*${I} + ${J},
#pragma CMOD done
    },
#pragma CMOD done
};
  1. start collecting the code SI immediately after the first do
  2. when collecting SI, the second do directive is encountered
  3. collection of the code SJ starting after that second do is started, until the first done is encountered. SJ now has:

    [${J}] = 2*${I} + ${J},
    
  4. SJ is fed into the do directive for variable J and value 2.
  5. The do directive replicates SJ twice and replaces all occurrences of ${J} by 0 and 1, respectively, to obtain a code TJ:

    [0] = 2*${I} + 0,
    [1] = 2*${I} + 1,
    
  6. TJ is inserted into SI instead of the directive, resulting in a replaced code RI.
  7. The scan for the first directive is continued until the second done is encountered, resulting in a code QI:

    [${I}] = {
       [0] = 2*${I} + 0,
       [1] = 2*${I} + 1,
    },
    
  8. QI is fed into the do directive for variable I and value 3.
  9. The do directive replicates QI three times and replaces all occurrences of ${I} by 0, 1, and 2, respectively, to obtain a code TI:

    [0] = {
       [0] = 2*0 + 0,
       [1] = 2*0 + 1,
    },
    [1] = {
       [0] = 2*1 + 0,
       [1] = 2*1 + 1,
    },
    [2] = {
       [0] = 2*2 + 0,
       [1] = 2*2 + 1,
    },
    
  10. TI is then inserted in place of the whole #pragma construct.

So after completion of the inner directive (steps 5 and 6), the code is as if we had written:

double A[3][2] = {
#pragma CMOD amend do I = 3
    [${I}] = {
       [0] = 2*${I} + 0,
       [1] = 2*${I} + 1,
    },
#pragma CMOD done
};

Only then is the outer directive applied, and the overall result after step 10 is

double A[3][2] = {
    [0] = {
       [0] = 2*0 + 0,
       [1] = 2*0 + 1,
    },
    [1] = {
       [0] = 2*1 + 0,
       [1] = 2*1 + 1,
    },
    [2] = {
       [0] = 2*2 + 0,
       [1] = 2*2 + 1,
    },
};
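
The final code is ordinary C, so the pattern it encodes can be checked with two plain loops. A minimal sketch; the function follows_pattern is a hypothetical name introduced here for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The array as it reads after both do directives have been applied. */
double A[3][2] = {
    [0] = {
       [0] = 2*0 + 0,
       [1] = 2*0 + 1,
    },
    [1] = {
       [0] = 2*1 + 0,
       [1] = 2*1 + 1,
    },
    [2] = {
       [0] = 2*2 + 0,
       [1] = 2*2 + 1,
    },
};

static_assert(sizeof A / sizeof A[0] == 3, "I takes the values 0, 1, 2");
static_assert(sizeof A[0] / sizeof A[0][0] == 2, "J takes the values 0, 1");

/* Hypothetical check: true if every element equals 2*I + J. */
bool follows_pattern(void) {
    for (size_t i = 0; i < 3; ++i)
        for (size_t j = 0; j < 2; ++j)
            if (A[i][j] != 2.0*i + j) return false;
    return true;
}
```

Unlike the loops, the expanded initializer is evaluated entirely at compile time, which is the point of doing the replication at the source level.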

3.2 arguments to directives

3.2.1 meta-variables

  • Several constructs use meta-variables of the form ${NAME}.
  • These are replaced in the processed source with their values.
  • The replacement can be modified with # and ## operators, similar to what happens in the C preprocessor.
    • # is "stringification" and
    • ## merges the value with the token to its left or to its right.
Examples:
do, foreach, env, bind, let, …
  • For a more detailed discussion have a look into "regVar".
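
The basic replacement of a meta-variable can be sketched in C. This is an illustration only: expand_meta is a hypothetical helper and not part of shnell, it works on raw text instead of tokens, and it does not handle the # and ## operators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of shnell: returns a newly allocated
   copy of code in which every occurrence of ${name} is replaced by
   value. The caller frees the result. Returns NULL if allocation fails. */
char* expand_meta(char const* code, char const* name, char const* value) {
    assert(code && name && value);
    size_t nlen = strlen(name);
    size_t vlen = strlen(value);
    size_t plen = nlen + 3;              /* length of "${" name "}" */
    char* pattern = malloc(plen + 1);
    if (!pattern) return NULL;
    memcpy(pattern, "${", 2);
    memcpy(pattern + 2, name, nlen);
    memcpy(pattern + 2 + nlen, "}", 2);  /* '}' plus the terminating 0 */

    /* First pass: count the occurrences to size the result. */
    size_t count = 0;
    for (char const* p = code; (p = strstr(p, pattern)); p += plen) ++count;

    char* result = malloc(strlen(code) - count*plen + count*vlen + 1);
    if (!result) { free(pattern); return NULL; }

    /* Second pass: copy the text between occurrences, inserting value. */
    char* out = result;
    char const* p = code;
    for (char const* q; (q = strstr(p, pattern)); p = q + plen) {
        memcpy(out, p, (size_t)(q - p));
        out += q - p;
        memcpy(out, value, vlen);
        out += vlen;
    }
    strcpy(out, p);                      /* remainder after the last match */
    free(pattern);
    return result;
}
```

For example, expand_meta("[${J}] = 2*${I} + ${J},", "J", "1") yields a copy that reads [1] = 2*${I} + 1, and leaves ${I} untouched for a later pass.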

3.3 amend.cfg

3.3.1 amend.cfg, a list of approved directives

  • shnell does not allow arbitrary code to be executed.
  • Directives and shnl files have to be approved.
  • The file amend.cfg shows all directives that are available.
  • To add a new directive to the tool box:
    • install it in a specific directory
    • add an entry to amend.cfg.

4 Directives, how the implementor sees them

4.0.1 sh(n)ell programming

  • A directive
    • receives the code that it has to treat on stdin and
    • sends the modified code to stdout.
  • shnell itself handles the surrounding tasks of
    • cutting the code out of context and
    • reinserting the result in place.

4.0.2 tokenization

When using the shnell executable on a given C code prog.c,

  • a tokenizer splits the program into tokens as defined by the C language.
  • All intermediate tools see isolated tokens
    • identifiers
    • keywords
    • numbers
    • punctuators
    • strings
    • comments
    • … and a lot of control characters and white space
  • no further lexical analysis required

4.0.3 detokenization

At the end, after all transformations have been applied

  • the tokenization is reverted
  • the original line and spacing structure reappears.
  • reader friendly:
    • keep or inspect intermediate steps of a transformation.
  • compiler friendly:
    • code is annotated with #line directives
    • trace errors back to the source file.
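
The effect of a #line directive can be observed through the standard macros __LINE__ and __FILE__. A small sketch, assuming a hypothetical excerpt of rewritten code that originates from line 42 of toto.c; the function names are made up for illustration, and the exact annotations that shnell emits are not shown in the original text.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical excerpt: the directive below tells the compiler that
   the line following it is line 42 of the file toto.c. */
#line 42 "toto.c"
int reported_line(void) { return __LINE__; }
int reported_from(char const* file) { return strcmp(__FILE__, file) == 0; }
static_assert(__LINE__ == 44, "numbering continues after the directive");
```

Compiler diagnostics use the same information, so a message about rewritten code points at the position in the original source file.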

4.0.4 Shell modules

The whole toolbox is itself constructed from reusable pieces:

  • regexp matching (match),
  • temporary files with garbage collection,
  • split and join,
  • hash tables and
  • … many more.

4.0.5 import.sh

This has to be explicitly sourced with a magic line:

SRC="$_" . "${0%%/${0##*/}}/import.sh"

Other shell modules can then be imported like this:

import arguments
import tmpd
import tokenize
import match

4.0.6 … with documentation

Terminate such an import and documentation section with the line

endPreamble $*

If the module is run by itself as an executable script with the option --html, it extracts the documentation for the module.

5 Terms

5.0.1 Copyright, license and distribution

  • Copyright © 2015-2020 Jens Gustedt
  • The shnell project is licensed under a standard MIT license
----------------------------------------------------------------------
Copyright © 2015-2020 Jens Gustedt

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
----------------------------------------------------------------------
  • This work is distributed at

http://shnell.gforge.inria.fr

  • The sources can be found at

https://gforge.inria.fr/projects/shnell

Author: Jens Gustedt

Created: 2020-02-26 Mi 11:24
