shnell – source-to-source compiler enhancement

Jens Gustedt

1 Overview

1.0.1 Takeaway

We provide a tool to easily develop and prototype compiler and language enhancements that can be expressed by source-to-source transformation of C code. It uses #pragma directives to

  • mark part or all of the code,
  • cut out that code,
  • pipe it through a transformation and
  • splice it back into the same place.

1.0.2 What we get

Many small but convenient directives are already available; several of them (for example foreach, do, enum and export) appear in the examples below.

1.0.3 How it is done

It is based on two major tools of POSIX systems:

  • shell programming (sh)
  • regular expression streaming (sed)

1.0.4 How it is used

  • The existing features can be used in daily programming without knowledge of these tools.
  • Filter programs that implement the directives can be written in any other programming language that suits the task: perl, python, java, C itself …

  • Shnell features are easily applied by either running shnell directly or prefixing the compiler command line (see "Command line tools").

2 Introduction

2.1 a simple example

2.1.1 Example: code unrolling …

  • shnell performs source-to-source transformations.
  • It identifies the code ranges to transform with the help of directives.
Example: a declaration of an array A
double A[] = {
#pragma CMOD amend foreach ANIMAL:N = goose dog cat
   [${ANIMAL}] = 2*${N},
#pragma CMOD done
};

2.1.2 … and its replacement

  • the #pragma directives ensure that the inner line is copied three times
  • the "meta-variable" ${ANIMAL} iterates over "goose", "dog" and "cat"
  • ${N} holds the number of the current copy, starting at 0
double A[] = {
   [goose] = 2*0,
   [dog] = 2*1,
   [cat] = 2*2,
};
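
The replication rule can be mimicked in a few lines of plain C. The following is only a sketch of the semantics described above, not shnell's implementation (which works on tokens with sh and sed). The names foreach_expand and append_replaced, the fixed buffer sizes and the plain string substitution are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append to dst (capacity cap) a copy of src in which every occurrence of
   pat is replaced by rep. Returns 0 on success, -1 if dst is too small. */
static int append_replaced(char *dst, size_t cap, const char *src,
                           const char *pat, const char *rep) {
    size_t used = strlen(dst);
    size_t plen = strlen(pat), rlen = strlen(rep);
    int status = 0;
    while (*src) {
        if (plen && strncmp(src, pat, plen) == 0) {
            if (used + rlen >= cap) { status = -1; break; }
            memcpy(dst + used, rep, rlen);
            used += rlen;
            src += plen;
        } else {
            if (used + 1 >= cap) { status = -1; break; }
            dst[used++] = *src++;
        }
    }
    dst[used] = '\0';
    return status;
}

/* Hypothetical model of "foreach VAR:COUNTER = values...": the body is copied
   once per value; ${VAR} becomes the value, ${COUNTER} the copy number. */
int foreach_expand(char *out, size_t cap, const char *body,
                   const char *var, const char *counter,
                   const char *const values[], size_t count) {
    char vpat[64], cpat[64], number[32], copy[1024];
    assert(out && cap > 0);
    if (snprintf(vpat, sizeof vpat, "${%s}", var) >= (int)sizeof vpat) return -1;
    if (snprintf(cpat, sizeof cpat, "${%s}", counter) >= (int)sizeof cpat) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        copy[0] = '\0';
        snprintf(number, sizeof number, "%zu", i);
        /* first resolve ${VAR} in a fresh copy of the body ... */
        if (append_replaced(copy, sizeof copy, body, vpat, values[i])) return -1;
        /* ... then resolve ${COUNTER} while appending the copy to the output */
        if (append_replaced(out, cap, copy, cpat, number)) return -1;
    }
    return 0;
}
```

Called with the body line of the example, the variable names ANIMAL and N and the three values goose, dog and cat, this sketch produces the three initializer lines of the replacement shown above.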

2.1.3 stringification

  • Similar code can use the "stringified" parameters:
char const* names[] = {
#pragma CMOD amend foreach ANIMAL = goose dog cat
   [${ANIMAL}] = #${ANIMAL},
#pragma CMOD done
};
  • results in
char const* names[] = {
   [goose] = "goose",
   [dog] = "dog",
   [cat] = "cat",
};
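
Both expansions are ordinary C as soon as the designators are known constants. The sketch below assumes an enumeration animal that declares goose, dog and cat (the text does not show where these constants come from) and adds a hypothetical reverse lookup animal_index to show the two tables working together.

```c
#include <assert.h>
#include <string.h>

/* Assumption: the array designators are enumeration constants. */
enum animal { goose, dog, cat, animal_count };

/* Expansion of the first example. */
double A[] = {
   [goose] = 2*0,
   [dog] = 2*1,
   [cat] = 2*2,
};

/* Expansion of the stringification example. */
char const* names[] = {
   [goose] = "goose",
   [dog] = "dog",
   [cat] = "cat",
};

static_assert(sizeof names / sizeof names[0] == animal_count,
              "one name per animal");

/* Hypothetical helper: map a name back to its index, or -1 if unknown. */
int animal_index(char const* name) {
    for (int i = 0; i < animal_count; ++i)
        if (strcmp(names[i], name) == 0) return i;
    return -1;
}
```

Because the index and the string are generated from the same list of tokens, the two tables cannot drift apart when an animal is added or removed.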

2.2 general approach

In the general case such a directive is identified

  • with a tag, here "CMOD" (but that can be modified),
  • with a rule, here "amend", to say that the code up to the next "done" is changed,
  • with a directive, here "foreach", that names the script that is to be run,
  • and with a list of arguments to the directive, here the five tokens "ANIMAL = goose dog cat".

2.3 identifier im- and export

2.3.1 Example: identifier export

#pragma CMOD amend export string : funny string
/* secret has no linkage, not exported */
enum secret {
  /* but size and length become external names */
  string::size = 32,
  string::length = size-1,
};
struct string {
  char s[length];
};
/* external linkage! */
_Thread_local string myown;
/* INIT becomes an external name */
#define string::INIT(S) { .s = S, }
#define string::EMPTY ((string)INIT(""))

2.3.2 … its replacement as a .c

/* secret has no linkage, not exported */
enum secret {
  /* but size and length become external names */
  funny_string_size = 32,
  funny_string_length = funny_string_size-1,
};
typedef enum secret secret; /* convenience */
typedef struct funny_string funny_string; /* convenience */
struct funny_string {
  char s[funny_string_length];
};
/* external linkage! */
_Thread_local funny_string funny_string_myown;
#define funny_string_INIT(S) { .s = S, }
#define funny_string_EMPTY ((funny_string)funny_string_INIT(""))

2.3.3 … and the view by others (.h) …

/* secret has no linkage, not exported */
enum some_unguessable_name_for_secret {
  /* but size and length become external names */
  funny_string_size = 32,
  funny_string_length = funny_string_size-1,
};
typedef struct funny_string funny_string; /* convenience */
struct funny_string {
  char s[funny_string_length];
};
/* external linkage! */
extern _Thread_local funny_string funny_string_myown;
#define funny_string_INIT(S) { .s = S, }
#define funny_string_EMPTY ((funny_string)funny_string_INIT(""))

2.3.4 Example: implicit import …

#pragma CMOD amend implicit : funny string

int main(int argc, char* argv[argc+1]) {
   string* myownp = &string::myown;
   string little = string::INIT("little");
   stdc::printf("my string is %s\n", little.s);
}

2.3.5 … and replaced.

#include "stdc.h"
#include "funny-string.h"

int main(int argc, char* argv[argc+1]) {
   funny_string* myownp = &funny_string_myown;
   funny_string little = funny_string_INIT("little");
   printf("my string is %s\n", little.s);
}
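
To see that the replaced code is plain C, the header view and the user code can be put into one translation unit. This is a sketch under assumptions: the definition of funny_string_myown is added here only so that the unit links on its own, and funny_string_demo is a hypothetical stand-in for main that formats into a buffer instead of printing.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for "funny-string.h", copied from the header view above. */
enum some_unguessable_name_for_secret {
  funny_string_size = 32,
  funny_string_length = funny_string_size-1,
};
typedef struct funny_string funny_string;
struct funny_string {
  char s[funny_string_length];
};
#define funny_string_INIT(S) { .s = S, }
#define funny_string_EMPTY ((funny_string)funny_string_INIT(""))

/* The header only declares this object; it is defined here so that
   the sketch is complete by itself. */
_Thread_local funny_string funny_string_myown;

static_assert(sizeof(funny_string) == funny_string_length,
              "the structure holds exactly the character array");

/* Hypothetical function that does what main does in the text, but formats
   into a caller-supplied buffer instead of printing. */
int funny_string_demo(char *buf, size_t n) {
   funny_string* myownp = &funny_string_myown;
   funny_string little = funny_string_INIT("little");
   *myownp = funny_string_EMPTY;
   return snprintf(buf, n, "my string is %s\n", little.s);
}
```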

3 Command line tools

3.0.1 shnell

  • The central tool is shnell:
    • reads a C file
    • processes it
    • dumps the result to stdout.
  • If you want to keep track of the intermediate code, this would be your tool of choice.
  • Easy to integrate into a compilation chain by providing make rules

3.0.2 executable dialects

To avoid having to

  • keep track of the modified sources and
  • apply the same set of directives to each source file,

shnell provides

  • shnl files that group directives together and create something like dialects of the C language, see load,
  • compiler prefixes that can be used to apply such a dialect and all included directives to a source and to run your favorite compiler directly on the result. Current such prefixes are shneller, trade and posix.

3.0.3 example: using trade

Example: apply the TRADE policy to a source file toto.c during compilation
trade gcc -Wall -c -O3 -march=native toto.c
  • We prefix the compiler command line with the command "trade".
    • This filters the file name and task from the command line.
    • It performs the source-to-source rewriting.
    • It compiles the result to an object file toto.o.

3.0.4 example: using trade

  • Similarly, without -c
trade gcc -Wall -O3 -march=native toto.c mind.o mund.o
  • Takes the first source (toto.c)
  • Does all of the above.
  • Links all the objects into an executable "toto" if possible.

3.0.5 example: using trade

  • If there are only .o files:
trade gcc -Wall -O3 -march=native toto.o mind.o mund.o
  • Only the linker phase is performed.

3.0.6 example: using trade

  • Command line flags that are understood:
    • -c compile to object .o file
    • -E to perform all rewriting and preprocessing
    • -S to produce an assembler file
    • -M to produce nothing but the side effects of compilation such as the header file (see export).

4 Directives, how the C programmer sees them

4.1 amend, insert and load directives

4.1.1 scans

  • Directives are found by one or several scans that shnell performs on a C source.
  • Directives come in three different flavors that can induce several such scans.

4.1.2 amend

  • The scope is up to
    • the next (nesting) "done" directive or
    • the end of the source file.
  • The code is piped into the command.
  • The result is inserted in place.
  • The command also receives the argument list of the directive over some side channel.
  • amend directives typically modify the code
Example: foreach as above
  • the code is repeated several times
  • each copy is modified by resolving the meta-variables

4.1.3 insert

  • Has no scope.
  • The command does not receive input.
  • It only receives the arguments.
  • The result is inserted in place.
  • Scan of the source file then continues directly after.
  • The inserted code has no influence on shnell for this scan.
  • An insert directive typically just puts some declarations or definitions in place.
Example: enum directive
  • Defines an enumeration type and some dependent functions.
#pragma CMOD insert enum animal = goose dog cat

4.1.4 load

  • Inserts a set of directives that are found in shnl files.
    • shnl files condense complicated patterns
  • Scanning continues from the top of the inserted lines.
  • They may contain several amend directives that are not terminated.
  • Possibly multiple scans of the whole source file.
Example: CONSTEXPR directive
  • Allows several nested "evaluations" of variables.
  • The results are expressions that are evaluated at compile time.

4.1.5 recursion

  • Nested occurrence of amend directives leads to finite recursion.
Example: two nested do directives
double A[3][2] = {
#pragma CMOD amend do I = 3
    [${I}] = {
#pragma CMOD amend do J = 2
       [${J}] = 2*${I} + ${J},
#pragma CMOD done
    },
#pragma CMOD done
};
  1. start collecting the code SI immediately after the first do
  2. when collecting SI, the second do directive is encountered
  3. collection of the code SJ starting after that second do is started, until the first done is encountered. SJ now has:

    [${J}] = 2*${I} + ${J},
    
  4. SJ is fed into the do directive for variable J and value 2.
  5. The do directive replicates SJ twice and replaces all occurrences of ${J} by 0 and 1, respectively, to obtain a code TJ:

    [0] = 2*${I} + 0,
    [1] = 2*${I} + 1,
    
  6. TJ is inserted into SI instead of the directive, resulting in a replaced code RI.
  7. The scan for the first directive is continued until the second done is encountered, resulting in a code QI:

    [${I}] = {
       [0] = 2*${I} + 0,
       [1] = 2*${I} + 1,
    },
    
  8. QI is fed into the do directive for variable I and value 3.
  9. The do directive replicates QI three times and replaces all occurrences of ${I} by 0, 1, and 2, respectively, to obtain a code TI:

    [0] = {
       [0] = 2*0 + 0,
       [1] = 2*0 + 1,
    },
    [1] = {
       [0] = 2*1 + 0,
       [1] = 2*1 + 1,
    },
    [2] = {
       [0] = 2*2 + 0,
       [1] = 2*2 + 1,
    },
    
  10. TI is then inserted in place of the whole #pragma construct.

So after completion of the inner directive and insertion of its result (step 6), the code is as if we had written:

double A[3][2] = {
#pragma CMOD amend do I = 3
    [${I}] = {
       [0] = 2*${I} + 0,
       [1] = 2*${I} + 1,
    },
#pragma CMOD done
};

Only then is the outer directive applied, and the overall result after step 10 is

double A[3][2] = {
    [0] = {
       [0] = 2*0 + 0,
       [1] = 2*0 + 1,
    },
    [1] = {
       [0] = 2*1 + 0,
       [1] = 2*1 + 1,
    },
    [2] = {
       [0] = 2*2 + 0,
       [1] = 2*2 + 1,
    },
};
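
The final expansion can be checked against the formula that the two meta-variables encode. The sketch below repeats the result of step 10 and adds a hypothetical checker nested_do_matches; it only verifies the outcome and says nothing about how shnell performs the recursion.

```c
#include <assert.h>
#include <stddef.h>

/* The overall result after step 10, copied from above. */
double A[3][2] = {
    [0] = {
       [0] = 2*0 + 0,
       [1] = 2*0 + 1,
    },
    [1] = {
       [0] = 2*1 + 0,
       [1] = 2*1 + 1,
    },
    [2] = {
       [0] = 2*2 + 0,
       [1] = 2*2 + 1,
    },
};

static_assert(sizeof A / sizeof A[0] == 3, "three copies from the outer do");
static_assert(sizeof A[0] / sizeof A[0][0] == 2, "two copies from the inner do");

/* Returns 1 if every element satisfies A[i][j] == 2*i + j, else 0. */
int nested_do_matches(void) {
    for (size_t i = 0; i < 3; ++i)
        for (size_t j = 0; j < 2; ++j)
            if (A[i][j] != 2.0*i + j) return 0;
    return 1;
}
```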

4.2 arguments to directives

4.2.1 meta-variables

  • Several constructs use meta-variables of the form ${NAME}.
  • These are replaced in the processed source with their values.
  • The replacement can be modified with # and ## operators, similar to what happens in the C preprocessor.
    • # is "stringification" and
    • ## merges with the tokens to the left or to the right.
Examples: do, foreach, env, bind, let, …
  • For a more detailed discussion have a look into "regVar".
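
For comparison, this is what the two operators do in the C preprocessor itself. The sketch shows only the preprocessor behavior that the meta-variable operators are said to resemble; the exact shnell rules are those documented in "regVar". The macro names STRINGIFY and GLUE are arbitrary.

```c
#include <assert.h>

/* C preprocessor counterparts of the two operators. */
#define STRINGIFY(X) #X                /* # turns the argument into a string literal */
#define GLUE(LEFT, RIGHT) LEFT##RIGHT  /* ## merges two tokens into one */

int GLUE(goose, _count) = 3;               /* declares the object goose_count */
char const* goose_name = STRINGIFY(goose); /* initialized with "goose" */

static_assert(sizeof STRINGIFY(goose) == 6, "five letters plus terminator");
```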

4.3 amend.cfg

4.3.1 amend.cfg: a list of approved directives

  • shnell does not allow arbitrary code to be executed.
  • Directives and shnl files have to be approved.
  • The file amend.cfg shows all directives that are available.
  • To add a new directive to the tool box:
    • install it in a specific directory
    • add an entry to amend.cfg.

5 Directives, how the implementor sees them

5.0.1 sh(n)ell programming

  • A directive
    • receives the code that it has to treat on stdin and
    • sends the modified code to stdout.
  • The surrounding tasks of
    • cutting the code out of context and
    • reinserting the result in place
    are done by shnell.
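
Since filters may be written in C itself, the stdin/stdout contract can be sketched as follows. This is a minimal illustration, not one of the shipped directives: repeat_filter is a hypothetical name, and the way a real directive receives its arguments (the side channel mentioned earlier) is not modeled, so the repetition count is simply a function parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read all of `in`, then write that text `times` times to `out`.
   Returns 0 on success and -1 on an allocation or I/O failure. */
int repeat_filter(FILE *in, FILE *out, unsigned times) {
    size_t cap = 4096, len = 0;
    assert(in && out);
    char *buf = malloc(cap);
    if (!buf) return -1;
    for (;;) {
        if (len == cap) {               /* buffer full: double it before reading */
            char *bigger = realloc(buf, cap * 2);
            if (!bigger) { free(buf); return -1; }
            buf = bigger;
            cap *= 2;
        }
        size_t got = fread(buf + len, 1, cap - len, in);
        if (got == 0) break;            /* end of input or read error */
        len += got;
    }
    int failed = ferror(in);
    for (unsigned i = 0; !failed && i < times; ++i)
        failed = fwrite(buf, 1, len, out) != len;
    free(buf);
    return failed ? -1 : 0;
}
```

A filter executable would call repeat_filter(stdin, stdout, 2) from its main function; cutting the code out and splicing the result back in remains the job of shnell.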

5.0.2 tokenization

When using the shnell executable on a given C code prog.c,

  • a tokenizer splits the program into tokens as defined by the C language.
  • All intermediate tools see isolated tokens:
    • identifiers
    • keywords
    • numbers
    • punctuators
    • strings
    • comments
    • … and a lot of control characters and white space
  • No further lexical analysis is required.

5.0.3 detokenization

At the end, after all transformations have been applied

  • the tokenization is reverted
  • the original line and spacing structure reappears.
  • reader friendly:
    • keep or inspect intermediate steps of a transformation.
  • compiler friendly:
    • code is annotated with #line directives
    • trace errors back to the source file.
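
The #line directive is standard C: it resets the line number and file name that the compiler uses for diagnostics and for __LINE__ and __FILE__. The sketch below shows that effect in isolation; the file name toto.c and the line number 100 are arbitrary choices, not output copied from shnell.

```c
#include <assert.h>

/* From here on the compiler counts as if the next line were line 100 of toto.c. */
#line 100 "toto.c"
static_assert(__LINE__ == 100, "numbering restarts at the given line");
int reported_line(void) { return __LINE__; }  /* this line counts as 101 */
char const* reported_file(void) { return __FILE__; }
```

An error in such rewritten code is therefore reported at the position in the original source rather than at the position in the intermediate file.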

5.0.4 Shell modules

The whole toolbox is by itself constructed from reusable pieces:

  • regexp matching (match),
  • temporary files with garbage collection,
  • split and join,
  • hash tables and
  • … many more.

5.0.5 import.sh

This has to be explicitly sourced with a magic line:

SRC="$_" . "${0%%/${0##*/}}/import.sh"

Other shell modules can then be imported like this:

import arguments
import tmpd
import tokenize
import match

5.0.6 … with documentation

Terminate such an import and documentation section with the line

endPreamble $*

If the module is run by itself as an executable script with the option --html, it extracts the documentation for the module.

6 Install and usage

6.0.1 Compilation

  • shnell is almost entirely implemented in scripting languages.
  • Only one tiny bit still needs compilation, bin/isatty.so.
  • If you also want the optional complete Unicode support, you should also compile the tools-c/ directory.
  • To additionally test shnell, compile the code in complements/.
  • All these steps can be launched by

    make -j N

    where N is the number of cores of your system or less.

6.0.2 Installation

  • shnell can be used from wherever it is installed.
  • The scripts locate the directory in which they reside and look for other components (shnl/, legacy/ and tools-c/) relative to that.
  • For example, to operate from /usr/local, copy the corresponding contents to
    /usr/local/bin
    /usr/local/shnl
    /usr/local/legacy
    /usr/local/tools-c (optional, binaries only)

6.0.3 Usage

Any of the following should work

  • Use an absolute path name to refer to the tool that you are using.
  • Add the bin/ directory to your PATH environment variable.
  • Use a system-wide place to install the binaries and other directories as indicated above.

6.0.4 Development of directives

  • Development of directives should take place in your private copy
  • To add a new directive
    • write your filter
    • install your filter in bin/
    • add your filter to amend.cfg
  • To add a new executable dialect
    • test your idea in an example
    • move the #pragma directives you need to the top of the example
    • write an .shnl file (e.g. TOTO.shnl) that comprises these #pragma directives
    • copy that .shnl file to shnl/
    • in bin/, establish a softlink totoshneller
  • Share your developments with others if you may!

6.0.5 Copyright, license and distribution

  • Copyright © 2015-2020 Jens Gustedt
  • The shnell project is licensed under a standard MIT license
----------------------------------------------------------------------
Copyright © 2015-2020 Jens Gustedt

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
----------------------------------------------------------------------
  • This work is distributed at

https://gustedt.gitlabpages.inria.fr/shnell/

  • The sources can be found at

https://gitlab.inria.fr/gustedt/shnell

Author: Jens Gustedt

Created: 2020-11-13 Fr 19:03
