Hague: A Story of Macro Expansions

One of the greatest challenges and joys of learning to program is taking another person’s code and working out what it does and how it does it; however, this can be difficult when the code is poorly written. So what happens when you try to decode a project that was intentionally written badly? This is the question I got to answer when I had to tackle code submitted to the IOCCC, the International Obfuscated C Code Contest. The IOCCC, https://www.ioccc.org/, is a contest that shows the importance of coding style in the best way possible: by breaking just about every standard style rule (within given limits) while still keeping the code functional. The code in question for today’s task belongs to Jim Hague, winner of the 1986 award for “Worst abuse of the C preprocessor”. Hague’s code takes input directly from the user and converts it to Morse code. While this might seem like a simple enough problem, his solution uses user-defined macros to make it about as convoluted as you can get, as shown below:

#define DIT (
#define DAH )
#define __DAH ++
#define DITDAH *
#define DAHDIT for
#define DIT_DAH malloc
#define DAH_DIT gets
#define _DAHDIT char
_DAHDIT _DAH_[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:"
;main DIT DAH{_DAHDIT
DITDAH _DIT,DITDAH DAH_,DITDAH DIT_,
DITDAH _DIT_,DITDAH DIT_DAH DIT
DAH,DITDAH DAH_DIT DIT DAH;DAHDIT
DIT _DIT=DIT_DAH DIT 81 DAH,DIT_=_DIT
__DAH;_DIT==DAH_DIT DIT _DIT DAH;__DIT
DIT'\n'DAH DAH DAHDIT DIT DAH_=_DIT;DITDAH
DAH_;__DIT DIT DITDAH
_DIT_?_DAH DIT DITDAH DIT_ DAH:'?'DAH,__DIT
DIT' 'DAH,DAH_ __DAH DAH DAHDIT DIT
DITDAH DIT_=2,_DIT_=_DAH_; DITDAH _DIT_&&DIT
DITDAH _DIT_!=DIT DITDAH DAH_>='a'? DITDAH
DAH_&223:DITDAH DAH_ DAH DAH; DIT
DITDAH DIT_ DAH __DAH,_DIT_ __DAH DAH
DITDAH DIT_+= DIT DITDAH _DIT_>='a'? DITDAH _DIT_-'a':0
DAH;}_DAH DIT DIT_ DAH{ __DIT DIT
DIT_>3?_DAH DIT DIT_>>1 DAH:'\0'DAH;return
DIT_&1?'-':'.';}__DIT DIT DIT_ DAH _DAHDIT
DIT_;{DIT void DAH write DIT 1,&DIT_,1 DAH;}

The terrible spacing aside, this code looks like a language unto itself, far removed from C, yet compilers at the time were able to build it into a working program. For today’s task I decided to try to translate it by hand, and to be completely honest, at first it was a mountain of a task that I was only able to overcome thanks to the help of two of my peers, Kelsie Merchant (found at https://www.linkedin.com/in/kelsie-merchant-physics/) and Benjamin Dosch (found at https://www.linkedin.com/in/benjamin-dosch-872a4731/).

My first issue came in the form of technical difficulties: my terminal and file editor exaggerated the spaces in the code above. At first I did not stress about it too much; however, when I compiled and ran the program, all I got was a blank line, and I assumed the spacing was causing problems. (At that point I had not yet noticed the gets() call in the code, which waits for a line of input from the user’s terminal.) I reached out to Kelsie, who is one of my student tutors at Holberton Tulsa, and she recommended that I take the raw code and translate it by “hand” so that I could see what was going on. This is where I ran into my second obstacle: I accidentally over-translated the code, producing a rather confusing mess.

main()
{
Char *_(,*)_, * (_, * _(_,*malloc (), * gets ( );
for( _(=malloc(81),(_=_(++;_(==gets ( _( );__((‘\n’) )for( )_=_(;*)_;__(( *_(_?_)( *(_ ):’?’),__((‘ ‘),)_ ++) for(*(_=2,_(_=_DAH_; *_(_&&(* _(_!=( * )_>=’a’? *)_&223:* )_ ) );(*(_ ) ++,_(_++ )* (_+= ( * _(_>=’a’?*_(_-’a’:0);}_) ( (_ ){ __( ((_>3?_) ( (_>>1 ):’\0');
return(_&1?’-’:’.’;}__( ( (_ ) char(_;{( void ) write (1,&(_,1 );
}

After consulting with Kelsie once more, I realized that while expanding the user-defined macros I had ignored the fact that _DIT is different from DIT without the underscore. Let us freeze right here and take a look at what I am talking about. At the top of Hague’s code we have several lines of the form

#define MACRO data

This defines a macro, a way to give a name to a sequence of tokens in C. Before compilation, the preprocessor replaces every use of a macro in your code with its definition, so MACRO would become data. In Hague’s code every standalone DIT became ‘(’, every DAH became ‘)’, __DAH (double underscore) became ++, and so on. The replacement only applies to whole identifiers, which is why names such as _DIT, DAH_, DIT_, and _DIT_ are left untouched. We can see the expanded code by running

gcc filename.c -E -o hague

This writes the contents of our .c file, with all the macros expanded, to a file called hague, which reads as follows:

# 1 "hague.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "hague.c"
# 9 "hague.c"
char _DAH_[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:"
;main ( ){char
* _DIT,* DAH_,* DIT_,
* _DIT_,* malloc (
),* gets ( );for
( _DIT=malloc ( 81 ),DIT_=_DIT
++;_DIT==gets ( _DIT );__DIT
('\n') ) for ( DAH_=_DIT;*
DAH_;__DIT ( *
_DIT_?_DAH ( * DIT_ ):'?'),__DIT
(' '),DAH_ ++ ) for (
* DIT_=2,_DIT_=_DAH_; * _DIT_&&(
* _DIT_!=( * DAH_>='a'? *
DAH_&223:* DAH_ ) ); (
* DIT_ ) ++,_DIT_ ++ )
* DIT_+= ( * _DIT_>='a'? * _DIT_-'a':0
);}_DAH ( DIT_ ){ __DIT (
DIT_>3?_DAH ( DIT_>>1 ):'\0');return
DIT_&1?'-':'.';}__DIT ( DIT_ ) char
DIT_;{( void ) write ( 1,&DIT_,1 );}
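
The preprocessor only replaces whole identifiers, which is why DIT expands in the output above while _DIT does not. To make that concrete, here is a minimal sketch (not part of Hague’s entry) that reuses three of his macros. The function name sum_below and the variable total are made up for illustration.

```c
/* Three of Hague's macros, copied from his entry. */
#define DIT (
#define DAH )
#define DAHDIT for

/*
 * Hypothetical example function: adds up the integers 0 .. n-1.
 * After preprocessing, the header below reads "int sum_below ( int n )"
 * and the loop reads "for ( _DIT = 0; _DIT < n; _DIT++ )".
 * Note: names such as _DIT are reserved for the implementation in
 * standard C; the name is used here only to mirror Hague's style.
 */
int sum_below DIT int n DAH
{
    int _DIT;       /* an ordinary variable; the DIT macro does not touch it */
    int total = 0;

    DAHDIT DIT _DIT = 0; _DIT < n; _DIT++ DAH
        total += _DIT;
    return total;
}
```

Calling sum_below(5) adds 0 + 1 + 2 + 3 + 4. Running this file through gcc -E should show the expanded loop with _DIT still intact.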

While the spacing of this code is still pretty bad, we can finally see the framework of actual C code and begin to determine just how it works, and we can see where I went wrong in my initial translation. Hague had defined several variables: _DIT, DAH_, DIT_, and _DIT_; in my haste I had converted parts of these names into further ‘(’ and ‘)’, removing the essence of the code along with them. Now that I was back on the right track, I went about cleaning up the code. The compiler could read it without problems, but to me all the DITs, DAHs, and underscores began to blend together, so I made my own further (but this time proper) substitutions for Hague’s variables to increase readability:

_DIT  -> str1
DAH_  -> str2
DIT_  -> str3
_DIT_ -> str4

After these initial renames I noticed three outliers in the remaining code: _DAH_, which I later remembered was not a macro at all but the character array Hague had defined right after his macros, _DAH (single leading underscore), and __DIT (double underscore). The code’s spacing was also still pretty bad at this point, so I approached my Holberton peer Ben, and together we set about separating the code according to standard formatting practices. What we eventually ended up with was this:

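char _DAH_[]="ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";

main ( )
{
    char *str1, *str2, *str3, *str4, *malloc ( ), *gets ( );

    /* outer loop: read a line into str1; str3 points at the spare byte before it */
    for ( str1 = malloc( 81 ), str3 = str1++; str1 == gets(str1); __DIT('\n') )
        /* middle loop: one pass per input character, printing its code and a space */
        for ( str2 = str1; *str2; __DIT( *str4 ? _DAH ( *str3 ) : '?' ), __DIT(' '), str2++ )
            /* inner loop: count through the table until the character is found */
            for ( *str3 = 2, str4 = _DAH_; *str4 && ( *str4 != ( *str2 >= 'a' ? *str2 & 223 : *str2 ) ); (*str3)++, str4++ )
                *str3 += ( *str4 >= 'a' ? *str4 - 'a' : 0 );
}

_DAH (str3)
{
    /* print the earlier symbols first, then return the last one */
    __DIT ( str3 > 3 ? _DAH ( str3 >> 1 ) : '\0' );
    return str3 & 1 ? '-' : '.';
}

__DIT (str3)
char str3;
{
    (void) write ( 1, &str3, 1 );
}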
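
The definition of __DIT at the end of the listing above puts only the parameter name in the parentheses and declares its type on the following line. As a rough comparison, here is how the same function might look in the prototype style used today. The name put_char is made up for this sketch, and unlike the original it returns the result of write() instead of returning nothing useful.

```c
#include <unistd.h>   /* write() */

/*
 * Old-style form, as in the listing above:
 *
 *     __DIT (str3)
 *     char str3;
 *     {
 *         (void) write ( 1, &str3, 1 );
 *     }
 *
 * Prototype-style sketch of the same idea (hypothetical name put_char):
 * the parameter's type is written inside the parentheses.
 */
int put_char(char c)
{
    /* write() sends one byte to file descriptor 1 (standard output) and
     * returns the number of bytes written, or -1 on error. */
    return (int)write(1, &c, 1);
}
```

The function body is unchanged; only the way the parameter is declared differs.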

By this point we had realized that __DIT and _DAH were functions separate from the main function, and the code finally began to become clearer. The array _DAH_ was simply the Morse code binary tree, as shown at the top of the page and again below for convenience, written top to bottom, right to left:

__DIT looked familiar almost immediately: it does the same job as putchar(), writing a single character to standard output, and it was how Hague was writing the dits (also known as ‘.’) and the dahs (also known as ‘-’). That also cleared up why Hague chose those seemingly arbitrary names for all his macros, variables, and functions. It was also at this point that we realized that some of the syntax that had made little to no sense to us, such as __DIT(str3) char str3;, was actually a product of its time: it is old K&R C, where a function definition lists only the parameter names in the parentheses and declares their types afterwards, before the opening brace.

_DAH is a recursive function. For each character of the line read by gets(), main looks the character up in _DAH_ and works out a number for it; _DAH then turns that number into dits and dahs one binary digit at a time, and each one is handed to __DIT to be printed. So E becomes . , A becomes .- , etc., with letters separated in the output by spaces. Punctuation that appears in the table, such as ‘,’, is translated as well, while any character that is not in the table, including a space, is printed as ? (which is how the gaps between words show up in the output).

The outer for loop has no exit condition other than gets() failing, so the program keeps accepting string after string to translate into Morse code until the input ends or the user manually kills it.
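
To make the description above more concrete, here is a minimal sketch of how I read the table lookup and the recursion, rewritten with readable names. The names MORSE_TABLE, morse_lookup, and morse_symbols are made up for illustration, and the sketch fills a caller-supplied buffer instead of calling write(), so it is a paraphrase of the logic and not Hague’s actual code.

```c
#include <stddef.h>

/* Hague's table, copied verbatim from his _DAH_ array. */
static const char MORSE_TABLE[] =
    "ETIANMSURWDKGOHVFaLaPJBXCYZQb54a3d2f16g7c8a90l?e'b.s;i,d:";

/*
 * Hypothetical helper mirroring the innermost loop of main:
 * returns the number assigned to ch, or 0 if ch is not in the table.
 */
int morse_lookup(char ch)
{
    int code = 2;                   /* the first entry, 'E', gets the number 2 */
    const char *entry;

    if (ch >= 'a')
        ch &= 223;                  /* clear bit 5: lower case becomes upper case */
    for (entry = MORSE_TABLE; *entry != '\0'; entry++, code++) {
        if (*entry == ch)
            return code;
        if (*entry >= 'a')
            code += *entry - 'a';   /* lower-case entries skip unused numbers */
    }
    return 0;
}

/*
 * Hypothetical helper mirroring _DAH: writes the dots and dashes for a
 * non-zero result of morse_lookup into out (at least 8 bytes) and
 * returns how many symbols were written. The recursion handles the
 * higher bits first; the leading 1 bit is only a start marker.
 */
size_t morse_symbols(int code, char *out)
{
    size_t n = 0;

    if (code > 3)
        n = morse_symbols(code >> 1, out);
    out[n] = (code & 1) ? '-' : '.';
    out[n + 1] = '\0';
    return n + 1;
}
```

With this sketch, morse_lookup('A') gives 5, which is 101 in binary; dropping the leading 1 leaves 01, so morse_symbols writes .- into the buffer.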

Example of code running with the original mess and example output

At this time my grasp of the nested ternary operators and old K&R C is not yet good enough for me to give a full step-by-step account of how the string travels from user input, gets taken apart, and is eventually printed as dits and dahs. But that is probably the second greatest joy of learning to program: being able to go back to old code and improve it by making it more efficient or, when reviewing another programmer’s code, being able to understand better exactly what is happening at each step. I definitely look forward to coming back and updating this article with a more in-depth look at the final translated code.