The Dada Engine is a system for the generation of text from a specification. The specification is a text file containing rules in the form of a (somewhat augmented and/or bogotified) grammar. These files are fed to the Dada Engine, which handles (or evaluates) them and emits the resulting text.
The Dada Engine is based on the principle of recursive transition networks, or recursive grammars. A recursive transition network (RTN) is a diagram which shows how a task, such as the construction of a sentence, may be carried out. A RTN consists of paths which may be followed and of operations along these paths which must be carried out. For example, the following RTN defines how to construct one type of sentence:
sentence: +-----------+ +>| adjective |>+ +-------------+ | +-----------+ | +------+ +------+ (start)->-| preposition |->-+------->-------+>| noun |--| verb |-(end) +-------------+ +------+ +------+
If one follows the network from the start to the end, one passes through boxes representing the various elements which make up the sentence in sequence: first a preposition, then an optional adjective, then a noun and then a verb. After the preposition, the network branches into two paths, one of which leads to the adjective and the other which bypasses it and goes straight on to the noun. One may take either path in the course of constructing a sentence. Nouns, verbs and the like could be represented by similar RTNs; a RTN for an adjective follows:
adjective: +---------+ +---| "large" |--+ | +---------+ | | +---------+ | (start) -->--+---| "green" |--+-->-- (end) | +---------+ | | +---------+ | +---| "round" |--+ +---------+
These recursive transition networks are represented in text as rules; the two networks above would be written as:
sentence: pronoun [ adjective | "" ] noun verb ; adjective: "large" | "green" | "round" ;
These would be part of the grammar defining the space of possible sentences.
A recursive transition network may call itself, either directly or indirectly. Thus, things like the following are possible:
verb: "said `" sentence "'" ; adjective: "large" | "green" | "round" | "very " adjective ;
The first rule above calls the sentence
rule which called it;
the second rule may, as one of its options, call itself, prepending
the word "very" to its output.
(Note that in reality one would not write a pair of rules such as
the sentence
and verb
rules above; in this case, the two
rules would keep calling each other and never finish. In reality, one
should define more alternatives to break the cycle.)
To define a system of RTNs or rules to the Dada Engine, write it in
the notation above (the language known as pb
) in an ASCII text file;
this file is known as a script.
As well as recursive transition networks, a Dada Engine script may contain
other practical functions, such as mappings, transformations and variables.
See Chapter 2 for more information.
When you start the Dada Engine, you do so with the dada
command,
like so:
% dada myscript.pb
The command dada
first pipes your script (in this case
`myscript.pb') through the C preprocessor; this allows you to include
preprocessor macros, conditional definitions and external libraries of
rules. The output of the preprocessor is then sent to pb
, a program
which reads in and evaluates the rules.
The pb
command reads the complete input before it emits anything.
Once it has read everything, it evaluates the initial rule (also known as
the start symbol); in turn, this causes any rules invoked from the
initial rule to be evaluated.
Unless a start symbol is specified on the command line,
the first rule read that is not defined as a resource rule
becomes the start symbol.
As the rules are evaluated, the text generated is appended to a buffer in memory; once everything has been evaluated, the buffer's contents are emitted.
The dada
command accepts several command-line options.
See the UNIX manual page supplied for details.
Comments in pb are handled as in C++ and Java. A comment of any length may be enclosed between the sequence `/*' and the sequence `*/'. Alternatively, one-line comments may be preceded by `//'. The following are valid comments:
/* This is a two-line comment. See? */ // You are not expected to understand this.
An identifier in pb is any sequence of letters, digits, underscores and hyphens which starts with a letter or underscore. Identifiers are case-sensitive.
The following are valid identifiers:
futplex sort-of-vague-ambiguous-noun _298R
A literal string in pb is any sequence of characters embedded in quotes. Unlike in C, a string may contain unquoted newlines. The backslash character serves a similar function as in C. The following sequences have special meanings:
\\"
\\n
\\t
\\v
\\b
\\r
\\f
\\a
Atoms are the building blocks of rules. Each atom is a self-contained element which performs a function, usually generating text.
Atoms are divided into several types. These are:
A literal text atom consists of a sequence of characters enclosed by double quotes. Double quotes, newlines and tabs may be included in literal text using C-style backslash escapes.
Examples:
"foo" "\"Come into my parlour,\" said the spider to the fly." "Friends, Romans, Countrymen,\nLend me your ears"
A symbol is an atom which names a rule to be evaluated. A symbol name may be any sequence of letters, digits, underscores and hyphens with the restriction that the first character must be a letter or an underscore. Symbol identifiers are not enclosed by quotes. The following are examples of possible symbol names:
foo noun2 abstract-adjective
A pb rule may contain blocks of embedded code. This code allows expressions to be evaluated and variables to be updated. See section Embedded Code for details.
A block of embedded code is enclosed by curly brackets.
In pb, the form of a rule declaration is:
rule:: <rulename> `:' <options> `;' options:: <options> '|' <option> | <option> option:: | <atom> <option>
rulename is a symbol which is assigned to the rule.
If a rule has several options, one will be selected randomly at the time of evaluation. The atoms of the chosen option will be evaluated in sequence and their results joined into a string, which shall be returned as the result of the rule.
For example, in the following grammar:
sentence: "The ball is " colour "." ; colour: "red" | "green" | "blue" ;
colour
will evaluate to either "red', "green" or "blue"; hence,
sentence
will evaluate to "The ball is red.", "The ball is green.", or
"The ball is blue."
If a rule has several options, the same option will never be chosen twice in immediate succession.
pb allows you to use variables. These may be used to store text (or, usign embedded code, integers). The following atom operators may be used to handle variables:
var=atom
atom
, setting the variable named var
to the result,
and return the result.
var<<atom
var
is not set, evaluate atom
and set var
to the result, returning it; else return the present value of var
.
$var
var
.
To allow grammars to be written more simply, pb includes shortcuts for repeatedly invoking rules. When an atom is followed by the character * (also known as the Kleene star), pb will evaluate this atom zero or more times (determined at random) and return the results, concatenated. Thus, the rule:
foo: bar* ;
has a notionally similar function to :
foo: bar foo | /* nothing */ ;
The + operator has a similar function; in this case, however, the atom is evaluated at least once. The following rule definition:
foo: bar+ ;
has a notionally similar function to :
foo: bar foo | bar ;
A parametric rule is a rule which, when evaluated, takes parameters. This rule is declared with one or more parameter names in brackets after its name, and is invoked in a similar fashion. When the rule is evaluated, the parameters with which it is called are bound to the formal parameters, and references to the formal parameters are replaced with them. For example, if the following rule is defined:
i-am(my-name): "Hello; my name is " my-name "." ;
Then if it is invoked with the atom i-am("Bob")
, it will yield the
result:
-| "Hello; my name is Bob."
If a rule has several parameters, they are separated with spaces.
Note that in the current release there are some restrictions on the use of parameters in expressions; parameters cannot be accessed from embedded code or in inline choices.
As a short-cut, pb allows random choices to be declared inline in a rule, without declaring a separate rule for them. For instance, the rule declaration:
foo: bar [ xyzzy | plugh ] baz ;
has the same result as the following two rules:
foo: bar quux baz ; quux: xyzzy | plugh ;
only the second form defines an additional rule, quux
, to handle the
choice.
Inline choices should be used where a 'once-off' choice is needed; in such cases, they help to eliminate cluttering of the rule-base and make the script more readable. If a choice will be needed more than once, it is a better idea to declare it as a rule.
By default, when the Dada Engine evaluates a script, it resolves the first rule it sees. If you include a file at the start of the script using the C preprocessor, and this file contains rules, the first rule in the file will be selected as the initial rule (or start symbol). This, in most cases, is inconvenient.
To bypass this, the Dada Engine allows you to declare rules as resource rules. A resource rule is a rule which will not become the start symbol, even if the Dada Engine sees it before any other. Resource rules are declared by prepending their names with the `%resource' keyword; for example, the following code fragment:
%resource my_name: "Bob" ; output: "Hi! I'm " my_name "!" ;
will print `Hi! I'm Bob!'. The initial rule is `output', because `my_name' is flagged as being a resource and thus not eligible to be the initial rule.
(Note that there's nothing preventing the user from explicitly invoking a resource rule as the initial rule by specifying it from the command line with the `-s' switch.)
An atom may be silenced by preceding it with a question mark. For
example, to assign the value "blah" to the variable foo
without
outputting it, one could write ?foo="blah"
. The effect is illustrated
by the two sample rules below:
foo: "the" colour=" green" " dragon was" $colour ; foo: "the" ?colour=" green" " dragon was" $colour ;
The first rule would output "the green dragon was green", whereas the
second rule would output "the dragon was green", as the assignment of
" green" to colour
is silenced and generates no output.
Indirection allows the output of a rule to be used as the name of another rule. This is useful when the ranges of valid choices are influenced by a prior choice. For example, the following script:
start: sentence-about(animal) ; animal: "dog" | "cat" ; sentence-about(subject): @subject " is a " subject ; dog: "Fido" | "Spot" ; cat: "Tiddles" | "Fluffy" ;
may produce the sentences "Spot is a dog" or "Fluffy is a cat", but will
never produce "Spot is a cat". When sentence-about
is evaluated,
subject
is set to either "dog" or "cat"; when the first atom is
evaluated, pb uses it as the name of a rule and either evaluates the rule
named dog
or the one named cat
.
There are two methods for transforming text: mappings and
transformations. Both are applied to an atom with the construct
atom>func
, where func
is the name of the mapping or
transformation.
A mapping allows you to transform strings from one set into strings from another. Mappings use regular expressions, and thus one or more strings may be mapped to a string. In addition, mappings may be used to perform regular substitution on strings.
Mappings take the following form:
mapping:: <mapping name> ':' <mapping options> ";" mapping options:: <mapping options> <mapping option> | <mapping option> mapping option:: <source> "->" <destination> | <string1> "<-> <string2> | <pattern> "->" <from>"/"<to>
There are three forms of options; the first form maps a string or set of strings to a string, but not vice versa.
The second form maps bidirectionally between two strings; the option:
"foo" <-> "bar"
is equivalent to the two options:
"foo" -> "bar" "bar" -> "foo"
The third form is used to perform simple sed-style regular expression substitutions. The substitution is performed on any string matching the regular expression to the left of the arrow; in it, all substrings matching the regular expression in the second string are replaced with copies of the third string.
Below are some examples of mappings:
// map several personal names to the appropriate pronouns gender-pronoun: "Alice" -> "she" "Bob" -> "he" "Carol" -> "s/he" // Carol is an ambiguous name ; // map words and their opposites to each other opposite: "dark" <-> "light" "large" <-> "small" "true" <-> "false" ; // a rule for deriving the plurals of words. Works in some cases. pluralise: ".*y$" -> "y$"/"ies" ".*s$" -> "$"/"es" ".*" -> "$"/"s" ;
Transformations are defined as follows:
transformation:: `%trans' <transname> `:' <transform options> ';' transform options:: <transform option> <transform options> | <transform option> transform option:: <regex> `:' [address] <command> `;' address:: <number> [',' <number>]
regex is a literal string containing a regular expression to whose occurrences the transformation is to be applied.
address consists of a range of character positions in which positive numbers refer to displacements from the start of the string and negative ones from the end; -1 refers to the last character of a string.
Each command may be followed by zero or more arguments, each of which is a quoted string. The available commands are:
d delete the text in the address range l convert the text in the address range to lower case s "from" "to" replace all occurrences of the regular expression `from' with the string `to' u convert the text in the address range to upper case
In pb, blocks of code may be embedded in rules. These may return a result or not.
A block of code consists of one or more statements, separated with semicolons. At present, there are only two types of statements: assignments and returns. An assignment evaluates an expression containing either integer literals or integer variables and assigns the result to a variable. A return statement evaluates an expression as above and exits the block of code, returning the result.
In embedded code, an expression is a combination of operations which, when evaluated, yields a value.
pb's embedded code supports two types of expressions: Numeric and textual expressions. Numeric expressions are expressions whose terms are integers. Textual expressions are expressions whose terms are sequences, or strings, of text.
If an expression contains terms which evaluate to both integers and text, the integer terms are first converted to text and it is treated as a textual expression. Likewise, if the result of an expression is being returned to the enclosing grammatical rule, it is first converted to text.
Operator precedence in arithmetic expressions is as in C, with multiplication, division and modulo taking precedence over addition and subtraction, and all operators being left-associative. Brackets may be used to group terms.
The terms of an expression may be other expressions, literal integers, literal strings (in double quotes) or variable names.
The following are valid expressions:
1+1 ((foo*2)-1)/bar "two plus three equals "+(2+3)
The last expression evaluates to the text string, "two plus three equals 5".
In pb, numeric expressions may involve the addition (+), subtraction (-), multiplication (*), division (/), integer modulo (%), random number (..), lesser (<<) and greater (>>) operators.
The addition operator adds two integers; for example, 2+3
would
evaulate to 5
. Similarly, the subtraction operator subtracts two
integers, and the multiplication and division operators multiply and
divide their arguments respectively. The modulo operator evaluates to the
remainder of dividing its first argument by its second; for example,
5%2
would yield 1
.
The random operator selects at random an integer from a range between two
bounding values. It is written as a..b
, where a
is the lower
bound and b
is the upper. For example, the expression 2..5
would
yield either 2, 3, 4 or 5, chosen at random.
The lesser operator (written as <<
) yields the lesser of its two
arguments; for instance, 2<<3
would evaluate to 2, as would
3<<2
. Similarly, the greater operator (written as >>
) yields the
greater of its two arguments.
The concatenation operator, +, may be used in expressions to join strings
of text; for example, "foo" + "bar"
would yield, "foobar"
.
Additionally, rules may be invoked in expressions, yielding string terms; to add a rule invocation as a term, add the rule name preceded by a '@'. The following rules yield equivalent results:
sentence: "The " noun " is green." ; sentence: "The " {=@noun} " is green." ; sentence: {="The " + @noun + " is green." } ;
The first sentence invokes the noun
rule directly; the second invokes
it in a code block, returning its result as the result, which is then strung
together with the literal strings "The "
and "is green."
. The
third form concatenates the three strings wholly within a block of code,
returning the complete result.
To assign a value to a variable, one proceeds as in C.
For example, the following statement increments counter
:
counter = counter+1
A return statement consists of a '=' sign followed by an expression. When the statement is executed, the expression is evaluated and its result converted to a string and returned.
For example, the following rule will return the result "5":
five: { = 2+3 } ;
The Dada Engine has the capability to include prepared files containing predefined rules, mappings and transformations. This is useful, as it allows you to reuse code more easily. In addition, the Dada Engine comes with a small library of standard rules which may be useful for use in scripts.
A few commonly used mappings and transformations are provided in the file `stdmap.pbi'. These mappings include:
upcase
upcase-first
The format library is a set of standard rules which handle various aspects
of output for various formats. It contains definitions of these rules for
various output formats (currently supported formats are plain text, HTML
and troff
, but more can be added easily).
To use these rules in your scripts, first include the file `format.pbi' in your script, by including the following line:
#include <format.pbi>
Then, use the special rules described below to do formatting.
If you use the format library, you can make your script output in a different format by defining the preprocessor symbol representing the output format. Currently, defining HTML will cause the output to be formatted in HTML, otherwise it will be plain text.
Below are documented the various special formatting rules currently implemented:
The PROLOGUE
rule should be invoked before any other rules which
may emit any output. It outputs whatever may be necessary
at the start of a document in the format.
The EPILOGUE
rule should be invoked after all other rules which may
emit any output. It outputs whatever may be necessary at the end of
a document in the format.
The TITLE
rule is used optionally at the start of a document for
outputting a title. This rule is invoked with one argument; the title
of the document to be emitted. It must be called before any body text is
emitted; i.e.,
TITLE("The Passion Considered as a Mornington Crescent Game")
The AUTHOR
rule is used optionally at the start of a document
(after TITLE
but before BODY
) for emitting a formatted
author name; i.e.,
AUTHOR("J. R. Dobbs")
The AUTHOR_INST
rule is used optionally at the start of a document
(after TITLE
but before BODY
) for emitting a formatted
byline, including an author name and institution; i.e.,
AUTHOR_INST("M. Faustroll" "Institute de Pataphysique")
The SECTION
rule is used within the document body to emit an
automatically numbered section heading; i.e.,
page7: SECTION("THE REVELATION") "Just prior to the decade of the nineteen-sixties, ..."
The FOOTNOTE
rule is used within text to insert an automatically
numbered footnote. A footnote marker is inserted into the text where
the rule is invoked, and the text of the footnote is inserted further
down with any other footnotes. Example:
"Whereupon, by the law of negative reversal" FOOTNOTE("The LAW OF NEGATIVE REVERSAL states that ...")
The BODY
rule is invoked before any body text is emitted.
It causes any formatting code required for the body text to be emitted.
The BRK
rule is invoked to generate a line break. It emits
whatever corresponds to a line break in the output format.
The PBRK
rule is invoked to generate a paragraph break. It emits
whatever corresponds to a paragraph break in the output format.
The BOLD
rule causes its parameter to be set in bold, if this
is supported in the current output format; e.g.,
"This sentence is about " BOLD("these three words") ". "
The ITALIC
rule causes its parameter to be set in italics, if this
is supported in the current output format; e.g.,
"Half a bee, philosophically, must " ITALIC("ipso facto") "half not be."
At first, the task of writing a script to generate random text in any complex genre with enough flexibility to be interesting may seem daunting. Looking at the size of some complex scripts -- such as the Postmodernism Generator -- may not help to shake this sensation. But, as usual, there are practical techniques which help in this task.
One approach to the task of writing a script is the bottom-up approach. Basically, this means starting small and progressively enhancing and enlarging the script. For example, if you're writing a script to generate journal articles, start by writing a rule to generate sentences, and then go on to paragraphs, sections and finally the whole article. Whilst enhancing the script, add rules for generating titles, citations, footnotes and the like, as well as any new sentence forms, terms and other elements that you think of.
Another approach is to take existing material in the genre, write it in and generalise it. For example, if you are writing a script to generate travesties of technical papers, you may want to put in a sentence fragment such as "can be proven to be finite". If you do so, it may be an idea to think of other terms that may be substituted for "finite" in this context; for example, "NP-complete" or "valid" (or even "infinite"). This approach, if used carefully, can be very effective.
Finally, keep in mind that every new alternative you add to a rule decreases the probabilities of the other alternatives being selected. If you think that one alternative is not quite appropriate, it may be a good idea to delete it.
The following are devices which I have commonly employed in scripts. You may find them useful as well.
If you wish to either emit a string or not emit it, with a probability of 0.5 either way, the formation for this is:
[""|"your string goes here"]
While the Dada Engine currently has no facilities for interaction, there is a way to specify parameters at run time. This can be done using the C preprocessor. For example, if you wish to write a script that will use a person's name, specified by the user, in its output, you can do something like the following:
blame: "It's all " NAME "'s fault! " ;
NAME
is not specified as a rule, but is left undefined; when the
script is run, the user defines it on the command line, using a form like:
dada -DNAME=\"Fred\"
If no name is defined on the command-line, the script fails with an error; to bypass this, you can supply a default name beforehand, using preprocessor directives, as follows:
#ifndef NAME #define NAME "Bob" #endif
The use of the preprocessor is not limited to substituting strings into the output; conditional defines may be used to selectively enable and disable parts of rules.
factoid: "they are watching us" | "things go better with Coke" | "the woodpecker never bashes his brains out" | ["love"|"greed"|"angular momentum"] " makes the world go round" #ifdef UNIX | "motd is not a daemon but a text file" #endif ;
In this example (excerpted from an actual script), the user can specify beforehand whether or not the output should be able to contain references to UNIX(R) concepts; thus the output may be tailored to a non-technical audience.
For more information, see the documentation on cpp
, the C
preprocessor.
There may be times when you want a script to make a choice for a rule
only once every time it is run, and to use that choice every time the
rule is subsequently invoked. This can be done using variables.
For example, the following pair of rules selects a value for the rule
name
the first time it is invoked, and returns the name previously
selected at each subsequent invocation:
name: v-name<<new-name ; new-name: "Alice" | "Bob" | "Carol" ;
Here, the rule name
conditionally defines the variable
v-name
; if v-name
is undefined, the rule new-name
is invoked and its result is assigned to v-name
and returned;
otherwise the current value of v-name
is returned as the result
of name
without invoking new-name
.
1.0: Released 19 Feb 1996. Error line/filename tracking now works properly with files containing included files. Things which didn't work on Solaris 2.0 have been (mostly) fixed. The "BRK" rule has been added to the format library. A lot has been added to the documentation.
1.0b1: Released 5 Jan 1996. The first limited release of the Dada Engine.
pb 0.91: Released in early 1995. A much earlier, and more primitive, incarnation of what became the Dada Engine.
I am grateful to everybody who helped with the development and testing of this package, in particular to Mitchell Porter, Jamie Cameron, Brandon Long and kristin buxton.
The inspiration for what was to become the Dada Engine came to me in 1994 when I read Godel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter; in particular, when I read the section titled "A Little Turing Test", on page 621.
The initial application of the Dada Engine, the Postmodernism Generator, is included with this distribution, and may also be accessed through the World Wide Web at the following URL:
http://www.cs.monash.edu.au/cgi-bin/postmodern
.
Currently, the pb language is a bit inelegant; it contains a number of different features, hacked on when they were needed. At some stage in the future, I intend to rewrite the system, replacing the language with a similar, but more elegant, language. This is likely to be a Turing-complete functional language with a format similar to the pb language. I would aim to include some sort of ML-like pattern-matching capability in this language, as well as a number of data types, such as sets and lists, which would be useful for generating certain types of text.
In the shorter term, the current Dada Engine can be further enhanced; one feature which would be useful and which I intend to add is operators for changing the probabilities of options being selected. Also, a system for compiling Dada Engine scripts into stand-alone programs in a language such as C or PostScript is an option which I may explore.
This document was generated on 17 April 1996 using the texi2html translator version 1.49.