algorithm: general method of implementing function (getting result from argument)
program: concrete implementation of algorithm in some language
partial functions (may be undefined or loop forever) versus total functions (always returns a result)
Symbols, sets and products
"atomic" symbols
symbols in isolation are not useful, but all come with a cultural context
semiotics distinguishes between three kinds or aspects of signs
(sign is a technical term for a "unit" of communication, something which can convey meaning in context)
indices: the sign is tied in a causal way to meaning (e.g., smoke "means" fire, howling "means" a wolf)
icons: shape or form is similar to meaning (perhaps the sign is an abstraction of apearance)
symbols: an arbitrary (but consistently recognizable) form which gains meaning by cultural/contextual agreement
notion of "atomicity" is relative: names often have structure, but are considered unitary for many purposes
in mathematics, abstract atomic "individuals" have meaning only through their relation to other individuals
sets and enumerated types
sets of symbols are usually called alphabets
in mathematics, finite sets without any other structure are significant only in their cardinality (number)
in semantic terms, sets of symbols are often important for their consistent relation to (suitably related) meanings
Issues re sets, types, domains and ranges
thinking of meaning as a function, there is always a question of what the proper domain and range are (Is "1012" an appropriate binary numeral? What day of the week is January 38th? etc.)
in computer science, we try to use (static or dynamic) types as constraints to avoid "meaningless" terms
static typing rejects meaningless terms upon construction, as not well-formed (by some rules of syntax)
dynamic typing checks for semantic sensibility during the resolution of meaning (i.e., during evaluation)
ideally, we have a nice crisp domain (every element is meaningful) in one-one corresp. with meanings
but in practice, we often draw meaningful terms from a superset of representations/forms (in the worst case, arbitrary strings over some über-alphabet)
and in practice, there may be meanings we cannot signify directly (e.g., uncomputable functions)
Haskell's (and other languages) enumerated types provide isolated (crisp) symbolic domains
an ordering on set elements is often useful (think of deriving Ord class for enumerated types)
pairs and products
pairs can be taken as fundamental (combinations) or constructed (e.g., in set theory) in terms of their order
pairs are necessarily represented syntactically in terms of order (one first, the other second)
the ordering of parts of a pair and ordering of underlying elements yield a derived order
lexicographic order: compare first parts, and only if they are equal, second parts
we can also enumerate (count) pairs in terms of sum-and-product of the enumeration of their parts (using e.g. the sumProd function, whose "base" argument represents the size of the less significant pair-part)
conversion the other way (representation) can be done by using division-and-remainder (divMod)
Numbers and Numerals
basis for number is in quantity (cardinal) and counting (ordinal)
in the abstract, we equate things "in number" when there exists a one-one correspondence between them
we can represent (abstract) numbers pictorially by means of collections (a la dice) or the number line
the number line has the advantage of suggesting the whole set of (e.g.) natural numbers at "once"
… but it also provides a sense of their fundamental logical structure (via zero and successor)
Peano-style naturals (built from zero and succ) are a logico-linguistic "take" on the number line
(numbers are constructed as well-formed terms from a constant and a function/constructor)
with each numeral we can associate the corresponding iteration (application of a function several times to a constant)
Peano-style naturals constitute a kind of "unary notation" (but it's not a proper special case of radix/base notation)
efficiency concerns (especially of representation, but also calculation) motivate positional notation
positional notation is essentially an extension of lexicographic order, with multiple places
if the symbol set (digits) are uniform across places, we can extend the notation indefinitely (otherwise we are restricted to representing numbers as large as the product of digit sets per place)
"evaluation" of (unfiorm base) numerals can be construed in terms of powers
sum of (for each digit place: product of each digit value with base to power (of that place))
but evaluation can also be construed in terms of Horner's rule, building up as we go
from the left: add each new digit value to total, then mutliply by total by base as we proceed
the latter technique is easily captured in terms of a fold (using sumProd)
conversion the other way (representation) can be done by using division-and-remainder (divMod)
this can be written in Haskell as an unfold, using divMod)
Formal languages: structure and complexity
Formal languages as sets of strings
abstract notion of symbols (must support equality, i.e., be identifiable and distinguishable)
strings are just sequences (of arbitrary but finite length) of symbols, over some alphabet
main operation on strings: concatenation
a (formal, string) language is just a set of strings
i.e., a sub-set of the whole set of strings over the (implicit) alphabet
but also a property (and thus a boolean function) over the whole set of strings
Finite automata
finite automata are formal models of simple machines with states (nodes) and transtitions (moves, arrows)
FAs come in two flavors, deterministic (DFA) and non-deterministic (NFA), but they are inter-convertible
a DFA is a 5-tuple, <Q,∑,d,q0,F> (states, alphabet, transition function, initial state, final states); d takes a state and a symbol and returns a new state (dŒ Q x ∑ Æ Q)
the DFA starts in the initial state, moves from state to state based on transition function; if it ends in a final state (when the string is consumed), we say the string is accepted
associated with each DFA is thus a set of strings (the ones it accepts), called the language of the DFA
Regular expressions
a regular expression (RE) has one of several forms: ∆, e, a, R | S, R·S or R* [the · is often suppressed] (here a is a symbol from some implicit alphabet, and R and S are regular expressions)
the meaning of an RE is a set of strings:
[∆] = {} (the empty set)
[e] = {""} (the singleton set with an empty string)
[a] = {"a"} (the singleton set with the one-length string containing just 'a')
[R | S] = R » S (union of sets of strings from R and S)
[R·S] = {xy | x Œ R, y Œ S} (concatenations of strings from R and S, one each, in that order)
[R*] = » R^k, k≥0 (set of concatenated sequences of members from R, of any length)
associated with each RE is thus a set of strings, called the language of the RE
Regular languages (in the abstract)
REs and DFAs are inter-convertible, in a way which preserves their languages
both REs and DFAs can define only fairly simple languages: linear repetition, no nesting/coordination
the "canonical" example of a language which they cannot define is {a^n b^n | n≥0}
the existence of these two (and other) equivalent representations for the same "class" of languages justifies our identifying it as a "real" entity, the regular languages
Context-free languages
a context-free grammar describes the structure of a language (recursively) in terms of "nested sub-phrases"
canonical example (arithmetic expressions): E ::= N | E + E | E x E | ( E )
numerals as a CFG: N ::= D | DN [but no nesting, so also possible as an RE: D·D*]
formally, a context-free grammar (CFG) is a 4-tuple <S,V,T,R> (start symbol, variables, terminals, rules)
S is a special member of V, V is a set of symbols (not appearing in actual strings), T is an alphabet of symbols (these do appear in strings) and each rule in R is a pair from V x (V»T)*
a derivation step takes a string of symbols (V»T) and replaces some occurrence of a variable by the corresponding right-hand side of a rule in R (example: 2 + E * E => 2 + E * ( E ) )
a string (of terminals) is generated by the grammar if there is a multi-step derivation starting at symbol S and proceeding by steps (according to the rules) until we reach that string of terminals
example: E => E + E => N + E => 2 + E => 2 + 5
for each derivation there is a naturally corresponding tree
The Chomsky hierarchy
different mechanisms/formalisms are available for describing languages (grammars, machines, etc.)
these form a hierarchy of lesser to greater descriptive power (simple distinctions versus complex ones)
regular:: essentially linear repetition structure (DFAs, NFAs, reg. exprs)
context-free: tree-like structure, without contextual restrictions (CF grammars)
...
general/computable: arbitrary expressible complexity (the full power of any computing formalism)
uncomputable: by cardinality arguments (below), not all functions can be represented concretely
Computability issues
since the cardinality of the set of (arbitrary-length) strings over a non-trivial alphabet is countable, the cardinality of sub-sets (languages) over that set is uncountable (power-set rule, diagonalization)
since descriptions of languages are finite strings themselves, there are only countably many of them
therefore "most" languages (uncountably many) are not expressible or describable at all
programs (as finite representations of functions) suffer similarly: most functions (on countable sets) are not computable, since there are uncountably many such functions
Algebraic term languages
Strings versus terms in formal languages
traditional formal language theory (Chomsky-style) concentrates on the complexity of sets of strings
we want to focus instead on the interpretation (meaning) of terms (trees derived from strings)
Compositional meaning of terms
we will consider evaluation or meaning of terms as functions from terms to various domains
we distinguish standard interpretations (e.g., numbers for arith. exprs) from non-standard ones (strings, "polarities")
meaning is compositional: meanings of terms will be defined relative to meanings of sub-terms (i.e., meaning will be a recursive function over sub-terms)
Term languages developed in class
we explore several different language domains, a range of increasingly-complex language features, variations in meanings given to them, and techniques for representation of terms and meanings in Haskell
Dimensions of domains and language features
simple numeral structure (radix-based); a lab exercise
arithmetic expressions with:
literals and binary operators
variables-as-unknowns (single or multiple)
scoped variables (where definitions) [not on the exam]
imperative variables [not on the exam]
boolean expressions (literals, binary operators and variables-as-unknowns)
regular expressions (sets of strings)
Dimension of variation in “meanings”
standard meanings as "atomic" domain (e.g., integers or booleans) or structured (sets of strings, below)
single variables (unknowns) as polynomials or as (meta-language, Haskell) functions on integers
multiple variables via functions on tuples or environments
environment = function from variables to values; passed in to each sub-term
scoped variables via syntactic term substitution or via evaluation and environments
regular expression semantics
as sets of strings (really lists of strings; but we must order them to avoid infinite searches)
as predicates (i.e., boolean-valued functions)
imperative variables via state transformations (to model changes over time) [not on the exam]
state is also a function from variables to values; passed in to and out of each sub-term
imperative variables via compilation to abstract machine code (lists of instructions) [not on the exam]
Non-standard meanings or interpretations
re-representation of terms as strings (prefix, infix, postfix forms; other variations possible)
infix "x + 2 * 3 - y" versus postfix "x 2 3 * y - +"
calculation traces (lists of terms showing successive evaluation; like a calculator tape)
polarities (lab exercise): an abstract interpretation giving just the numeric signs of terms
Dimension of Haskell term representations
general issues [see also hand-out]
"sloppiness": do we allow invalid forms relative to the language, e.g., negative numbers
dependence on meta-language ("pure" symbolic representations are inefficient but less biased)
flexibility: can we get from the representation form to various aspects we might want
for numerals/numbers
strings OR internal Haskell integers OR recursively-defined Naturals
for operators
strings OR abstract symbols OR functions (note: inflexible, functions don't support equality, printing)
for term data type
use separate constructors for each operator
Term = Lit Int | Plus Term Term | Mult Term Term | ...
OR, use a carrier type for operators (but bake-in references to external types)
Term = Lit Int | Bop Opr Term Term
Opr = Plus | Mult | ...
OR, use type parameters of the data type to allow other choices for carriers
Term a b = Lit a | Bop b Term Term
ArithExpr = Term Int Opr
BoolExpr = Term Bool LOpr ; LOpr = And | Or | ...
the last variation (parameterized) gives the most flexibility
Haskell as a meta-language
knowledge of Haskell will not be tested as such on the exam, but we will use it as a meta-language
Basic concepts
interactive interpreter relative to external file (script) of definitions
familiar algebraic syntax, but with fewer parentheses (e.g., "f x" rather than "f(x)")
usual integers, floats, booleans, plus arbitrary-precision integers, rational numbers, pairs
several simple notations for lists (represented internally using linked lists)