CS 130 Lab #4: Representation and manipulation of textual data
Goals
In this lab we will explore how text is represented or encoded in terms of binary
codes (which can then be more conveniently written using hexadecimal digits).
We will look at a historical progression of more sophisticated standards for
encoding characters, do some conversions by hand, and then use a spreadsheet to
implement some interesting conversions as "programs".
Foundations: bits, bytes and codes
A byte is a "chunk" of binary data consisting of some number of bits taken as
a whole, typically 8 bits. For example:
1 0 1 0 1 1 1 0
An 8-bit byte can be represented as above, or as 2 digits in hexadecimal (why?).
For example, the byte above could also be written as:
AE
Of course, the byte can also be represented as a decimal value when considered
as a number:
AEhex = (10 x 16) + (14 x 1) = 174dec
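The arithmetic above can be checked with a few lines of code. The following is a minimal sketch in Python (not part of the lab itself; the variable names are only illustrative) that converts the example byte between its binary, decimal and hexadecimal forms:

```python
# The example byte from the text, written as a string of bits.
bits = "10101110"

# Interpret the bit string as a base-2 number.
value = int(bits, 2)

# Write the same number with two upper-case hexadecimal digits.
hex_digits = format(value, "02X")

# Convert the hexadecimal form back to decimal as a cross-check.
round_trip = int(hex_digits, 16)

print(value)       # 174
print(hex_digits)  # AE
print(round_trip)  # 174
```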
Characters are symbols that we use for writing. In order to be stored,
processed and displayed by computers, characters are represented or encoded
into binary form (sequences of 1s and 0s). These representations are usually
made in terms of a fixed number of bits, often 8 bits: in other words, a byte is
often used to represent a single character.
Historically, however, there has been a lot of "fuss" over just how this is
done: older encodings used 5 bits (see the Baudot code below). A modern
world-wide standard originating in the USA (the ASCII code, pronounced "ask-ee")
uses 7 bits, but was often "rounded up" to 8 bits to fit a typical byte more
exactly. Extended ASCII codes were later introduced and (more or less)
standardized to allow a wider variety of Western (European) languages to be
encoded on computers. The modern approach to these issues is embodied in the
Unicode standard, which can use even more bits (in some forms) to encode nearly
all the symbols used for writing in any language in the world (and some not
even of this world, in some sense).
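To make the difference between these standards concrete, here is a small Python sketch. It assumes ISO 8859-1 (Latin-1) as one example of an extended ASCII code and UTF-8 as one form of Unicode; both choices are illustrative and are not prescribed by the lab.

```python
# A plain ASCII letter: its code fits in 7 bits, and is often
# padded with a leading zero to fill an 8-bit byte.
letter = "A"
code = ord(letter)
seven_bits = format(code, "07b")
eight_bits = format(code, "08b")
print(code, seven_bits, eight_bits)  # 65 1000001 01000001

# An accented letter is outside 7-bit ASCII.
e_acute = "\u00e9"  # the character "é"

# In Latin-1 (an assumed example of extended ASCII) it takes one byte...
latin1_bytes = e_acute.encode("latin-1")
# ...while in UTF-8 (one form of Unicode) it takes two bytes.
utf8_bytes = e_acute.encode("utf-8")

print(len(latin1_bytes), len(utf8_bytes))  # 1 2
```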
Some background reading
Read the history of character codes listed in the first item below, toward the
goal of answering the hand-written exercises in the following section. You
may find some of the other links in this list useful for various parts of
the exercises.
Hand-written exercises (for demo!)
Determine how your first and last name would be written out using:
- Morse code: write the dots and dashes out on paper.
- Baudot code: write out your name, and show the corresponding codes in
binary and decimal, using one decimal number per 5 bits of Baudot code.
- ASCII code: write out your name, and show the corresponding codes in
binary and decimal, using one decimal number per 7 bits of ASCII code.
In addition, you should show the codes for your name using hexadecimal
digits, i.e., 2 hex digits per 7 bits (add a leading zero).
Finally, what is the Unicode representation of the biohazard symbol? (You may
want to use a search engine to find this information.)
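Once the ASCII part of the exercise has been done by hand, the results can be checked with a short script. This is only a sketch: the name "Ada" is a placeholder, and the helper name `ascii_rows` is hypothetical. For each character it lists the 7-bit binary code, the decimal code, and the two-digit hexadecimal code (with a leading zero where needed).

```python
def ascii_rows(name):
    """Return (character, 7-bit binary, decimal, 2-digit hex) for each character."""
    rows = []
    for ch in name:
        code = ord(ch)  # decimal ASCII code of the character
        rows.append((ch, format(code, "07b"), code, format(code, "02X")))
    return rows

# "Ada" is a placeholder; substitute your own name.
for row in ascii_rows("Ada"):
    print(*row)
# A 1000001 65 41
# d 1100100 100 64
# a 1100001 97 61
```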
Spreadsheet exercises (for demo!)
For this section, you should develop a spreadsheet "program" in Excel which
will convert your name (or any characters entered) into several different forms.
See if you can make your spreadsheet look like this example:
The input string (Name) appears in the upper left; successive rows of the
sheet then display:
- individual characters in cells (use the MID and COLUMN functions;
MID allows you to select out characters from a piece of text,
COLUMN allows you to know which column the current cell is in,
numerically);
- the decimal codes for these characters in ASCII (use the CODE function);
- the hexadecimal pairs corresponding to the above decimal codes (use
the DEC2HEX function);
- the hexadecimal codes, but with the two digits of each pair reversed in
order (use the CONCATENATE function and the MID function again;
CONCATENATE will allow you to put individual characters together into a
longer string);
- the decimal codes for these reversed hex pairs (using HEX2DEC);
- the characters represented by these codes (using the CHAR function);
- (OPTIONAL!) the whole string of converted characters (called "Funny Name"
in the picture above).
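The sequence of rows described above can also be sketched outside the spreadsheet, which may help in checking your results. The Python below is an illustration under two assumptions: that "reversed" means swapping the two hex digits within each pair, and that every character has a code below 256 (so that it has exactly two hex digits). The function name `funny_name` and the sample name "Bev" are hypothetical.

```python
def funny_name(name):
    """Mirror the spreadsheet rows for a string of one-byte characters."""
    chars = list(name)                               # like MID + COLUMN
    codes = [ord(c) for c in chars]                  # like CODE
    hex_pairs = [format(n, "02X") for n in codes]    # like DEC2HEX
    swapped = [p[1] + p[0] for p in hex_pairs]       # like CONCATENATE + MID
    new_codes = [int(p, 16) for p in swapped]        # like HEX2DEC
    new_chars = [chr(n) for n in new_codes]          # like CHAR
    return codes, hex_pairs, swapped, new_codes, "".join(new_chars)

codes, hex_pairs, swapped, new_codes, result = funny_name("Bev")
print(codes)      # [66, 101, 118]
print(hex_pairs)  # ['42', '65', '76']
print(swapped)    # ['24', '56', '67']
print(new_codes)  # [36, 86, 103]
print(result)     # $Vg
```

Because swapping the two digits twice restores the original pair, applying the same conversion to the "Funny Name" gives back the original name. Note that for many letters the swapped code is a non-printing control character, so the converted string may not display cleanly.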