CS 130 Lab #4: Representation and manipulation of textual data

Goals

In this lab we will explore how text is represented, or encoded, in terms of binary codes (which can then be written more conveniently using hexadecimal digits). We will look at a historical progression of increasingly sophisticated standards for encoding characters, do some conversions by hand, and then use a spreadsheet to implement some interesting conversions as "programs".

Foundations: bits, bytes and codes

A byte is a "chunk" of binary data consisting of some number of bits taken as a whole, typically 8 bits. For example:
1 0 1 0 1 1 1 0
An 8-bit byte can be represented as above, or as 2 digits in hexadecimal (why?). For example, the byte above could also be written as:
AE
Of course, the byte can also be represented as a decimal value when considered as a number:
AE (hex) = (10 x 16) + (14 x 1) = 174 (dec)
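The same conversions can be checked with a few lines of Python. This is only a minimal sketch for verifying hand calculations; the variable names are illustrative and not part of the lab.

```python
# The example byte from the text, written as a string of bits.
bits = "10101110"

# Interpret the bit string as a base-2 number.
value = int(bits, 2)

# Write the same value as two upper-case hexadecimal digits.
hex_digits = format(value, "02X")

# Redo the positional arithmetic from the text: A is 10, E is 14.
by_hand = (10 * 16) + (14 * 1)

print(hex_digits)  # AE
print(value)       # 174
print(by_hand)     # 174
```

Going the other way, `int("AE", 16)` returns the decimal value and `format(174, "08b")` returns the 8-bit pattern.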
Characters are symbols that we use for writing. In order to be stored, processed and displayed by computers, characters are represented or encoded into binary form (sequences of 1s and 0s). These representations are usually made in terms of a fixed number of bits, often 8 bits: in other words, a byte is often used to represent a single character.

Historically, however, there has been a lot of "fuss" over just how this is done: older encodings used 5 bits (see the Baudot code below). A modern world-wide standard originating in the USA (the ASCII code, pronounced "ask-ee") uses 7 bits, but was often "rounded up" to 8 bits to fit a typical byte more exactly. Extended ASCII codes were later introduced and (more or less) standardized to allow a wider variety of Western (European) languages to be encoded on computers. The modern approach to these issues is embodied in the Unicode standard, which can use even more bits (in some forms) to encode nearly all the symbols used for writing in any language in the world (and some not even of this world, in some sense).
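To make the difference between these standards concrete, here is a small Python sketch. It assumes UTF-8 as the Unicode encoding form (one of several), and the helper names `seven_bit` and `utf8_hex` are made up for this illustration.

```python
def seven_bit(ch: str) -> str:
    """Return the 7-bit ASCII pattern for ch; raise ValueError outside ASCII."""
    code = ord(ch)  # numeric code assigned to the character
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not a 7-bit ASCII character")
    return format(code, "07b")


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hexadecimal pairs."""
    return " ".join(format(b, "02X") for b in ch.encode("utf-8"))


e_acute = "\u00e9"  # e with acute accent, not available in 7-bit ASCII
euro = "\u20ac"     # euro sign

print(seven_bit("A"))           # 1000001
print(seven_bit("A").zfill(8))  # 01000001  ("rounded up" to a full byte)
print(utf8_hex("A"))            # 41
print(utf8_hex(e_acute))        # C3 A9
print(utf8_hex(euro))           # E2 82 AC
```

A plain ASCII letter still fits in one byte under UTF-8, while the accented letter takes two bytes and the euro sign takes three.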

Some background reading

Read the history of character codes listed in the first item below, toward the goal of answering the hand-written exercises in the following section. You may find some of the other links in this list useful for various parts of the exercises.

Hand-written exercises (for demo!)

Determine how your first and last name would be written out using:

Finally, what is the Unicode representation of the biohazard symbol? (You may want to use a search engine to find this information.)

Spreadsheet exercises (for demo!)

For this section, you should develop a spreadsheet "program" in Excel which will convert your name (or any characters entered) into several different forms. See if you can make your spreadsheet look like this example:

The input string (Name) appears in the upper left; successive rows of the sheet then display: