For example, most files will have some sort of code or "tag" at the beginning which names or otherwise signals the overall file format. For graphic files, this tag is usually followed by information about the dimensions of the graphic: height and width in pixels and "depth" in bits. What order this information comes in, and how many bits or bytes are used, is specified in the file format signaled by the tag. Following the dimensional information, we might see "raw" pixel information (also called "raster data") and, in some cases, a color table. Note that the proper interpretation of the pixel information will depend on the dimensions and, if present, the color table.
In more complex file formats, these kinds of dependencies can make it quite difficult to "read" the file directly, as layers of context build up. In addition, most file formats use direct binary representations for the ultimate atomic data (e.g., color information). Most editor programs are configured to display text, assuming something like ASCII code is being used to represent the characters. In the case of so-called binary files (more accurately, files whose proper interpretation does not involve an ASCII-like character code), we might want the editor to display the data in its most user-accessible form, such as colors, numbers, etc. However, in many cases (such as today's lab) we want to focus specifically on how binary representations are being used, so we want to view the data not as text, and not directly as colors, etc., but in terms of the "raw" binary codes. For practical purposes, this means that we will look at the codes in hexadecimal form, since it preserves the structure of the underlying binary, but makes for shorter and more easily distinguishable patterns.
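To get a feel for what a hex view shows before opening a real file, here is a minimal Python sketch. The function name hex_view and the six sample bytes are made up for illustration, and the layout (offset, hex pairs, then printable characters) is only an approximation of what a typical hex editor displays.

```python
def hex_view(data, width=16):
    """Return one text line per `width` bytes: offset, hex pairs, printable ASCII."""
    pad = width * 3 - 1  # room for `width` hex pairs separated by spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Each byte becomes exactly two hex digits.
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Printable ASCII is shown as-is; every other byte is shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:06X}  {hex_part:<{pad}}  {text_part}")
    return lines


# Six made-up bytes: two that look like text and four that do not.
sample = b"BM\x00\xff\x1a#"
for line in hex_view(sample):
    print(line)
```

The same six bytes appear twice on the printed line: once as hex pairs (42 4D 00 FF 1A 23) and once as characters (BM...#), which shows how little a text view reveals about non-text data.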
In this lab we will look "inside" the binary details of a few different file formats to see how data of various kinds are represented. You will need to copy all of the sample data files from the following folder onto your hard drive, and also to locate and run the "hex editor" program we will use, called XVI32.
cs130/docs
The hex editor we will use is called XVI32: it can be found on the lab computers in the "My Computer" section, "Local Disk (C:)", in the "Program Files" folder. (You may also download a copy of this free program for your own computer from the author's website, presuming that you use Windows. Mac users can find similar programs: ask me if you need help locating one.)
Once you launch the XVI32 program, you will be able to open files and view and edit them directly in terms of the hexadecimal codes which comprise them. You should always work on a copy of a file if you wish to keep the original, since the changes you make at this level can corrupt the file so that it is no longer valid or recognizable by its "native" programs (Word, Excel, MS Paint, etc.). For lab purposes, of course, fresh copies of the files will always be available on the website.
On the hex side, each byte in the file (8 bits) is represented by two successive hex digits (e.g., 1A or 23). Remember that even if the digits look like decimal, they are probably in hex format, so that a value of 23 is really a decimal value of (2 x 16) + (3 x 1) = 35.
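The conversion above can be checked with a short Python sketch. The function name hex_pair_to_decimal is hypothetical; it simply spells out the "first digit times 16, plus second digit" rule and compares the result with Python's built-in int(text, 16).

```python
def hex_pair_to_decimal(pair):
    """Convert a two-digit hex string such as '23' into its decimal value."""
    high, low = pair[0], pair[1]
    # The first digit counts sixteens, the second digit counts ones.
    return int(high, 16) * 16 + int(low, 16)


print(hex_pair_to_decimal("23"))  # (2 x 16) + (3 x 1) = 35
print(hex_pair_to_decimal("1A"))  # (1 x 16) + (10 x 1) = 26
print(int("23", 16))              # the built-in conversion gives the same 35
```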
Be able to answer these questions for your demo:
Read some of this page and see if you can make sense of it (this is an exercise in "technical reading", versus "technical writing"). Don't worry if you have trouble understanding it: the point of trying is to get a feel for the style of these kinds of documents, not to understand everything at first.
Now copy over the helloSmall.bmp file from the docs folder and open it with the MS Paint program (use "Open With ..." on the right mouse button if necessary).
You can read about the BMP file format and the layout of the data in the on-line reference, but this chart should help to visualize it:
(Note: the odd, mottled-looking color in some parts of this picture is due to limitations in the color model used in GIF files.)
The picture shows how the header, color table and pixel data areas are laid out. Note especially that the color table is broken up into 4-byte "chunks" (each byte being represented by two hex digits) and that the pixel data is broken up into 2-byte chunks, each of which is a numeric index into the color table. Finally, note that the numeric indices of the raster or pixel data are not addresses or locations of data within the file itself, but rather count how many 4-byte chunks from the start of the color table a certain color is.
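As a rough illustration of how the header fields can be pulled out of the bytes, here is a Python sketch. It assumes the common BMP layout with a 14-byte file header followed by a 40-byte information header; the sample files in this lab may differ, so check the on-line reference before relying on these offsets. The names read_bmp_header and make_fake_header, and all of the values in the fake header, are invented for the example.

```python
import struct


def read_bmp_header(data):
    """Pick the tag, dimensions and depth out of a BMP header (assumed layout)."""
    tag = data[0:2]
    if tag != b"BM":
        raise ValueError("not a BMP file: tag is %r" % tag)
    # "<" means little-endian: the low-order byte comes first in the file.
    pixel_offset, info_size = struct.unpack_from("<II", data, 10)
    width, height = struct.unpack_from("<ii", data, 18)
    (depth,) = struct.unpack_from("<H", data, 28)
    return {
        "width": width,
        "height": height,
        "depth": depth,
        "color_table_start": 14 + info_size,  # the table follows both headers
        "pixel_data_start": pixel_offset,
    }


def make_fake_header(width, height, depth, colors):
    """Build a made-up 54-byte header so the sketch runs without a real file."""
    pixel_offset = 54 + 4 * colors  # headers, then 4 bytes per color entry
    file_header = struct.pack("<2sIHHI", b"BM", 0, 0, 0, pixel_offset)
    info_header = struct.pack("<IiiHHIIiiII",
                              40, width, height, 1, depth, 0, 0, 0, 0, colors, 0)
    return file_header + info_header


fake = make_fake_header(width=4, height=2, depth=8, colors=2)
info = read_bmp_header(fake)
for name, value in info.items():
    print(f"{name}: {value} (hex {value:X})")
```

Under these assumptions the color table starts at decimal 54, which is hex 36, but you should confirm the actual position in the lab files by looking at them in the hex editor.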
We need to know two other things: first, the position within the file of the beginning of the color table and, second, the size of the "chunks" which are counted by the index used in the pixel data.
The position of the beginning of the color table can be determined either by reading the documentation about BMP files closely, or by looking for the right patterns in the hex values. It turns out that the first color in the table is white, represented by hex FF FF FF 00 (the highest amount possible (FF) of each of red, green and blue, plus the always-zero trailing byte).
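The pattern-hunting approach can be sketched in Python as a simple search for the four bytes FF FF FF 00. The function name find_color_table_start and the fake file contents below are made up for illustration. Keep in mind that the same four bytes could also turn up somewhere else by coincidence, so a match is a strong hint rather than proof.

```python
# The four bytes of the white entry described above.
WHITE_ENTRY = bytes.fromhex("FF FF FF 00")


def find_color_table_start(data, search_from=0):
    """Return the position of the first FF FF FF 00 pattern, or -1 if absent."""
    return data.find(WHITE_ENTRY, search_from)


# A made-up file: 54 header bytes, then a two-color table, then pixel data.
fake_header = b"BM" + bytes(52)
fake_table = bytes.fromhex("FF FF FF 00") + bytes.fromhex("00 00 00 00")
fake_pixels = bytes.fromhex("00 01 01 00")
fake_file = fake_header + fake_table + fake_pixels

start = find_color_table_start(fake_file)
print(f"pattern found at decimal {start}, hex {start:X}")
```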
From the picture and description above, we know that the colors in the table are broken up into 4-byte chunks. Therefore, if a color is at an index i, for example referred to in the raster/pixel data, we can determine its address or location a in the file itself as:
a = s + 4 x i
where s is the start of the color table.
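The formula is easy to try out in Python. In this sketch the function name color_address is hypothetical, and the start value s = hex 36 is only an assumed example; substitute whatever start position you find in the actual file.

```python
ENTRY_SIZE = 4  # each color in the table occupies a 4-byte chunk


def color_address(table_start, index):
    """Apply a = s + 4 x i to get the file position of color number `index`."""
    return table_start + ENTRY_SIZE * index


s = 0x36  # assumed start of the color table, for illustration only
for i in (0x00, 0x01, 0x0A):
    a = color_address(s, i)
    print(f"index {i:02X} -> address {a:04X}")
```

Writing the numbers with a 0x prefix and printing them with the X format keeps everything in hex, so the results can be compared directly with what the hex editor shows.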
This information can also be used to determine the index from the address, using a little algebra as follows:
a = s + 4 x i
a - s = 4 x i
(a - s) / 4 = i
(here we have just subtracted s from both sides and then divided both sides by 4).
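Going in the other direction can be sketched the same way. The function name color_index and the start value hex 36 are again assumptions made for the example. The sketch also checks that the address really falls on the boundary of a 4-byte chunk, since an address in the middle of a chunk does not correspond to any index.

```python
def color_index(table_start, address):
    """Apply i = (a - s) / 4 to get the color index stored at `address`."""
    offset = address - table_start
    if offset < 0 or offset % 4 != 0:
        raise ValueError("address is not the first byte of a color table entry")
    return offset // 4  # whole-number division, since the offset divides evenly


s = 0x36  # assumed start of the color table, for illustration only
a = 0x5E
i = color_index(s, a)
print(f"address {a:04X} -> index {i:02X}")
# Plugging the index back into a = s + 4 x i should return the same address.
print(f"check: {s + 4 * i:04X}")
```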
Remember, however, that all these numbers are likely to be expressed in hexadecimal inside the file: you should therefore either convert from hex to decimal and back again, if you are calculating by hand, or do the arithmetic using a hexadecimal calculator (the Windows utility calculator can be used in a hex mode).