Выбрать главу

What's Next

Obviously, you could do a lot more with this code. To turn it into a real spam-filtering application, you'd need to find a way to integrate it into your normal e-mail infrastructure. One approach that would make it easy to integrate with almost any e-mail client is to write a bit of code to act as a POP3 proxy—that's the protocol most e-mail clients use to fetch mail from mail servers. Such a proxy would fetch mail from your real POP3 server and serve it to your mail client after either tagging spam with a header that your e-mail client's filters can easily recognize or simply putting it aside. Of course, you'd also need a way to communicate with the filter about misclassifications—as long as you're setting it up as a server, you could also provide a Web interface. I'll talk about how to write Web interfaces in Chapter 26, and you'll build one, for a different application, in Chapter 29.

Or you might want to work on improving the basic classification—a likely place to start is to make extract-features more sophisticated. In particular, you could make the tokenizer smarter about the internal structure of e-mail—you could extract different kinds of features for words appearing in the body versus the message headers. And you could decode various kinds of message encoding such as base 64 and quoted printable since spammers often try to obfuscate their message with those encodings.

But I'll leave those improvements to you. Now you're ready to head down the path of building a streaming MP3 server, starting by writing a general-purpose library for parsing binary files.

24. Practicaclass="underline" Parsing Binary Files

In this chapter I'll show you how to build a library that you can use to write code for reading and writing binary files. You'll use this library in Chapter 25 to write a parser for ID3 tags, the mechanism used to store metadata such as artist and album names in MP3 files. This library is also an example of how to use macros to extend the language with new constructs, turning it into a special-purpose language for solving a particular problem, in this case reading and writing binary data. Because you'll develop the library a bit at a time, including several partial versions, it may seem you're writing a lot of code. But when all is said and done, the whole library is fewer than 150 lines of code, and the longest macro is only 20 lines long.

Binary Files

At a sufficiently low level of abstraction, all files are "binary" in the sense that they just contain a bunch of numbers encoded in binary form. However, it's customary to distinguish between text files, where all the numbers can be interpreted as characters representing human-readable text, and binary files, which contain data that, if interpreted as characters, yields nonprintable characters.[260]

Binary file formats are usually designed to be both compact and efficient to parse—that's their main advantage over text-based formats. To meet both those criteria, they're usually composed of on-disk structures that are easily mapped to data structures that a program might use to represent the same data in memory.[261]

The library will give you an easy way to define the mapping between the on-disk structures defined by a binary file format and in-memory Lisp objects. Using the library, it should be easy to write a program that can read a binary file, translating it into Lisp objects that you can manipulate, and then write back out to another properly formatted binary file.

Binary Format Basics

The starting point for reading and writing binary files is to open the file for reading or writing individual bytes. As I discussed in Chapter 14, both OPEN and WITH-OPEN-FILE accept a keyword argument, :element-type, that controls the basic unit of transfer for the stream. When you're dealing with binary files, you'll specify (unsigned-byte 8). An input stream opened with such an :element-type will return an integer between 0 and 255 each time it's passed to READ-BYTE. Conversely, you can write bytes to an (unsigned-byte 8) output stream by passing numbers between 0 and 255 to WRITE-BYTE.

Above the level of individual bytes, most binary formats use a smallish number of primitive data types—numbers encoded in various ways, textual strings, bit fields, and so on—which are then composed into more complex structures. So your first task is to define a framework for writing code to read and write the primitive data types used by a given binary format.

To take a simple example, suppose you're dealing with a binary format that uses an unsigned 16-bit integer as a primitive data type. To read such an integer, you need to read the two bytes and then combine them into a single number by multiplying one byte by 256, a.k.a. 2^8, and adding it to the other byte. For instance, assuming the binary format specifies that such 16-bit quantities are stored in big-endian[262] form, with the most significant byte first, you can read such a number with this function:

(defun read-u2 (in)

(+ (* (read-byte in) 256) (read-byte in)))

However, Common Lisp provides a more convenient way to perform this kind of bit twiddling. The function LDB, whose name stands for load byte, can be used to extract and set (with SETF) any number of contiguous bits from an integer.[263] The number of bits and their position within the integer is specified with a byte specifier created with the BYTE function. BYTE takes two arguments, the number of bits to extract (or set) and the position of the rightmost bit where the least significant bit is at position zero. LDB takes a byte specifier and the integer from which to extract the bits and returns the positive integer represented by the extracted bits. Thus, you can extract the least significant octet of an integer like this:

(ldb (byte 8 0) #xabcd) ==> 205 ; 205 is #xcd

To get the next octet, you'd use a byte specifier of (byte 8 8) like this:

(ldb (byte 8 8) #xabcd) ==> 171 ; 171 is #xab

You can use LDB with SETF to set the specified bits of an integer stored in a SETFable place.

CL-USER> (defvar *num* 0)

*NUM*

CL-USER> (setf (ldb (byte 8 0) *num*) 128)

128

CL-USER> *num*

128

CL-USER> (setf (ldb (byte 8 8) *num*) 255)

255

CL-USER> *num*

65408

Thus, you can also write read-u2 like this:[264]

(defun read-u2 (in)

(let ((u2 0))

(setf (ldb (byte 8 8) u2) (read-byte in))

(setf (ldb (byte 8 0) u2) (read-byte in))

u2))

To write a number out as a 16-bit integer, you need to extract the individual 8-bit bytes and write them one at a time. To extract the individual bytes, you just need to use LDB with the same byte specifiers.

(defun write-u2 (out value)

(write-byte (ldb (byte 8 8) value) out)

(write-byte (ldb (byte 8 0) value) out))

вернуться

260

In ASCII, the first 32 characters are nonprinting control characters originally used to control the behavior of a Teletype machine, causing it to do such things as sound the bell, back up one character, move to a new line, and move the carriage to the beginning of the line. Of these 32 control characters, only three, the newline, carriage return, and horizontal tab, are typically found in text files.

вернуться

261

Some binary file formats are in-memory data structures—on many operating systems it's possible to map a file into memory, and low-level languages such as C can then treat the region of memory containing the contents of the file just like any other memory; data written to that area of memory is saved to the underlying file when it's unmapped. However, these formats are platform-dependent since the in-memory representation of even such simple data types as integers depends on the hardware on which the program is running. Thus, any file format that's intended to be portable must define a canonical representation for all the data types it uses that can be mapped to the actual in-memory data representation on a particular kind of machine or in a particular language.

вернуться

262

The term big-endian and its opposite, little-endian, borrowed from Jonathan Swift's Gulliver's Travels, refer to the way a multibyte number is represented in an ordered sequence of bytes such as in memory or in a file. For instance, the number 43981, or abcd in hex, represented as a 16-bit quantity, consists of two bytes, ab and cd. It doesn't matter to a computer in what order these two bytes are stored as long as everybody agrees. Of course, whenever there's an arbitrary choice to be made between two equally good options, the one thing you can be sure of is that everybody is not going to agree. For more than you ever wanted to know about it, and to see where the terms big-endian and little-endian were first applied in this fashion, read "On Holy Wars and a Plea for Peace" by Danny Cohen, available at http://khavrinen.lcs.mit.edu/wollman/ien-137.txt.

вернуться

263

LDB and DPB, a related function, were named after the DEC PDP-10 assembly functions that did essentially the same thing. Both functions operate on integers as if they were represented using twos-complement format, regardless of the internal representation used by a particular Common Lisp implementation.

вернуться

264

Common Lisp also provides functions for shifting and masking the bits of integers in a way that may be more familiar to C and Java programmers. For instance, you could write read-u2 yet a third way, using those functions, like this:

(defun read-u2 (in)

(logior (ash (read-byte in) 8) (read-byte in)))

which would be roughly equivalent to this Java method:

public int readU2 (InputStream in) throws IOException {

return (in.read() << 8) | (in.read());

}

The names LOGIOR and ASH are short for LOGical Inclusive OR and Arithmetic SHift. ASH shifts an integer a given number of bits to the left when its second argument is positive or to the right if the second argument is negative. LOGIOR combines integers by logically oring each bit. Another function, LOGAND, performs a bitwise and, which can be used to mask off certain bits. However, for the kinds of bit twiddling you'll need to do in this chapter and the next, LDB and BYTE will be both more convenient and more idiomatic Common Lisp style.