Designing a file read utility – a perspectiveThe original of this article was presented in comp.programming and comp.lang.c as an exercise in thinking out a design. There were a number of valuable comments and suggestions. In this page I have preserved the original content of the article in black. Afterthoughts are interpolated in red.
Writing a routine to read lines from a file is one of those little design tasks that seems to create disagreement and confusion. The routine has various names; in this article we will call it fgetline. Here is a take on how to do it. What we want is a routine that keeps reading and returning lines from a file. A simple prototype for this function is char * fgetline(FILE *); The names, getline and fgetline, are already in common use. In the end I decided to use getfline instead.There are some gotchas with this prototype – well, not really gotchas, rather little bits of awkwardness and inefficiency. One is that we are throwing away information that (usually) has to be recomputed, namely the length of the line and, as we shall see, status information. If we want to add the information to our prototype there are several that to do it – it can be passed back through the calling sequence or it can be returned by the function. (We could also pass it to a global – ugh, don’t do it.) My vote is to have it all returned by the function which means we create a struct (record) that looks like: struct fgetline_info { char *line; size_t len; int status; };There are various choices that could be made in the types – season to taste. A second issue is that there is no limit to how large the line can be. A plausible way to provide a limit that is to pass one in as an argument. If we do, our prototype looks like this: struct fgetline_info fgetline(FILE * fptr, size_t maxsize); Here I went wrong. Having a struct (record) to hold information is plausible, but it may be overkill. Be that as it may, in a C implementation this prototype doesn’t work too well. First of all the user has to go through the struct to access the line and the status field. More importantly the normal usage in C would be a while loop, e.g.,This version has a little problem – where did the storage for the line come from? One way to deal with that is for fgetline to allocate storage for the line and for the calling routine to free it when it is done with the line.struct fgetline_info fg; ... while (fg = fgetline(FILE *fptr, size_t maxsize)) {...}This doesn’t work – the idiom requires a test on something that can be “zero”. If we convert the prototype and fg to a pointer the code still doesn’twork. Now the problem is that when the read is successful we don’t need the status field; when the read fails we don’t have the status field. Okay, so how do we do this in fgetline? Well, there is a very simple way to do it, one that has been reinvented time and time again. We start out allocating a standard amount of space (numbers like 32, 64, and 80 bytes are typical) for a line. We read from the file until we either hit an EOL (end of line marker), a failure to read, or we have read as many characters as were allocated. If we haven’t hit the EOL we reallocate the array, commonly by doubling the size, and doing another read. This works, but it is inefficient – we have to call malloc and free for every line. Whether the cost of calls to malloc and free matters obviously depends on the application. However (a) the costs are commonly underestimated, and (b) a general policy of ignoring apparently minor costs can cumulate and produce unexpected inefficiencies.One way to get around this is to use a technique I call highwater buffering. In the file containing the code for fgetline we have two file-scope variables: static char * buffer = 0; static size_t size = 0;In fgetline we have some initialization code that allocates buffer space when the buffer size is zero. Thereafter we use our little doubling trick. The advantage of this scheme is that we have at most a small number of calls to malloc/realloc (the number of doublings needed to get to the largest line) and no calls to free. The very real disadvantage of this scheme is that it produces a dirty copy of the line. By a dirty copy, I mean one that can be scribbled on elsewhere before we are done with it. This will happen whenever fgetline is called anywhere else with any FILE argument before our next read. This is double plus ungood. As a note, producing dirty copies “works” provided that there is only one file being read at a time. Don’t count on it. Is there a way to get the efficiency of the highwater scheme without being dirty? Yes; the trick is that the calling program provides the initial buffer. If we add the buffer information to the prototype we get: fgetline_info fgetline(FILE * fptr, char * buffer, size_t size, size_t maxsize); This is a worst of all worlds prototype – it combines an awkward return type with a cumbersome calling sequence.There are three kinds of copies of the line that we might get from fgetline – clean, transient, and dirty. A clean copy is one that has no encumbrances – it lasts until it is specifically deleted. A transient copy is one that lasts until the next call to fgetline to get another line from the current file. (There may be another file being read elsewhere.) This distinction between kinds of copies that we might get is a critical one; without it either a default is chosen without weighing its merits or, worse yet, concepts are mixed and subtle bugs can slip in. What kind of copy should fgetline provide, clean or transient? There are arguments for each choice. Clean copies are more expensive but safer. Transient copies are (usually) cheaper. In questions like this there is much to be said for the design principle that: When there is a significant choice as to the kind of output being delivered the library routine should let the user make the choice rather than dictating the choice unless offering the choice unduly complicates usage. This is a good maxim but it is not the whole story. If one choice commits the user and the other doesn’t, provide the noncommital result. In this case providing clean copies is the commital choice; if the routine provides a clean copy the user cannot opt to get a transient copy. The choice commits the user to a malloc cost and an eventual free cost. If the routine returns a transient copy the user can duplicate it to get a clean copy.So how do we provide a choice? That is simple; if the buffer pointer in the calling sequence is NULL, fgetline must return a clean copy; otherwise fgetline MAY return a transient copy. I say “may” because it might have to increase the buffer size. If it does the calling routine will have to check whether there was a size increase and, if so, will have to free the returned buffer. (The fgetline implementation can’t realloc the passed in buffer; it will have to do a malloc for the first resizing.) This is an example of a serious error in design. Let us accept for the moment that we want to offer a choice; naturally we ask how to do it. What I did was to envisage an implementation and treat it as the only alternative. Instead of thinking in terms of a potential implementation I should have thought in terms of API alternatives. An obvious alternative is to set a flag instead. It’s probably best to have a value in the status field that signifies the size has been increased; call that value fg_increased and a normal read fg_normal. Then the usage for transient copies might look like this (first cut): ... struct fgetline_info fg; char buffer[80]; int done = 0; .... for(;!done;) { ... fg = fgetline(fptr,buffer,80,FG_MAXSIZE); if (fg.status == fg_normal || fg.status == fg_increased) { /* do stuff */ if (fg.status == fg_increased) free(fg.line); } else done = 1; }If we want clean copies the corresponding loop body would be fg = fgetline(fptr,0,0,FG_MAXSIZE); if (fg.status == fg_normal) { /* do stuff */ free(fg.line); } else done = 1; This schematic code illustrates why it was a bad idea to return a struct and to put the line pointer in the struct. Also it should be be easy enough for the implementation to free that pesky increased line; having the user do it is an unnecessary complication.What sort of return values should be available in the status field? These are the successful ones that occur to me: end_of file The last read (if any) found an EOL marker. The current read finds an immediate EOF. no_increase Normal read - either buffer was null or else it was not increased. increase Normal read - the buffer size has been increased. abn_no_increase Abnormal read - an EOF was found without an EOL. Buffer was not increased. abn_increase Abnormal read - an EOF was found without an EOL. Buffer was increased. The increase/no_increase distinction is unnecessary. That leaves three cases, normal EOF, normal line read, and a premature EOF in the final line.In addition there are numerous kinds of errors that can occur. Calling sequence arguments include: no_file The file pointer is null. bad_buffer One and only one of buffer and size is zero. bad_size Size is 0 or is greater than maxsize bad_maxsize Maxsize is 0Then there are the memory allocation failures: bad_allocate Malloc or realloc failure big_line Line length is greater than maxsize.When there is a memory allocation failure the line and len fields hold what has been successfuly read. This way of proceeding – listing error tags and short explanations – has its faults; it comes from focusing too heavily on potential code. For example, the calling sequence argument errors only apply to a particular choice of calling sequence. A better approach might be to try to analyze the kinds of errors that can occur. Briefly, these include: There are various conventions one could use for assigning code values for the possible return values. My view is that there should be a bit that is set only if there was an error, a bit that is set only if there was a memory allocation error, a bit that is set only if there was a buffer increase, and a bit that is set only if an EOL occurred. The point of doing this is that we can have simple tests to check whether an error occurred, whether some space has to be freed, and whether the file read is complete. This is a useful technique. However to avoid magic numbers disease the positions of the various flag bits should be defined separately in the include file that spells out the file read package prototypes. As a final note, there is one minor decision left unsettled, namely should the returned line include an EOL marker before the string terminating 0. My take is that this matter is too trivial to warrant adding a flag to the calling sequence, and that it will be slightly less confusing if there is one present even if it has to be manufactured. Perhaps someone has a convincing argument one way or the other. The issue here is whether the delivered line contains n (the EOL character) or not. In other do lines look like abcn or abc ? Since we are already committed to using flags we may as well have a flag to select whether to include the EOF character; the appropriate default is to not include it. This page was last updated November 1, 2007. |