Copyright © 2006 Gene Michael Stover. All rights reserved. Permission to copy, store, & view this document unmodified & in its entirety is granted.
Often, you're writing a program that needs to read lines transmitted from another program. Lines of text on many Internet-based protocols are terminated with an ASCII (carriage return, newline) sequence - or rahter, they should be. Your program has to stand alone, so you don't want to obtain the line-reading algorithm as part of a big library. What you really need is a simple function that's ready for you to copy-&-paste into your program.
At least, that's what I often need.
So here it is, implemented in C & Lisp.1 It's licensed to be useful pretty much anywhere.
Here is a copy of the license agremeent that is on the source code. The same agreement appears in each source code file. It's the ``X11 license'' with my name replacing ``the X Consortium''.
Copyright (c) 2006 Gene Michael Stover. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ``Software''), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL GENE MICHAEL STOVER BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Except as contained in this notice, the name of Gene Michael Stover shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from Gene Michael Stover.
Many protocols built on or for the Internet, such as FTP & HTTP, use a convention in which lines of
text are transmitted as ASCII terminted with a
carriage return (CR,
) & then a line feed (LF,
).
Programs implementing such protocols must correctly parse
lines which end with CR LF, & they are encouraged to
parse lines ending with just LF.
The command channel for FTP uses this convention, as do the request & reply headers of HTTP. Telnet & MUDs of every genre depend on it.2
We want a function which will read such lines from a suitable source & return strings with the end-of-line sequence removed.
Here are some secondary requirements, derived from the first set, plus some results of how I plan to implement the function:
When reading from an input that is already at its end, the function indicates end-of-input.
Assume the input contains ``abc'', followed by a proper end-of-line sequence, followed by ``de'', followed by an end-of-line sequence. Like this:
$ hexdump -C $HOME/tmp/a.txt 00000000 61 62 63 0d 0a 64 65 0d 0a |abc..de..| 00000009 $
The first time we call internet-read-line on that input stream (assuming the maximum line length is greater than three), the function gives us ``abc''. The next time, it returns ``de''. The third time, it indicates end-of-input.
Assume that the maximum line length we'll accept is three & the input contains ``abcde'', followed by a proper end-of-line sequence, followed by ``fghi'', followed by an end-of-line sequence. Like this:
$ hexdump -C $HOME/tmp/a.txt 00000000 61 62 63 64 65 0d 0a 66 67 68 69 0d 0a |abcde..fghi..| 0000000d $
Notice that the input contains two lines, but both are longer than the maximum we wil accept.
The first time we call internet-read-line on that input stream (assuming the maximum line length is three), indicates an error.
The function probably consumes the ``abc'' & the ``d''; it is permitted to consume anything because all bets are off after an erroneous input. Strictly speaking, the caller should not call internet-readline on that input source again, but if it did, internet-readline might return ``e''. If the caller used internet-readline on the input again, the function would indicate an error because ``fghi'' is too long.
This example assumed that the maximum line length accepted by the caller is three. If the caller had used a maximum line length of five or more, there would have been no errors. With a maximum line length of five or more, internet-readline would have returned ``abcde'' on the fist call & ``fghi'' on the second call. It would have indicated end-of-input for the third call.
Inputs to the function are
The returned value is a new, dynamically allocated string. On error or end-of-input, the return value is nil. The function does not toss any exceptions.
This algorithm assumes ``short circuit'' evaluation.
Here is the algorithm:
This Lisp implementation works whether or not the computer's native character set is ASCII. If you're reading from a stream whose element-type is character, it assumes the internal character set is ASCII. If you're reading from a stream with another element-type (namely (unsigned-byte 8)), it does not assume the internal character set is ASCII.
The is-strict & max arguments are keywords with defaults.
The function & its test programs are in the file nrdl.lisp. The INTERNET-READLINE & INTERNET-READALL functions only use symbols from Common Lisp, but the test programs require my Lisp unit test framework [1] and my permutation utilities for Lisp [2].
There is an INTERNET-READLINE function which reads one line each time you call it. There is also a INTERNET-READALL function which reads & returns a list of all the lines from the input.
Here is the source code for the functions, embedded in a great big LABELS form:
(labels
((nrdl (read-code to-char &key (is-strict nil) (max 1024))
(declare (type function read-code to-char) (type integer max))
(assert (plusp max))
(labels ((xread-code () (funcall read-code))
(xto-char (code)
(declare (type integer code))
(funcall to-char code))
(is-end (string code)
(declare (type string string))
(or (>= (length string) max)
(member code '(#x0D #x0A nil))))
(is-good-terminator (code)
(or
(and (not is-strict) (member code '(nil #x0A)))
(and (eql code #x0D) (eql (xread-code) #x0A)))))
(let ((first-code (xread-code)))
(and first-code
(do ((x (make-array max :element-type 'character
:adjustable nil :fill-pointer 0))
(code first-code (xread-code)))
((is-end x code)
(and (is-good-terminator code) x))
(vector-push (xto-char code) x))))))
(ascii-char (ascii)
;; If the ASCII value can't be converted to a native
;; character, return that value unchanged. Otherwise,
;; return a character which corresonds to the ASCII value.
;; This is analogous to CODE-CHAR."
(case ascii
;; (0 #\Nul)
;; (1 #\Soh)
;; (2 #\Stx)
;; (3 #\Etx)
;; (4 #\Eot)
;; (5 #\Enq)
;; (6 #\Ack)
;; (7 #\Bel)
(8 #\Backspace)
(9 #\Tab)
;; (10 #\Newline)
;; (11 #\Vt)
;; (12 #\Page)
;; (13 #\Return)
;; (14 #\So)
;; (15 #\Si)
;; (16 #\Dle)
;; (17 #\Dc1)
;; (18 #\Dc2)
;; (19 #\Dc3)
;; (20 #\Dc4)
;; (21 #\Nak)
;; (22 #\Syn)
;; (23 #\Etb)
;; (24 #\Can)
;; (25 #\Em)
;; (26 #\Sub)
;; (27 #\Esc)
;; (28 #\Fs)
;; (29 #\Gs)
;; (30 #\Rs)
;; (31 #\Us)
(32 #\Space)
(33 #\!)
(34 #\")
(35 #\#)
(36 #\$)
(37 #\%)
(38 #\&)
(39 #\')
(40 #\()
(41 #\))
(42 #\*)
(43 #\+)
(44 #\,)
(45 #\-)
(46 #\.)
(47 #\/)
(48 #\0)
(49 #\1)
(50 #\2)
(51 #\3)
(52 #\4)
(53 #\5)
(54 #\6)
(55 #\7)
(56 #\8)
(57 #\9)
(58 #\:)
(59 #\;)
(60 #\<)
(61 #\=)
(62 #\>)
(63 #\?)
(64 #\@)
(65 #\A)
(66 #\B)
(67 #\C)
(68 #\D)
(69 #\E)
(70 #\F)
(71 #\G)
(72 #\H)
(73 #\I)
(74 #\J)
(75 #\K)
(76 #\L)
(77 #\M)
(78 #\N)
(79 #\O)
(80 #\P)
(81 #\Q)
(82 #\R)
(83 #\S)
(84 #\T)
(85 #\U)
(86 #\V)
(87 #\W)
(88 #\X)
(89 #\Y)
(90 #\Z)
(91 #\[)
(92 #\\)
(93 #\])
(94 #\^)
(95 #\_)
(96 #\`)
(97 #\a)
(98 #\b)
(99 #\c)
(100 #\d)
(101 #\e)
(102 #\f)
(103 #\g)
(104 #\h)
(105 #\i)
(106 #\j)
(107 #\k)
(108 #\l)
(109 #\m)
(110 #\n)
(111 #\o)
(112 #\p)
(113 #\q)
(114 #\r)
(115 #\s)
(116 #\t)
(117 #\u)
(118 #\v)
(119 #\w)
(120 #\x)
(121 #\y)
(122 #\z)
(123 #\{)
(124 #\|)
(125 #\})
(126 #\~)
;; (127 #\Rubout)
;; Finally, make a valiant attempt to convert to a native
;; character. If that doesn't work, return ASCII (which
;; is almost sure to cause problems because it is not a
;; character).
(otherwise (or (ignore-errors (code-char ascii))
ascii)))))
(defun internet-readline (strm &key (is-strict nil) (max 1024))
(case (stream-element-type strm)
(character (nrdl #'(lambda ()
(let ((x (read-char strm nil)))
(and x (char-code x))))
#'code-char
:is-strict is-strict
:max max))
(otherwise (nrdl #'(lambda () (read-byte strm nil))
#'ascii-char
:is-strict is-strict
:max max))))
(defun internet-readall (pn &key (is-strict nil) (max 1024)
(element-type 'character))
(with-open-file (strm pn :element-type element-type)
(labels
((next () (internet-readline strm :is-strict is-strict :max max)))
(do ((lst nil (cons line lst))
(line (next) (next)))
((null line) (nreverse lst)))))))
To use it in your own programs, you could LOAD the nrdl.lisp file, or you could copy-&-paste the whole LABELS form into your program. Either way should work.
The C internet_readline function assumes the native character set is ASCII.
It uses Standard C/89 functions with two exceptions.
Instead of malloc & free, it uses
XMALLOC & XFREE. I picked up the habit
of XMALLOC & XFREE from some Gnu source
code. If your program calls XMALLOC & XFREE,
you are free to change the memory allocation system
easily. I often implement xmalloc as a wrapper
about the Boehm collector's GC_alloc. Also, I
often implement xmalloc to abort if we're out
of memory. You could implement xmalloc as a
faster allocator or to align memory on more
restrictive boundaries or whatever.
Anyway, it uses Standard C functions, plus xmalloc & xfree.
The function & its test programs (which I never got around to writing) are in the file nrdl0.c. The internet_readline function reads one line each time you call it. I didn't write a C version of internet_readall, as I did for Lisp.
Here is the source code for the C function:
/*
*/
static Boolean
is_good_terminator (FILE *fp, int c, int is_strict, size_t i,
size_t max)
{
return (!is_strict && (c == -1 || c == 0x0A)) ||
(c == 0x0D && fgetc (fp) == 0x0A);
}
/*
*/
char *
internet_readline (FILE *fp, Boolean is_strict, size_t max)
{
char *x = NULL;
int c, i = 0;
c = fgetc (fp);
if (c == -1) {
/*
* When the input source is already at end-of-file,
* we return NULL.
*/
assert (x == NULL);
} else {
/*
* The input is not already empty, so we allocate space &
* accumulate characters.
*/
x = (char *) xmalloc (max + 1);
while (c != -1 && c != 0x0D && c != 0x0A && i < max) {
x[i++] = c;
c = fgetc (fp);
}
if (is_good_terminator (fp, c, is_strict, i, max)) {
x[i] = '\0';
} else {
/*
* There was an error, so we ditch the memory we've
* allocated. We'll return NULL.
*/
x = (char *) xfree (x);
}
}
return x;
}
To use it in your own programs, copy-&-paste these two functions into your own program. You may need to prototype them in a header file, too.
fixme: Is this worth the time? If so, it must work whether or not I/O exceptions are enabled at compile time. And no, I don't need two versions of the function for those two situations. One function can do both without duplicating code.
/* -*- Mode: C -*-
*
* $Header: /home/gene/library/website/docsrc/nrdl/RCS/nrdl.tex,v 395.1 2008/04/20 17:25:50 gene Exp $
*
* Copyright (c) 2006 Gene Michael Stover. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining
* a copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL GENE MICHAEL STOVER BE LIABLE FOR
* ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
* CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
* WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*
* Except as contained in this notice, the name of Gene Michael Stover
* shall not be used in advertising or otherwise to promote the sale, use
* or other dealings in this Software without prior written authorization
* from Gene Michael Stover.
*/
/* Standard C */
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
/* This */
typedef int Boolean;
#ifndef FALSE
#define FALSE (0)
#endif
#ifndef TRUE
#define TRUE (!0)
#endif
/*
*/
void *
xmalloc (size_t count)
{
void *p;
p = malloc (count);
if (p == NULL) {
fprintf (stderr, "\n%s:%d: malloc (%u) failed", __FILE__, __LINE__,
(unsigned) count);
abort ();
}
return p;
}
/*
*/
static void *
xfree (void *p)
{
if (p == NULL) {
fprintf (stderr, "\n%s:%d: Can't free NULL", __FILE__, __LINE__);
abort ();
}
free (p);
return NULL;
}
/*
* To use in your own programs, copy-&-paste everything
* between the "cut here" comment lines.
*/
/* --- cut here --- begin --- cut here --- */
/*
*/
static Boolean
is_good_terminator (FILE *fp, int c, int is_strict, size_t i,
size_t max)
{
return (!is_strict && (c == -1 || c == 0x0A)) ||
(c == 0x0D && fgetc (fp) == 0x0A);
}
/*
*/
char *
internet_readline (FILE *fp, Boolean is_strict, size_t max)
{
char *x = NULL;
int c, i = 0;
c = fgetc (fp);
if (c == -1) {
/*
* When the input source is already at end-of-file,
* we return NULL.
*/
assert (x == NULL);
} else {
/*
* The input is not already empty, so we allocate space &
* accumulate characters.
*/
x = (char *) xmalloc (max + 1);
while (c != -1 && c != 0x0D && c != 0x0A && i < max) {
x[i++] = c;
c = fgetc (fp);
}
if (is_good_terminator (fp, c, is_strict, i, max)) {
x[i] = '\0';
} else {
/*
* There was an error, so we ditch the memory we've
* allocated. We'll return NULL.
*/
x = (char *) xfree (x);
}
}
return x;
}
/* --- cut here --- end --- cut here */
/*
* Null test. Always succeeds.
*/
static Boolean
S_Test0000 ()
{
return TRUE;
}
/*
*/
static Boolean
S_BasicTest (Boolean is_strict, size_t max, Boolean is_binary,
char octets[], size_t octets_len, char *expected[])
{
Boolean is_good = FALSE;
char pathname[] = "test.txt";
FILE *fp;
size_t i;
char *line;
/* Create the input file */
fp = fopen (pathname, "wb");
if (fp != NULL) {
if (octets_len == 0 || fwrite (octets,
octets_len * sizeof octets[0],
1, fp) == 1) {
is_good = TRUE;
} else {
printf ("\n");
perror ("fwrite");
printf ("%s:%d:", __FILE__, __LINE__);
printf (" Did not write the list of %u octets", octets_len);
printf (" to the temporary file.");
is_good = FALSE;
}
fclose (fp);
} else {
printf ("\n");
perror ("fopen");
printf ("%s:%d:", __FILE__, __LINE__);
printf (" Could not create the temorary file, %s.", pathname);
is_good = FALSE;
}
if (is_good) {
/* Read the input file. */
fp = fopen (pathname, is_binary ? "rb" : "r");
if (fp != NULL) {
i = 0;
line = internet_readline (fp, is_strict, max);
while (line != NULL && expected[i] != NULL &&
strcmp (expected[i], line) == 0) {
line = (char *) xfree (line);
++i;
line = internet_readline (fp, is_strict, max);
}
is_good = expected[i] == NULL && line == NULL;
if (line != NULL) {
line = (char *) xfree (line);
}
fclose (fp);
} else {
printf ("\n");
perror ("fopen");
printf ("\n%s:%d:", __FILE__, __LINE__);
printf (" Could not open %s to read from it.", pathname);
is_good = FALSE;
}
}
if (is_good) {
remove (pathname);
} else {
/*
* There was an error, so we don't delete the temporary
* file. It might be useful to inspect it later.
*/
}
return is_good;
}
/*
*/
#define entry(test) { &test, #test }
static struct {
int (*fn) ();
char *name;
} S_a[] = {
entry (S_Test0000),
{ NULL, NULL }
};
static size_t S_alen = sizeof S_a / sizeof S_a[0];
int
main ()
{
int rc = 0;
size_t i;
for (i = 0; rc == 0 && S_a[i].fn != NULL; ++i) {
printf ("\n%3d%% %s =>", i * 100 / (S_alen - 1),
S_a[i].name);
if ((*S_a[i].fn) ()) {
printf (" good");
} else {
printf (" FAIL");
rc = 2057;
}
}
printf ("\n");
return rc == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
/* --- end of file --- */
Gene Michael Stover 2008-04-20