| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by graycat 4846 days ago

1) On malloc() and free(), right, I was free just to write my own. I should have. At various times since for various reasons, I have just written my own.

On your

"K&R and other good C references describe their public interface well and that's all you need to know to use them effectively."

I want more. By analogy, all you need to drive a car is what you see sitting behind the steering wheel, but I also very much want to know what is under the hood.

Generally I concluded that for 'effective' 'ease of use', writing efficient code, diagnosing problems, etc., I want to know what is going on at least one level deeper than the level at which I am making the most usage.

Your example of putting a 100,000 byte array on the stack is an example: Without knowing some about what is going on one level deeper, that seems to be an okay thing to do.

2) My remark about the stack is either not quite correct or is not being interpreted as I intended. For putting an array on a push down stack of storage, I am fully aware of the issues. But on a 'stack', maybe also the one used for such array allocations (that PL/I called 'automatic'; I'm not sure there is any corresponding terminology in C), there is also the arguments passed to functions. It seemed that this stack size had to be requested via the linkage editor, and if too little space was requested then just the argument lists needed for calling functions could cause a 'stack overflow'. A problem was, it was not clear how much space the argument lists took up.

Then there was the issue of passing an array by value. As I recall, that meant that the array would be copied to the same stack as the arguments. Then one array of 100,000 bytes could easily swamp any other uses of the stack for passing argument lists.

But even without passing big 'aggregates' by value or allocating big aggregates as 'automatic' storage in functions, there were dark threats, difficult to analyze or circumvent, of stack overflow. To write reliable software, I want to know more, to be able to estimate what resources I am using and when I might be reaching some limit. In the case of the stack allocated by the linkage editor for argument lists, I didn't have that information.

3) Sure, I could make use of the strings in C as C intended just as you state, just for textural data, but also have to assume a single byte character set.

I thought that that design of strings was too limited for no good reason. That is, with just a slightly different design, could have strings that would work for text with a single byte character set along with a big basket of other data types. That's what was done in Fortran, PL/I, Visual Basic .NET, and string packages people wrote for C.

The situation is similar to what you said about malloc(): All C provided for strings was just a pointer to some storage; all the rest of the string functionality was just in some functions, some of which, but not all, needed the null termination. So, what I did with C strings was just use the functions provided that didn't need the null terminations or write my own little such functions.

As I mentioned, I didn't struggle with null terminated strings; instead right from the start I saw them as just absurd and refused ever to assume that there was a null except in the case when I was given such a string, say, from reading the command line.

It has appeared that null terminated strings have been one of the causes of buffer overflow malware. To me, expecting that a null would be just where C wanted it to be was asking too much for reliable computing.

3) On casts, we seem not to be communicating well.

Data conversions are important, often crucial. As I recall in C, the usual way to ask for a conversion is to ask for a 'cast'. Fine: The strong typing police are pleased, and I don't mind. And at times the 'strongly typed pointers' did save me from some errors.

But the question remained: Exactly how are the conversions done? That is, for the set D of 'element' data types -- strings, bytes, single/double precision integers, single/double precision binary floating point, maybe decimal, fixed and/or floating, and for any distinct a, b in D, say if there is a conversion from a to b and if so what are the details on how it works?

One reason to omit this from K&R would have been that the conversion details were machine dependent, e.g., depended on being on a 12, 16, 24, 32, 48, or 64 bit computer, signed magnitude, 2's complement, etc.

Still, whatever the reasons, I was pushed into writing little test cases to get details, especially on likely 'boundary cases', of how the conversions were done. Not good.

Sure, this means that I am a sucker for using a language closely tied some particular hardware. So far, fine with me: Microsoft documents their software heavily for x86, 32 or 64 bits, from Intel or AMD, and now a 3.0 GHz or so 8 core AMD processor costs less than $200. So I don't mind being tied to x86.

On PL/I: Thankfully, no, it was not nearly the first language I learned. Why thankfully? Because the versions I learned were huge languages. Before PL/I I had used Basic, Fortran, and Algol.

PL/I was a nice example of language design in the 'golden age' of language design, the 1960s. You would likely understand PL/I quickly.

So, PL/I borrowed nesting from Algol, structures from Cobol, arrays and more from Fortran, exceptional condition handling from some themes in operating system design, threading (that it called 'tasking' -- current 'threads' are 'lighter in weight' than the 'tasks' were -- e.g., with 'tasks' all storage allocation was 'task-relative' and was freed when the task ended), and enough in bit manipulation to eliminate most uses of assembler in applications programming. It had some rather nice character I/O and some nice binary I/O for, say, tape. It tried to have some data base I/O, but that was before RDBMS and SQL.

In the source code, subroutines (or functions) could be nested, and then there were some nice scope of name rules. C does that but with only one level of nesting; PL/I permitted essentially arbitrary levels of nesting which at times was darned nice.

Arrays could have several dimensions, and the upper bound and lower bound of each could be any 16 bit integers as long as the lower was <= the upper -- 32 bit integers would have been nicer, and now 64 bit integers. Such array addressing is simple: Just calculate the 'virtual origin', that is, the address of the array component with all the subscripts 0, even if that location is out in the backyard somewhere, and then calculate all the actual component addresses starting with the virtual origin and largely forgetting about the bounds unless have bounds checking turned on. Nice.

A structure was, first-cut, much like a struct in C, that is, an ordered list of possibly distinct data types, except each 'component' could also be a structure so that really was writing out a tree. Then each node in that tree could be an array. So, could have arrays of structures of arrays of structures. Darned useful. Easy to write out, read, understand, and use. And dirt simple to implement just with a slight tweak to ordinary array addressing. So, it was just an 'aggregate', still all in essentially contiguous, sequential storage. So, there was no attempt to have parts of the structure scattered around in storage. E.g., doing a binary de/serialize was easy. The only tricky part was the same as in C: What to do about how to document the alignment of some element data types on certain address range boundaries.

Each aggregate has a 'dope vector' as I described. So, what was in an argument list was a pointer to the dope vector, and it was like a C struct with details on array upper and lower bounds, a pointer to the actual storage, etc.

PL/I had some popularity -- Multics was written in it.

For C, PL/I was solid before C was designed. So, C borrowed too little from what was well known when C was designed. Why? The usual reason given was that C was designed to permit a single pass compiler on a DEC mini-computer with just 8 KB of main memory and no virtual memory. IBM's PL/I needed a 64 KB 360/30. But there were later versions of PL/I that were nice subsets.

It appears that C caught on because DEC's mini computers were comparatively cheap and really popular in technical departments in universities; Unix was essentially free; and C came with Unix. So a lot of students learned C in college. Then as PCs got going, the main compiled programming language used was just C.

Big advantages of C were (1) it had pointers crucial for system programming, (2) needed only a relatively simple compiler, (3) had an open source compiler from Bell Labs, and (4) was so simple that the compiled code could be used in embedded applications, that is, needed next to nothing from an operating system.

The C pointer syntax alone is fine. The difficulty is the syntax of how pointers are used or implied elsewhere in the language. Some aspects of the syntax are so, to borrow from K&R, 'idiosyncratic' that some examples are puzzle problems where I have to get out K&R and review.

To me, such puzzle problems are not good.

I will give just one example of C syntax:

i = j+++++k;

Right: Add 1 to k; add that k to j and assign the result to i; then add one to j. Semi-, pseudo-, quasi-great.

I won't write code like that, and in my startup I don't want us using a language that permits code like that.

2 comments

billforsternz 4845 days ago

Well I certainly salute your passion. I am not nearly dedicated enough to go through this point by point. The stack issue comes down to this; C uses the assembly (i.e. machine) stack. It is an almost ridiculously simple mechanism ideally suited to pass parameters and allocate 'automatic' (yes this is a C term too) data. Avoid large aggregates and arrays on the stack because stack space is limited. Providing you adopt a conservative approach, you never have to worry, 90% of a 2K byte stack in a small embedded system is typically safety factor/headroom.

My personal view is that C offers a perfect tradeoff between simplicity and capability, it has a magical quality that has made it the most important single computer language for nearly 40 years and on into the forseeable future. Increasingly its importance is as a layer that more programmer friendly technology sits upon, but it's no less important for that.

I've read that the difference between chess and go (the oriental game, not golang) is that on Alpha Centauri if little green men play a game that resembles chess, they will almost certainly play a game identical to go. Go is simple enough it is almost inevitable. For me it's almost the same thing (I stress almost), with computer languages and C.

One final point; C syntax is ultimately a matter of taste. If you find this to be a completely obvious, correct and straightforward way of doing a non-overlapping C string copy;

  while( *src )
    *dst++ = *src++;
  *dst = '\0';

Then you 'get' C. If you find it a confusing monstrosity, maybe C isn't your language.

link

graycat 4845 days ago

So, with just 2KB, that machine stack is not the stack (of dynamic descendantcy', that is, the conceptual stack of routines called but not yet returned) for automatic storage. Good to know. So, if only pass pointers or 'element' variables, then a 2KB stack for parameter lists should be okay for small to medium programs unless the programmer actually believed his computer science course that said that recursive routines were good things!

Yes, for your code example, I 'get it'! It's cute! No doubt it's cute.

So, to get 'full credit', dst and src are 'strings', that is, essentially just pointers. Since C pointers have a data type with a length, here the data type of these pointers is byte, or character, or some such with length 1 byte.

Starting off, since src points to a C string that obeys the C standard of null termination, we know that the string has exactly 1 null byte, its last byte. So, for any byte except the last one, it is not null. So, if the string has length more than 1, then as we enter the While, src points to the first byte of the string (or at least the first byte we want to copy); * src is that byte; and * src is not null, that is, is not 0, that is, tests as True for the While (need to know that 0 tests as false and anything else tests as true). So we execute the statement following the While.

That statement says copy byte * src to byte * dst and, then, before considering the semi-colon and before collecting $100, increment both src and dst by the length of their data types, that is, by 1 byte. So, now src points to the next byte in its string and dst points to the target of the next byte in its string. Then we return to the While and do its test again. If we have more bytes to move, then we just rinse and repeat the above. Else src points to the last byte of its string which has to be the null byte so that the byte itself, * src, is the null byte, that is, 0, that is, False in the While statement. Then we leave the While statment, that is, move past its semi-colon. So, net, the While statement moves all the non null bytes we want moved. Of course when get to the next statement, dst points to the last byte of its string, that is, the byte that is to be its null byte (just why that byte is not null already is possibly an issue). At any rate, we want that last byte to be the null byte, so that last statement so assigns that last byte.

For more, 'src' abbreviates 'source' and 'dst' abbreviates 'destination'.

So, yes, it's possible to describe this stuff in English.

So, there's a problem, a significant problem: I 'documented' your code. Okay, but your code is for a string copy. There should be some documentation, but it should be in documentation of a string operation for the language. That is, even just for the documentation, the strings and the copy operation need to be 'abstracted' to a higher level where they can be documented and learned there and set aside the need to document in the code.

Of course, I would write such a copy loop, assuming I wanted to use a loop, using Fortran's Do-Continue, PL/I's Do-End, VB's For-Next. I don't recall the Algol syntax. The PL/I syntax is the same as in Rexx which I use heavily.

Of course in PL/I and VB, I would use the substring function instead of a loop.

And I would fear that too much usage of syntax as sparse as this example here would be more error prone. And for more complicated operations, I would fear that neither God nor the programmer understood the code. I can understand that on some processors with some compilers, that C syntax could lead to faster code, but I'm not thrilled about digging into x86 assembler enough to be sure. Also now it's tough to know what fast code is due to out of order execution, speculative execution, parallel execution, pipelines, three levels of cache, the cache set associative, and cache line invalidates when have several cores. But, computers are so fast now I don't much care, and if I did care I would notice that making such code faster would not make the code in the runtime library or the operating system faster and, thus, might not be able to do much for actual performance of my application.

I don't really mind your example; it's actually not sparse enough to be a serious problem. But I'm not thrilled by the example because I don't take pleasure in that clever sparsity and, again, I'm afraid that it could result in bugs that could hurt my business.

Then there's the issue, I regard as significant, that with that code the C compiler is very short on the 'meaning' of what is being done. Or, you and I can look, read, guess, and agree that we are doing a copy of the tail of one C string to another (or possibly the same if initially src equals dst) string. Okay, you and I can guess that. But the poor compiler can't, and if it tries then I might get torqued when in that loop actually I'm not working with C strings but doing something else. So, then, the compiler will have a heck of a time checking string lengths and not writing on the wrong storage. So, I'd rather have strings as a higher level construct, than just a pointer to some storage from malloc(), so that the language can help me debug my code. I've programmed so many errors in array bounds and string lengths that I don't want to be without some good checking at runtime, at least in some 'debug' mode.

Next, for being 'fast', on at least IBM 360/70, etc. computers, actually that code sample would be slow! Why? Because that instruction set has a single instruction to copy all the bytes in a range of sequential virtual addresses. So, if the compiler knows it is working with a string and knows it is compiling for that instruction set, then it can replace the whole loop with just one instruction.

There was an old remark in the IBM PL/I program logic manual on execution speed: People could complain about PL/I being slower than Fortran. But if PL/I were used at all carefully, then it was faster than Fortran, and one of the main reasons was that PL/I then but not Fortran had strings in the language. So, Fortran programmers wrote collections of string handling functions/subroutines, and the internal logic was much as in your example, move one byte at a time. And as in the standard C library, do this by calling a function/subroutine with its overhead. PL/I's compiled code for strings was in-line and blew the doors off anything in Fortran. For today, and for similar reasons, Visual Basic .NET has a chance to be faster in string manipulations than C code such as in your example. Further, it's super tough to do something with VB strings that would mess up memory.

I don't find your example "a confusing monstrosity", but I greatly prefer to bet my business on VB instead of C/C++.

link

billforsternz 4845 days ago

C strings are performant but place a lot of responsibility on the programmer. C++ strings offer a more accessible, less lightweight but easier to use facility. Rather typically of higher level string abstractions that are standard in non C languages (and which can be constructed as library functions in C), they rely on memory allocation and so will often be less performant.

By little example is basically the strcpy() standard facility. Maybe a better example would be a construct that (roughly) could replace memcpy();

  while( n-- )
    *dst++ = *src++;

This sort of thing just appeals to me as being simple and obvious computing - there's no cleverness to it - and certainly no need to break it down exhaustively to understand it. I think whether this sort of thing appeals might have something to do with prior experience - in my case as an engineer and assembly language programmer;

The equivalent to my memcpy() snippet on the original x86 machines was simply this;

  rep movsb

Put the count in cx, the source ptr in si, the dest ptr in di and the REP prefix will repeat the MOVE STRING BYTE instruction and decrementing cx each time until it hits zero.

link

billforsternz 4845 days ago

Incidentally, writing standard business apps in VB instead of C or C++ makes complete sense to me.

link

duaneb 4846 days ago

C is not perfect. It has its problems (strings, ++, horrible type syntax, no memory allocation, architecture idiosyncrasies like type width). However, it's reliable, fast, and you can basically memorize the language and compile it by hand if necessary. No other language will reliably run on many systems that fast with that much existing code.

anyway, if you think you can prevent bad code by using restrictive languages, you're gonna have a bad time. Any language can be abused. Just don't abuse it, treat your code with respect.

Also I'm pretty sure j+++++k has undefined behavior so you should be shot if you write it.

> in my startup I don't want us using a language that permits code like that.

Well I hope you don't run unix or windows, python or ruby, Firefox, chrome, ie, safari, or opera, or use a smartphone.

link

graycat 4845 days ago

"C is not perfect." Yup, it has some "dark corners" or whatever want to call its flaws.

"No other language will reliably run on many systems that fast with that much existing code." Yup, and just such reasons are why at times I used it. It's in effect also why I'm using some of it now although mostly my code is in Visual Basic .NET (VB): The Microsoft VB documentation is fairly clear ('platform invoke' or some such) on how to call C from VB. Well, I have some old Fortran code I want to call from VB, do have the old Watcom Fortran compiler, but do not have details on how to call Fortran from VB. So, I used the old Bell Labs program f2c to convert the Fortran to C, used the Microsoft C compiler to compile and link to a DLL, then call the DLL from VB. And actually it works. And in effect the reason I can do this is what you said: C is so popular, for the reasons you gave, etc., that Microsoft went to the trouble to say how to call C from VB. Microsoft didn't do that for Fortran, PL/I, Algol, Ada, etc. You are correct that the popularity of C is important.

"anyway, if you think you can prevent bad code by using restrictive languages," Right. Each such restriction eliminates only some cases of bad code.

> Any language can be abused. Just don't abuse it, treat your code with respect.

Right. There is "When a program is written, only God and the programmer understand it. Six months later, only God." Well, so that I could read my code six months later, I wrote only easy to read code. So, I would write

n = n + 1

and not the short version, and would never write i+++++j.

> > in my startup I don't want us using a language that permits code like that.

> Well I hope you don't run unix or windows, python or ruby, Firefox, chrome, ie, safari, or opera, or use a smartphone.

You lost me: I'm using VB and find nearly all the syntax to be fine, that is, easy to learn, read, and write. And the main reason I'm not using C# is what it borrowed from C/C++ syntax. I'm using C only when really necessary. Sure, I use Windows and Firefox; if they are written in C/C++, that's their issue. But by staying with VB, I am successful with my goal of

> in my startup I don't want us using a language that permits code like that.

> Also I'm pretty sure j+++++k has undefined behavior so you should be shot if you write it.

As I recall, I actually tried it once, and it compiled and ran as I explained. And, as you explain, if it works in one heavily used C compiler, then it should work the same in all of them. If look at j+++++k, I suspect that it parses according to the BNF just one way with no ambiguity. So, don't have to write, say,

(j++) + (++k)

link

pests 4845 days ago

According to the linked presentation, slide 13:

"The C specification says that when there is such an ambiguity, munch as much as possible. (The "greedy lexer rule".)"

So j+++++k turns into:

j++ ++ + k

Which is clarified on the next slide.

link

graycat 4845 days ago

Wow!

I would have guessed that j++ ++ was not legal syntax.

So, I was wrong: There are two ways to parse that mess. So, there is ambiguity. And the way they resolve the ambiguity is their 'greedy' rule! Wow!

Net, that tricky stuff is too tricky for me.

There was a famous investor in Boston who said that he only invests in companies only an idiot could run well because the chances were too high that too soon some idiot would be running the company.

Well, I want code, or at least language syntax, that any idiot can understand, for now, me, and later some of the people that might be working for me!

You are way ahead of me on C, and you leave me more afraid of it than I was. But then I was always afraid of it and, in particular, never wrote ++.

link

graycat 4843 days ago

Okay, some clarity from actually running some simple code! Or if K&R didn't make a lot of details clear to me in my fast reading, then maybe some simple test cases will!!!

So, my first issue was the statement for C

     i = j+++++k;

So, to make some tests, I dusted off my ObjectRexx script for doing C compiles, links, and execution.

Platform: Windows XP SP3 with recent updates. And apparently somehow I have

     Visual Studio 2008 x86 32 bit

installed, and it has relevant "tools", e.g., a C/C++ compiler, linker, etc.

I don't use IDEs or Visual Studio and, instead, apparently as a significant fraction of readers at HN, write code with my favorite text editor (e.g., KEdit) and some command line scripts (using ObjectRexx, which is elegant but for better access to Windows services, etc. likely I should convert to Microsoft's PowerShell).

So, I typed in some C code and tried to compile it. Then I encountered again one of the usually unmentioned problems in computing: Software installation and system management. Several hours later I had a C/C++ 'compile, load, and go' (CLG) script working, but my throat was sore from screaming curses at the perversity of 'system management' -- a project of a few minutes with a prerequisite of several hours of system management mud wrestling.

For the mud wrestling, the first problem was, since my last use of C, I had changed my usual boot partition from D to E. Next the version of C installed on E was different from that on D. And the installation on D would not run when E was booted. Bummer.

Next, the C compiler, linker, etc. want a lot of environment variables. Fine with me; generally I like the old PC/DOS idea of environment variables.

However, apparently Microsoft was never very clear on just what software, when, could change the environment variables where. At least I wasn't clear.

So, booting from my partition E, the C/C++ tools want environment variables set as in

     E:\Program Files\Microsoft Visual Studio 9.0\Common7\Tools\vsvars32.bat

Okay. Nice little BAT file.

If run the BAT file from a console window, it changes the environment variables as needed by C/C++. But, in console windows I run a little 'shell script' I wrote in ObjectRexx. I has a few nice features for directory tree walking, etc. But when run the BAT file from the command line of a console window that is running my little shell script, after the BAT file is done and returns, the environment variables have been restored to what they were before running the BAT file. If use a statement, say,

     set >t1

at the end of the BAT file, then file t1 shows that the environment variable values have been changed while the BAT file was still running.

So, sure, there is a 'stack' of invocations of processes, applications, or whatever in the console window and its address space, and, somehow, since my shell script was in the stack, when the BAT file quit the stack and its collection of environment variables was popped back to what they had been.

But eventually I relented, gave up on this little project taking just a few minutes, slowed down, thought a little, read some old notes, discovered that I should change the environment variables within my ObjectRexx script, using an ObjectRexx function for that purpose, as needed by C/C++ CLG, found the needed changes, implemented them, and, presto, got a C/C++ CLG script that works while my shell script is running and while I am booted from my drive E.

On to the C question:

For 'types', the test program has

     int i, j, k;

For

     i = j+++++k;

my guess was that this would parse only one way,

     i = (j++) + (++k)

and be legal. And as I recall, but likely no longer have good notes, some years ago on OS/2, PC/DOS, or an IBM mainframe,

     i = j+++++k;

was legal.

Not now! With the C/C++ tools with

     Visual Studio 2008 x86 32 bit

statement

     i = j+++++k;

gives C/C++ compiler error message

     error C2105:  '++' needs l-value

So, that's an L-value or 'left value' or something that the 'operator' ++ can increment.

So, it wasn't clear how the compiler was parsing. So, I tried

     i = j++ ++ +k;

and it also resulted in

     error C2105:  '++' needs l-value

So, likely the ++ that is causing the problem is the second one.

So, I tried

    i = (j++)++ + k;

and still got

     error C2105:  '++' needs l-value

Then I tried

    i = j++ + ++k;

and it worked as would hope: k was incremented by 1 and added to j, the sum was assigned to i, and then j was incremented by 1.

Then I tried

    i = j+++k;

Surprise! It's legal! j and k are added and the sum is assigned to i, and then j is incremented by 1.

So, I long concluded that to understand some of the tricky, sparse syntax of the language, not clearly explained in K&R, have to write and run test cases as here. Bummer. But, as below, here I'm significantly wrong.

Possible to make sense out of this?

Maybe: If start reading

Brian W. Kernighan and Dennis M. Ritchie, 'The C Programming Language, Second Edition', ISBN 0-13-110362-8, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

in "Appendix A: Reference Manual" on page 191, then hear about 'tokens' and 'white space' to separate tokens.

Okay, no doubt + and ++ are such 'tokens'.

Continuing, right away on page 192 have

"If the input stream has been separated into tokens up to a given input character, the next token is the longest string of characters that could constitute a token."

I would have said "up to and including a given input character", but K&R are 'sparse'!

So, with this parsing rule, in

     j+++k

the tokens are

which is essentially

     (j++) + k

which is legal, but in

     j+++++k

the tokens are

which would be essentially

     (j++)++ + k

where the second ++ does not have an 'L-value' to act on.

So, my remark that

     j+++++k

can parse only one legal way is irrelevant because that is not how the C parsing works.

Basically I was assuming a 'token getting' parsing rule like I've implement a few times in my own work: There are tokens and delimiters, and a 'token' is the longest string of characters bounded by delimiters but not containing a delimiter. The delimiters are white space, (), etc.

K&R seems to have a point: My parsing rule would have trouble with just

     j>=k

and, instead would require writing

     j >= k

which I do anyway.

Generally, though, the C syntax is sparse and tricky, so tricky it stands to be error prone.

Back to writing Visual Basic .NET.

link