by maeln 2212 days ago
Just on a minor note. Having taught programming classes in the past, don't use automated testing for grading, or at least not fully.

First off, there is a lot more to judge in a program than just the correctness of its output. Is the code readable? Reusable? Did the student understand the core concepts of the language and use them correctly? I saw some students fail to produce a correct program but still write well-organized code. On the other hand, one of my students once created a program that was made entirely of switch statements nested inside switch statements (up to 5 levels deep) and sprawled over more than 3000 lines, just because he failed to understand how functions worked.

You can try to plug a linter into your testing stack, a quality gate, all kinds of software to measure code quality, but it will never be truly fair. It will make your evaluation a game where the student's challenge is not understanding the concepts of the class but, instead, understanding the rules of the various tests you put in place. And those rules tend to be even more arbitrary than the teacher himself.

Trying to automate grading is an engineer trying to apply their vision to education and, in my honest opinion, it is an awful idea. Providing tools to facilitate education is awesome; trying to automate education is a dystopia.

12 comments

I had one professor who used private (30%) and public (70%) tests. You could iterate on the public tests up until the submission deadline. But you never saw the private tests until the professor ran them on your program after you submitted it.

I think this was a good middle ground that forced you to think beyond the rigid parameters of the public tests, and discouraged things like “switch” statements.

Quality of code was also a component, as determined by a TA.
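
To make the public/private split concrete, here is a minimal sketch of how such a grade could be computed. It assumes the two percentages are weights applied to the pass rate of each suite, which the comment above does not actually specify; `pass_rate` and `weighted_score` are hypothetical names.

```python
def pass_rate(results):
    """Fraction of tests passed; an empty suite counts as 0."""
    return sum(results) / len(results) if results else 0.0


def weighted_score(public, private, public_weight=0.7, private_weight=0.3):
    """Combine two test suites into a 0-100 score (assumed weighting)."""
    score = public_weight * pass_rate(public) + private_weight * pass_rate(private)
    return round(100 * score, 1)


if __name__ == "__main__":
    # All ten public tests pass, but only one of two private tests does.
    print(weighted_score([True] * 10, [True, False]))
```

A student who tunes a solution only to the visible tests still loses part of the private share, which is the incentive the comment describes. The TA's code-quality component would be added separately.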

I graded for a class that took a similar approach: the students had a small number of tests available to them that allowed them to iterate over the requirements for the assignment, but we had more exhaustive tests that we ran automatically. In addition to checking test completeness, we also looked at the source code itself and graded its quality. The automated runner handled a lot of the hard work for us, but it definitely wasn't a situation where we saw some failed unit tests and then proceeded to fail the students or anything.

The system used to pull this all off is pretty interesting, I think. It's all open source: https://github.com/redkyn

Ha, I did something very similar also with Gitlab and runners. I was always amazed at the new and creative ways students managed to break the autograders we wrote -- particularly for our BASH scripting assignments.

I was also never brave enough to have my autograders submit grades directly to Canvas after the first time I did it. I confidently had it directly submit grades the first week during the first year I implemented autograders. Half the class forgot a shebang on their shell script, and the autograder happily gave them a 0.

For bonus fun times: Gitlab's runner timeout is pretty much nonexistent if you're using the shell runner. It just won't kill jobs that take too long, so I had to write our own management system for runners. Silly university IT didn't support containers :(
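
Both problems described here (a missing shebang turning into an automatic 0, and jobs that never get killed) can be guarded against in the grader itself. The following is only a sketch under assumptions, not the system the commenter built: `preflight` and `run_with_timeout` are hypothetical helpers, and the status strings are made up for illustration.

```python
import subprocess
import sys


def preflight(script_text):
    """Route scripts without a shebang to a human instead of scoring them 0."""
    return "ready" if script_text.startswith("#!") else "needs_review"


def run_with_timeout(cmd, seconds):
    """Run cmd as a child process, stopping it if it exceeds the time limit."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return {"status": "timeout", "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout}


if __name__ == "__main__":
    print(preflight("echo hello\n"))
    print(run_with_timeout([sys.executable, "-c", "print('hi')"], 20)["status"])
```

Note that the timeout only covers the direct child process. A submission that spawns background processes would need process-group handling or a container on top of this.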

One of my favorite classes had a few simple tests provided by the professor, and the rest of the tests were provided by students. Our grade was based on our score on the overall test suite, combined with the quality of our own provided test case, our code quality as judged by a TA, and an external writeup. This was fun because it encouraged students to create brutally hard tests to cover all sorts of corner cases in a way that was openly available to everyone.

You make a good point about not relying fully on automated testing. Still, I would highly recommend using some level of automated testing in every classroom.

Without automated testing, a lot of energy is spent on making sure the program runs and does what it's supposed to. Since students use all different kinds of environments (especially in beginner courses), just getting the program to run is a big challenge. Lint reports are great and I recommend adding them as part of the automated testing stack.

One huge advantage of automated testing is that students know before submitting that they understood the problem correctly, their program does what it's supposed to, and they are submitting it in the format that the instructor is expecting it in. A realtime feedback loop will always result in better submissions and grades.

Of course, once the basics are out of the way, instructors must look into the code to make an evaluation of whether the problem was solved the "right way" - whatever that means in the context of the course.

Source - I maintain a tool that automatically grades CS assignments and have collected a lot of data points over the last few years.

That would be lovely in a world where the student-to-teacher ratio was, say, 20:1 rather than 200:1. Most major CS programs use automated grading for lower-level programming classes; check out Berkeley's huge infrastructure and toolset for their curriculum. There just isn't enough money (or competent TAs, often) to give good individual feedback on code style in intro courses.

Perhaps it is our own fault for somehow giving administrations the impression it is possible to teach well at scales where individual attention is impossible. I don't know.

I've taught a lot of programming courses that have 20:1 or even lower ratios.

I always use automated grading to check for correctness. That way I can give more attention to other aspects of the grading rubric (a bit of style, but most of the time just understanding the thing we're supposed to be learning -- pointers, for loops, specific data structures, etc.)

For smaller programming assignments, you need to check that the right language constructs are used in the right way. E.g., you really have to go in and read the program to be certain that students are using for loops correctly. Just because the code outputs the correct answer and uses the "for" keyword doesn't mean that the student understands how to use a for loop. I don't think it's possible to automate this check.

Similarly, it would be insane to grade a large programming assignment based on test cases alone because a) "program design" is a major part of the assignment and b) it'd be impossible to get full coverage without an insane amount of work.

I think it's a bad idea to not use automated grading, even in 20:1 classes. But of course you should also be reading the code and commenting on it.

Back when I was in college (and that was a while ago now), the 2nd or 3rd year CS students would grade the 1st year CS class (which was required curriculum for all 180 students across all of our technical majors). While we did get paid to do so, it wasn't a lot. There was a strong cultural norm that doing "grutoring" (grading and tutoring) was just what you did.

I suspect that near-peer grading of this kind would be fairly scalable.

I do understand the issue. I personally only taught small classes of fewer than 30 students, and most of my own education was in small classes too. Even with fewer students, it took me a lot of time to prepare classes and evaluations and then grade them in a way that I perceived as "fair". I don't think it would have been possible if there were twice the number of students.

My CS course in Italy was 120 people and tests were checked one by one by the teacher.

> It will make your evaluation a game where the student's challenge is not understanding the concepts of the class but, instead, understanding the rules of the various tests you put in place.

I agree with most of what you have to say; however, you're arguing for replacing one 'game' with another. If the goal is not to understand the grading engine, then by the same logic the goal is to understand what appeals to whoever is marking it.

I do agree that you are replacing one subjectivity (the rules of the tests and tools) with another (the teacher's view of what good code is).

But the idea is that a teacher is not as strict as a tool and is not here to punish students, but instead to grade them fairly so that they can see what they can improve, and to provide assistance for them to improve on those points. Well, at least in theory; it's not that easy to do :) .

The main reason why I disliked automated grading was that, a lot of the time, I could see that a student almost got the right answer, but either didn't have the time to finish or made some small mistakes. An automated tool would have given him 0, the same grade as a student who failed to understand anything or didn't do the work, which I see as highly unfair. He cannot have all the points, but he should have some.

Not valuing the work of students is the quickest way I found to demoralize students and ultimately have them fail the class. And this is the opposite of what a teacher should strive for.
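
The "almost right gets 0" problem is a property of all-or-nothing scoring rather than of automation as such. As a rough sketch (both function names are hypothetical, and awarding points per passed test is only one of many possible partial-credit schemes), compare:

```python
def all_or_nothing(results, max_points=10):
    """Full marks only if every test passes, otherwise 0."""
    return max_points if results and all(results) else 0


def partial_credit(results, max_points=10):
    """Award points in proportion to the number of tests passed."""
    if not results:
        return 0.0
    return round(max_points * sum(results) / len(results), 1)


if __name__ == "__main__":
    almost_right = [True] * 9 + [False]
    print(all_or_nothing(almost_right), partial_credit(almost_right))
```

Proportional scoring still cannot tell a near-miss from a lucky guess on the tests that pass, so it narrows the unfairness described above without replacing a human reading the code.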

That's exactly the point. There is more subjectivity and nuance, and the person marking it is forced to engage in a kind of dialog with the actual code and see it for themselves instead of relying on some limited objective heuristic which has no capacity to understand the student and what the student is misunderstanding.

Agree, automated testing is good as far as it goes but really shouldn't be the whole story with grading.

Another fun tool is CrowdGrader (https://www.crowdgrader.org/). It's a nice way to distribute reviewing/grading throughout the class to make it more scalable. When I was teaching I really liked it: it meant I could assign more interesting creative assignments at a faster pace. It's a different type of "automation".

Back when I taught CS, I had a colleague who did this. He ran the program and compared the output char by char to what he expected. The assignments went into great detail on the expected format of the output (column-by-column). I believe this was all he did, no check of the source code. The sad thing was he was the chair of the department when I first got there. Fortunately, he retired soon afterward.
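
For readers who haven't seen this style of grading, the difference between a strict character-by-character check and a more forgiving one can be sketched as follows. This is an illustration only, not the colleague's actual script; `exact_match` and `normalized_match` are hypothetical names, and collapsing whitespace is just one possible relaxation.

```python
def exact_match(actual, expected):
    """Strict check: every character, including spacing, must be identical."""
    return actual == expected


def normalized_match(actual, expected):
    """Lenient check: compare line by line after collapsing runs of whitespace."""
    def normalize(text):
        return [" ".join(line.split()) for line in text.strip().splitlines()]
    return normalize(actual) == normalize(expected)


if __name__ == "__main__":
    expected = "Name  Score\nAda   10\n"
    actual = "Name Score\nAda 10"
    print(exact_match(actual, expected), normalized_match(actual, expected))
```

Under the strict check, a submission with the right values in the wrong columns fails outright; the lenient check passes it but no longer tests whether the student followed the column-by-column format.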

You know, this isn't a great way to teach in general, but I wouldn't mind it being "in the mix". There's a fair number of people out there who can program but don't have the discipline to understand a somewhat exacting specification.

Someone else writes of all assignments getting a public test suite, which sounds great most of the time but is even worse in making sure that students can understand what an assignment is actually asking.

I'm about to start teaching a middle school competition robotics elective. One of the two difficult grading measures is going to be tests (both closed book and open book) about what the rulebook says and means. (The other will be documentation. All the rest will be subjective and easy and squishy and "easy A").

> I saw some students fail to produce a correct program but still write well-organized code.

I don't necessarily agree with this. Programs are not designed to be readable, they are designed to solve a problem. The first goal is to get the correct program, and only after, it is important to make the code readable. A readable code that doesn't work is not worth much.

As a former student, I think automatically graded assignments are very nice because they force you to have a programmer mindset (the smallest error / typo will make your code fail).

If you also want to grade the organization of the code, fine, but I think the first one is more valuable (how to organize code properly is not learned at school anyway, it is learned by months / years of practice).

> Programs are not designed to be readable, they are designed to solve a problem.

This is 100% incorrect. If a developer submits a PR that works and is fast, but cannot be understood by the rest of the team, it can, should, and will be rejected.

80% of software's lifetime is spent in maintenance. That clever hack or ugly patch might work today, but you're going to hate yourself for it in a month.

There are exceptions, but they're by far the minority case and should be clearly marked as "deep magic".

https://en.wikipedia.org/wiki/Magic_(programming)

The broader theme of that comment was that ‘correct’ is table stakes.

A readable incorrect program is worse than an unreadable correct one. That neither is acceptable doesn’t negate that.

> A readable incorrect program is worse than an unreadable correct one. That neither is acceptable doesn’t negate that.

I will happily fix readable but non-working code rather than maintain working but unreadable code any day under the sun.

We are talking about teaching. You guys wonder why students graduate who can't code: professors passed them because their shit that doesn't work looked nice.

> We are talking about teaching.

I'm following the way of my professor. She cut serious points from non-working code, as expected, but commented on our code at great length regardless of its state.

She explained one-on-one to each of her students why their code didn't work. She similarly reduced the grade of working but unreadable code, and she clearly explained why she did that. You also got bonus points for informed experimenting and for going the extra mile.

She always asked us hard tracing questions in exams. I complained once: "Why do I need to compile this in my mind while the compilers can do all the job?" She calmly answered: "If you can't compile that code in your mind, the compiler can't do it either." It took 6 years for that to sink in, but when it did, I was really enlightened.

With her style, you had to write working code to pass the course, regardless of all the bonus points. If you can't code, you can't pass the exams either. She was great because of this.

Professors should teach students to write readable and working code; there's no exception. She was actively developing NLP systems when I last talked with her. I develop scientific applications. Both of these fields create convoluted code by default, so writing readable code is a really great ability to have.

Students who pass who can't code aren't the ones doing the coding (clear or unclear) in my experience. When doing homework, these students focus on higher level concepts and let the people who love coding do the actual coding.

This is based on my actual experience at university, not a random guess. It's never a case of "teachers weren't strict about working code", it's a case of "student didn't focus on coding, let others do it". Note that in proper CS (as opposed to a "coding school" or bootcamp) this is semi-acceptable, since CS is much more than just coding.

I do find it puzzling, because how can you study CS if you don't enjoy coding? But surprisingly enough, a lot of people who study CS don't actually love coding. Some actually love maths.

Your comment harks to a much broader concept. Are Computer Science programs supposed to produce software engineers or people grounded in the theory?

My own opinion is that separate tracks should exist.

Grading students more harshly without modifying the focus of their studies simply won't produce better people in the field.

There's wisdom out there that an incorrect but clear and well-commented program is better than a correct but unclear one. This is quoted at least in O'Reilly's "Practical C Programming" [1], which is a pretty good book. A lot of programmers believe this, too. I think the concept has merit (without taking it to extremes, of course).

[1] https://www.oreilly.com/library/view/practical-c-programming...

Unfortunately, correct is a function of time, and almost all correct programs become incorrect at certain points of their lives. Maintainability is a thing, and I've learned to lean pretty sharply in the direction of just shitcanning unreadable code.

As a corollary, unreadable code almost always seems to have a lot of latent bugs that simply haven't been observed yet.

I was a student, a software engineer and a teacher :) ! I have to strongly disagree with you. The programmer mindset is not about the smallest error / typo making your code fail. This is just a restriction of the tool, which you need to be able to manage if you want to use it. The electrician mindset is not about knowing which type of screwdriver to use.

As a programmer, the substance of your job is to solve a problem, which you do using various tools. Your ability to understand the problem and properly translate it within the restrictions of your tools makes you a good programmer. Making no typos doesn't make you a good programmer, especially these days when you can usually rely on an extensive CI stack to check for these kinds of small mistakes.

And I do agree that producing a program that is correct is the first goal, but it doesn't mean that other considerations such as the quality of the code shouldn't be valued, as you said. In the same way, properly organizing code may not be the first goal, but it doesn't mean that you shouldn't reward students who clearly tried to think about their code. And I would argue that a lot can be learned in school :) .

Overall it is all about incentives. By grading automatically, you are targeting the lowest bar possible. By also grading other aspects of the student's work, you give him an incentive to work on those aspects, making him not just a machine that can pass unit tests, but also a person who can reason about his own work and try to improve it.

> Programs are not designed to be readable, they are designed to solve a problem.

No. Code should be readable. When code is fresh, only you and God know what it does and how it does it. 6 months later, only God knows how your code works. You need to re-read and understand it to grasp it again.

Unless you marked all the magic with big comment blocks or you wrote your code explicitly, it'll take some headache-inducing hours to re-understand it.

> The first goal is to get the correct program, and only after, it is important to make the code readable. A readable code that doesn't work is not worth much.

Again no. You can leave a non-working but readable code overnight and understand the problem tomorrow morning. You can't remember random noise after a good night's sleep (unrelated: This is why regex is hard for our brains).

Related read: http://raganwald.com/2013/04/02/explicit-versus-clever.html

"Programs are not designed to be readable, they are designed to solve a problem."

Programs are written to solve a problem. They are designed to be understood by others who might have to debug or modify them due to new requirements.

> Programs are not designed to be readable, they are designed to solve a problem

I assume you've not had much experience with any legacy code where if statements span hundreds of lines, variables have cryptic names which mean nothing to anyone but the original author (who is gone), and there is a distinct lack of comments.

I'm in the middle of such a project, and the lack of readability is severely inhibiting my ability to move forward. In fact, I've had to write software to parse this source code, just to follow the flow of logic.

This might be an extreme example, but it highlights how important readability is.

Agree. Solving the problem is first and foremost. Looking pretty, being extensible, well commented, etc. come second, especially when teaching.

I believe that there is a strong argument to be made for the middle ground. Like you, I have been both a teacher and student, often at the same time, and was exposed to the trade-offs that any educational paradigm will have to make.

My current favorite approach is the one taken by Udacity in their "Nanodegree" programs. Much of the coursework is tied to autograders, which offer quick (if not very nuanced) feedback on the work performed. The culmination of each unit is a project evaluated by a human being, allowing for guidance on a more personal level.

In the absence of 1:1 tutoring, a hybrid model like this would be the base for my ideal class structure.

I have used automated grading when running online courses. It means I don't have to worry about assessing correctness myself, and you don't have to take the automated result as the final grade. What I did was: if a submission passed, I would then review it in depth and check everything you mention.

As with person-to-person interaction when teaching, you don't have to rely on a single way of assessing whether the student has understood a concept correctly; a mediocre teacher does that.

I had a very frustrating experience in an operating systems class: the first assignment was a few very basic C programs (pointer use, C strings, etc.). I used some C99 features that didn't compile without a flag in the version of GCC the grader used, so I just got a zero even though the programs were correct.

Idea: since abstractions like functions exist in order to help you prevent bugs in sufficiently complex systems being manipulated iteratively, is the problem not that testing for correctness is bad form, but that the assignment wasn't sufficiently realistic for them to need the abstractions?

I wonder if your student with nested switches was just bored with the assignment. Way back in college we had mandatory lab time which was cake for anybody with a little coding experience and I remember doing stuff like that, overflows, bit arithmetic, convoluted recursion logic and so on to see if anyone would notice. Nobody did... or so I thought.