
Schemathesis author here. I hope to clarify a few points.

> From my understanding, Schemathesis can generate data based on a value being a string, number, boolean, etc

Depending on a config option, Schemathesis can generate data that either matches the spec or deliberately does not (where "matches" specifically means passing JSON Schema validation), including all the formats (e.g. date) defined by the OpenAPI spec. For GraphQL, it supports all built-in scalar types plus a handful of popular ones like DateTime or IP. With extra configuration, it can also generate syntactically invalid data (e.g. invalid JSON). Serialization is a separate step: the payloads can be serialized to JSON, XML, YAML, etc. In my private extension, I also use a Python version of `faker` to mix more realistic data into the set.
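To make the valid/invalid distinction and the separate serialization step concrete, here is a toy sketch in plain Python. It is not how Schemathesis is implemented: the schema, the tiny subset of JSON Schema it understands, and every function name are assumptions made up for illustration.

```python
import json
import random
import xml.etree.ElementTree as ET

# Hypothetical schema, not taken from any real API.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "active": {"type": "boolean"},
    },
    "required": ["name", "age", "active"],
}


def generate_valid(schema, rng):
    """Build a value that satisfies the (toy subset of the) schema."""
    kind = schema["type"]
    if kind == "object":
        return {key: generate_valid(sub, rng) for key, sub in schema["properties"].items()}
    if kind == "string":
        return "".join(rng.choice("abc xyz") for _ in range(rng.randint(0, 8)))
    if kind == "integer":
        low = schema.get("minimum", -100)
        return rng.randint(low, low + 100)
    if kind == "boolean":
        return rng.choice([True, False])
    raise ValueError(f"unsupported type: {kind}")


def generate_invalid(schema, rng):
    """Start from a valid object, then give one property a value of the wrong type."""
    payload = generate_valid(schema, rng)
    key = rng.choice(sorted(schema["properties"]))
    expected = schema["properties"][key]["type"]
    payload[key] = 123 if expected == "string" else f"not-a-{expected}"
    return payload


def is_valid(schema, value):
    """Minimal validator for the same toy subset, used to check both generators."""
    kind = schema["type"]
    if kind == "object":
        return (
            isinstance(value, dict)
            and all(key in value for key in schema.get("required", []))
            and all(is_valid(sub, value[key]) for key, sub in schema["properties"].items() if key in value)
        )
    if kind == "string":
        return isinstance(value, str)
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool) and value >= schema.get("minimum", value)
    if kind == "boolean":
        return isinstance(value, bool)
    return False


# Serialization is independent of generation: the same payload, two wire formats.
def to_json(payload):
    return json.dumps(payload)


def to_xml(payload, root="payload"):
    element = ET.Element(root)
    for key, value in payload.items():
        child = ET.SubElement(element, key)
        child.text = value if isinstance(value, str) else json.dumps(value)
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    rng = random.Random(0)
    for payload in (generate_valid(SCHEMA, rng), generate_invalid(SCHEMA, rng)):
        print(is_valid(SCHEMA, payload), to_json(payload), to_xml(payload))
```

The point of the split is that the generator only decides what the payload is, while the serializer decides how it goes over the wire, so either side can be swapped without touching the other.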

> It also seems fairly manual to set up and has a learning curve. Our output is JavaScript that can be run anywhere.

The simplest one-off run is `st run <SCHEMA>`, so it is not clear to me what you mean by it being fairly manual to set up. If the user already has a schema (or has derived it from traffic, generated it with a framework, etc.), the only thing they need to do is invoke the CLI. Sure, there are many config options for different scenarios, and some take more effort to configure than others.

Everything has a learning curve. The more interesting questions are whether that learning curve is justified and how often the user needs to dive deep into configuration. My aim with Schemathesis is that its defaults should be enough in 90% of cases, and that in the remaining 10% there should be as few barriers as possible for the user to accomplish their goal (which is often generating data that has a higher probability of uncovering defects).

> From there, it automatically generates the correct type and format of data - e.g., if a field is named "address," it generates a value that looks like an address and is formatted in the same way as examples. It wouldn't be practical to cover every potential edge case and scenario without AI.

In terms of edge-case coverage, the description sounds like a happy-path scenario. What about the deviations?

Also, most fuzzers do a pretty good job of covering edge cases without AI, especially greybox ones. What would be the concrete AI contribution here? Or what is the core difference between covering them with AI and without it?
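
For readers unfamiliar with the term, a greybox fuzzer uses lightweight feedback from the program (typically code coverage) to decide which inputs are worth mutating further. Below is a deliberately tiny sketch of that loop in plain Python. The `target` function, the mutation strategy, and all names are assumptions invented for this illustration; real greybox fuzzers are far more sophisticated.

```python
import random
import sys


def target(data: bytes) -> None:
    """Hypothetical system under test: fails only on inputs starting with b'bug'."""
    if len(data) > 0 and data[0] == ord("b"):
        if len(data) > 1 and data[1] == ord("u"):
            if len(data) > 2 and data[2] == ord("g"):
                raise RuntimeError("defect reached")


def run_with_coverage(func, data):
    """Run func(data) and return (executed line numbers, raised exception or None)."""
    lines = set()
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace the function under test, and record each executed line.
        if frame.f_code is code:
            if event == "line":
                lines.add(frame.f_lineno)
            return tracer
        return None

    error = None
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(data)
    except Exception as exc:
        error = exc
    finally:
        sys.settrace(previous)
    return lines, error


def mutate(data, rng):
    """Either overwrite one random byte or append a random byte."""
    buf = bytearray(data)
    if buf and rng.random() < 0.5:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    else:
        buf.append(rng.randrange(256))
    return bytes(buf)


def fuzz(func, max_iterations=200_000, seed=0):
    """Coverage-guided loop: keep any input that executes a previously unseen line."""
    rng = random.Random(seed)
    corpus = [b""]
    seen = set()
    for iteration in range(1, max_iterations + 1):
        candidate = mutate(rng.choice(corpus), rng)
        lines, error = run_with_coverage(func, candidate)
        if error is not None:
            return candidate, iteration
        if not lines <= seen:
            seen |= lines
            corpus.append(candidate)
    return None, max_iterations


if __name__ == "__main__":
    crash, iterations = fuzz(target)
    print(f"crashing input {crash!r} found after {iterations} iterations")
```

Because each nested check is rewarded as soon as it is passed, the loop only has to guess one byte at a time, whereas blind random generation would need to guess all three leading bytes at once. No model of what the input "means" is involved, only the coverage signal.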