Hacker News new | ask | show | jobs
by yevyevyev 882 days ago
From my understanding, Schemathesis can generate data based on a value being a string, number, boolean, etc. It also seems fairly manual to set up and has a learning curve. Our output is JavaScript that can be run anywhere.

With our TestGen feature, the AI looks at example requests in a HAR file or Swagger examples, or it can solely rely on the name of the property. From there, it automatically generates the correct type and format of data - e.g., if a field is named "address," it generates a value that looks like an address and is formatted in the same way as examples. It wouldn't be practical to cover every potential edge case and scenario without AI.

2 comments

Also, an important note would be that Schemathesis is a property-based testing tool, which does not necessarily imply load testing. I.e. the comparison might not be that helpful as tools have different goals.

However, Schemathesis can use targeted property-based testing to guide the input generation to values more likely to cause slow responses, i.e. it can maximize the response time and at the end, the user can discover that passing `limit=100000000` will read the whole DB table and cause a response timeout (which is a trivial example though)

Schemathesis author here. I hope to clarify a few points here

> From my understanding, Schemathesis can generate data based on a value being a string, number, boolean, etc

Schemathesis can generate data that matches the spec or not based on the config option (specifically meaning JSON Schema based validation) including all the formats (e.g. date, etc) defined by the Open API spec. For GraphQL it supports all built-in scalar types + a handful of popular ones like DateTime or IP. With extra configuration can also generate syntactically invalid data (e.g. invalid JSON). Serialization is a different step - the payloads can be serialized to JSON or XML, YAML, etc. In my private extension, I also use a Python version of `faker` to mix more realistic data into the set.

> It also seems fairly manual to set up and has a learning curve. Our output is JavaScript that can be run anywhere.

The simplest one-off run is `st run <SCHEMA>`, and it is not clear to me what you mean by being fairly manual to set up. If the user already has a schema (or derived it from traffic / generated by a framework, etc), the only thing they need is to invoke the CLI. Surely there are many config options for different scenarios, and one may take more effort to configure than the other.

Everything has a learning curve - more interesting aspects would be whether this learning curve is justifiable and how often the user needs to dive deep into configuration. My aim with Schemathesis is that in 90% its defaults should be enough for most of the users, for the rest 10% there should be as few barriers as possible for the user to accomplish their goal (which often generates data that has a higher probability to uncover defects).

> From there, it automatically generates the correct type and format of data - e.g., if a field is named "address," it generates a value that looks like an address and is formatted in the same way as examples. It wouldn't be practical to cover every potential edge case and scenario without AI.

From the point of view of coverage of the edge cases, the description sounds like a happy-path scenario. What about the deviations?

Also, most fuzzers do a pretty good job in terms of covering edge cases without AI, especially greybox ones. What would be the concrete AI contribution here? Or what is the core difference in covering with AI or without it?