Hacker News new | ask | show | jobs
Launch HN: Codeparrot (YC W23) – Automated API testing using production traffic
127 points by royal0203 1183 days ago
Hi HN, we’re Royal and Vedant, co-founders of CodeParrot (https://www.codeparrot.ai/). CodeParrot automates API testing so developers can speed up release cycles and increase test coverage. It captures production traffic and database state to generate test cases that update with every release.

Here’s a short video that shows how it works: https://www.loom.com/share/dd6c12e23ceb43f587814a2fbc165c1f .

As managers of engineering teams (I was CTO of an ed-tech startup, Vedant was the founding engineer of a unicorn company), both of us faced challenges in enforcing high test coverage. We ended up relying a lot on manual testing, but it became hard to scale and led to reduced velocity and more production bugs. This motivated us to build CodeParrot.

How it works: we auto-instrument backend services to capture production traffic. We store the requests and responses of your backend service, as well as the downstream calls it makes, such as DB calls. As part of your CI pipeline, we replay the captured requests whenever your service is updated. The responses are compared with the responses from the production environment, and regressions are highlighted to the developers. To ensure that the same codebase gives the same response in CI and in production, we mock all downstream calls with the values from production.
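To make the capture, replay, and compare loop concrete, here is a minimal Python sketch. It is not CodeParrot's implementation: `CapturedCase`, `make_mock_downstream`, `replay`, and the `get_user_handler` example are hypothetical names, and it assumes the service under test receives its downstream calls through a single callable.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class CapturedCase:
    """One production request plus everything needed to replay it."""
    request: Dict[str, Any]
    response: Dict[str, Any]
    # Recorded downstream calls: (call name, args) -> value seen in production.
    downstream: Dict[Tuple[str, Tuple[Any, ...]], Any] = field(default_factory=dict)


def make_mock_downstream(case: CapturedCase) -> Callable[..., Any]:
    """Return a stand-in for downstream calls that serves recorded values."""
    def call(name: str, *args: Any) -> Any:
        key = (name, args)
        if key not in case.downstream:
            raise LookupError(f"no recorded downstream call for {key!r}")
        return case.downstream[key]
    return call


def replay(handler: Callable[..., Any], cases: List[CapturedCase]) -> List[Dict[str, Any]]:
    """Replay captured requests against a handler and collect regressions."""
    regressions = []
    for case in cases:
        actual = handler(case.request, make_mock_downstream(case))
        if actual != case.response:
            regressions.append(
                {"request": case.request, "expected": case.response, "actual": actual}
            )
    return regressions


# Hypothetical handler under test: looks up a user through a downstream DB call.
def get_user_handler(request, downstream):
    row = downstream("db.query", "SELECT name FROM users WHERE id = %s", request["user_id"])
    return {"status": 200, "name": row["name"]}


case = CapturedCase(
    request={"user_id": 7},
    response={"status": 200, "name": "Ada"},
    downstream={("db.query", ("SELECT name FROM users WHERE id = %s", 7)): {"name": "Ada"}},
)
print(replay(get_user_handler, [case]))  # an empty list means no regressions
```

Because the downstream values come from the recording, a difference between `expected` and `actual` points at the code change rather than at the test environment.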

Most tools that record and replay production traffic for testing capture it at the network layer (as a sidecar or through a load balancer). CodeParrot instead relies on an instrumentation agent (built on top of OpenTelemetry), which lets us capture downstream requests and responses, such as database responses, that are otherwise encrypted at the network layer. This helps us mock downstream calls and compare the response from the CI environment with the one from production. It also lets us sample requests based on code flow and downstream responses, which provides better test coverage than relying on API headers and parameters alone.

Our self-serve product will be out in a few weeks. Meanwhile, we can help you integrate CodeParrot: please reach out at royal@codeparrot.ai or choose a slot here - https://tidycal.com/royal1/schedule-demo. We’ll be selling CodeParrot via a subscription model, but the details are TBD. In addition, we will be open sourcing the project soon.

If you’ve already tried or are thinking of using tools in this space, we’d love to hear your experience and what you care about most. We look forward to everyone’s comments!

21 comments

Nothing gives you more confidence in your system than testing on a high-variance sample of the last 4 months of production traffic.

Especially for rewrites or code refactorings.

I've built this system at many companies myself. Never thought of doing this as a service for others.

Do you issue new sessions and modify the requests?
Modify requests - yes!

New sessions - What is the use case for it? It's definitely possible.

Don't your sessions expire? How do you authenticate to the endpoints?
So true!

> I've built this system at many companies myself.

Interesting! Would I know any of those companies/products?

Amazon has had similar systems operating for the last 10 years.
A few missing details that are crucial to usage within an organization:

1. what is the type of service instrumentation needed to capture the data? Wonder why this is needed when typically the data is already captured in an APM log? The instrumentation might add performance and security concerns.

2. what is the sampling logic to capture the traffic? It might compromise the fidelity of the test data and give a false sense of test accuracy.

3. what is the duration of data capture? Is it a week's or month's or quarterly data? Meeting 90% coverage on a week's production sample data will provide a false metric.

4. can it faithfully handle data privacy and customer anonymization? This is critical for APIs dealing with PCI and other sensitive data.

Yeah, 4 is key. Many privacy regulations stipulate that account data must be deleted within a certain period of time, usually days or less, after a requested account deletion. In this system, all recorded requests would have to be discoverable by the requestor's ID and production systems would have to remember to perform deletions when necessary. Also, this database and all related testing systems would have to be held to production level standards for data access because anyone who can see test data to root cause errors can see people's and businesses' real, private information. Especially for data controlled and regulated industries like government and health care, this would be a nightmare.

It's a neat idea. These kinds of systems often require lots of care and grooming. Since it's used to retroactively test features after they're in production, there's a repeating process of discovering we're saving data we shouldn't, scrubbing, filtering, anonymization, etc. In most cases, I've watched them eventually get replaced by fuzzers. Still, having a central service used by lots of companies may allow this solution to scale up, develop necessary features to solve these problems and function well. I hope it works out!

> 1. what is the type of service instrumentation needed to capture the data? Wonder why this is needed when typically the data is already captured in an APM log? The instrumentation might add performance and security concerns.

Implementation is very similar to an APM log. So the same performance and security concerns apply. We are working on giving both at the same time (Automated tests, and APM), to reduce overhead.

> 2. what is the sampling logic to capture the traffic? It might compromise the fidelity of the test data and give a false sense of test accuracy.

It is random sampling. I feel, 1M or 10M randomly sampled requests should cover all cases.

> 3. what is the duration of data capture? Is it a week's or month's or quarterly data? Meeting 90% coverage on a week's production sample data will provide a false metric.

I was thinking 1 week should be enough. Maybe we will have to add some custom sampling logic for less frequent calls (like monthly crons).

> 4. can it faithfully handle data privacy and customer anonymization? This is critical for APIs dealing with PCI and other sensitive data.

Yes. Additionally, for compliance, we offer a self-hosted solution- Our code runs on your servers and no data ever leaves your cloud / on-prem.

I've been having to put together a test suite at $work similar to what the product tries to offer. If there were a sufficiently advanced product to buy, I'd petition for it to be adopted at my org.

> It is random sampling. I feel, 1M or 10M randomly sampled requests should cover all cases.

1. I suggest providing alternate approaches to sampling: The input itself may have bias towards a single use case. If 70% of the input exercises the same code path, there's no benefit to having a uniform sample. Ideally it would be stratified amongst customers, or perhaps on other dimensions to allow for covering the most surface area.

2. Requests don't happen in a vacuum. They likely have data dependencies on prior requests. I recommend some way of sampling sessions rather than individual requests. Replaying the 3rd request in a series of 6 is likely just going to be exercising failure paths.

3. Behaviors may vary between requests with respect to time. If requests were sampled over a number of days but replayed within a short time period, there are behaviors that could differ from what actually occurs in production.
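The first two suggestions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each captured request is a dict, and the field names `endpoint`, `session_id`, and `seq` are hypothetical, as are the helper names.

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List

Request = Dict[str, Any]


def group_by(requests: List[Request], key: Callable[[Request], Hashable]) -> Dict[Hashable, List[Request]]:
    """Bucket requests by an arbitrary key function."""
    groups: Dict[Hashable, List[Request]] = defaultdict(list)
    for req in requests:
        groups[key(req)].append(req)
    return dict(groups)


def stratified_sample(requests, stratum_key, per_stratum, rng):
    """Take up to per_stratum requests from every stratum, so a dominant
    code path cannot crowd out the rare ones."""
    sample = []
    for members in group_by(requests, stratum_key).values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


def sample_sessions(requests, n_sessions, rng):
    """Pick whole sessions and keep each session's requests in order, so a
    replayed request still sees the requests it depends on."""
    sessions = group_by(requests, lambda r: r["session_id"])
    chosen = rng.sample(sorted(sessions), min(n_sessions, len(sessions)))
    return [sorted(sessions[sid], key=lambda r: r["seq"]) for sid in chosen]


# Synthetic traffic: 70% of requests hit /search, 30% hit /checkout.
requests = [
    {"session_id": f"s{i // 3}", "seq": i % 3,
     "endpoint": "/search" if i % 10 < 7 else "/checkout"}
    for i in range(30)
]
rng = random.Random(0)
picked = stratified_sample(requests, lambda r: r["endpoint"], per_stratum=3, rng=rng)
print(sorted(r["endpoint"] for r in picked))
print([[r["seq"] for r in s] for s in sample_sessions(requests, 2, rng)])
```

The stratum key is the interesting design choice: it could be the endpoint, the customer, or a fingerprint of the code path a request exercised.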

I didn't see any explanation on how results are determined. I think it's important to surface those types of details on the website. I'm not going to watch the video on it in hopes of learning.

> 2. Requests don't happen in a vacuum. They likely have data dependencies on prior requests. I recommend some way of sampling sessions rather than individual requests. Replaying the 3rd request in a series of 6 is likely just going to be exercising failure paths.

This is so true. I ran into this problem when I was trying to implement a kind of sampling that limited total RPS. Thanks for reminding. I am definitely looking into more sophisticated methods of sampling.

> I didn't see any explanation on how results are determined. I think it's important to surface those types of details on the website. I'm not going to watch the video on it in hopes of learning.

Good feedback! :)

Congratulations on the launch, folks. It will be interesting to see the open source product and offering. A self-hosted version of this that makes it easy to change the environment will be a great addition.

BTW, the good folks at https://www.hypertest.co are doing something along the same lines here in India.

Cheers!

Thank you! Will keep you posted on open source version.

We have come across Hypertest, seems pretty cool and useful.

If you are looking at this, you might also be interested in Speedscale [1]. They have been around for a while. Interesting thing is that it is a YC company too.

[1] https://speedscale.com/

Yes, and Ken from speedscale is a very helpful person too.
Neat idea! Two nitpicks from reading the website:

> How does CodeParrot work

This part without a question mark is making me feel like I’m left hanging. If you don’t want a question mark, I would much prefer this to be “How CodeParrot works”.

> CodeParrot … What’s more? …

There’s no question mark where there should be one, and where I don’t expect one I do see one. Here it can be a simple comma instead of a question mark. This flow of language works if you’re speaking live to an audience but on paper it feels awkward.

It bothers me more than it should, mostly because of my reading cadence. Hopefully others can chime in and let me know if I’m off base here or not.

Nice catch, thanks for taking the time to review the website! We have updated it; the change should be reflected in some time.
I worked on something like this briefly years ago and caching and mocking the downstream calls was a beast of a technical problem.

For example, suppose a startup has an Express app that made downstream requests to Postgres, Redis, Kafka, other REST APIs, etc. How do the outgoing requests, which may follow different protocols, with varying serialization formats and handshakes, all get intercepted, recorded, and matched to the same outgoing calls when the session is replayed? How are the outgoing requests indexed and then evaluated for sameness when replayed?

It definitely seems possible to implement something like this as a series of high-level one-off integrations, for example, a middleware package for Django apps. But if the goal is a universal and automatic downstream service mocking system, it's not clear to me where this sort of middleware would even sit in the application stack. It seems like it would need to be pretty low-level, but not too low-level because the data needs to be decrypted first.

Anyway, if you guys have figured out a good way to do this, I'm definitely interested in hearing how you managed it. I'm in no way a great BE engineer, so there could be something elegant or obvious that I missed.

I think the fact they use opentelemetry figures into this. They’ve already done the hard work of intercepting calls to many libraries.
Yes, we rely a lot on openTelemetry for this. They have really good support for most libraries in Java, node and are progressing quickly in others. We are also contributing to it by extending support for other languages, which we'll be open sourcing soon.
If you are contributing, isn’t it open-source by default? Or you mean you have ‘proprietary-ish’ (since it’s all client side, kind of impossible) packages that aren’t part of opentelemetry yet?
Yes, I meant packages that are not part of opentelemetry. For example, Python has a lot of DB packages which don’t have support yet.
Does opentelemetry also support mocking outgoing request values on replay? Or is something else used for that part?
No, it doesn’t by default, but it can be extended to support it.
> For example, suppose a startup has an Express app that made downstream requests to Postgres, Redis, Kafka, other REST APIs, etc

And besides network-bound requests, what about reading from sqlite, files, and interprocess communication?

What about ZeroMQ over UDP? A remote time service. Pulling from a git repo.

No really, I'm asking because I'm trying to figure out how to deal with all this in my tests right now, and I honestly have no idea. But I doubt Codeparrot is going to be able to intercept and intelligently mock everything.

> What about ZeroMQ over UDP? A remote time service. Pulling from a git repo.

> No really, I'm asking because I'm trying to figure out how to deal with all this in my tests right now, and I honestly have no idea. But I doubt Codeparrot is going to be able to intercept and intelligently mock everything.

In theory, *everything* can be recorded and replayed; that's the magic of patching in at the application layer. *Any* function call can be recorded and replayed.
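As a rough illustration of patching in at the application layer, the Python sketch below wraps an arbitrary function so that its results are recorded in one mode and served back in another. `Recorder` and `PriceService` are hypothetical names, and matching calls by the `repr` of their arguments is a simplifying assumption rather than a description of how any real agent works.

```python
import functools
import itertools
from typing import Any, Dict, List, Tuple


class Recorder:
    """Monkey-patches functions to record their results, then replays them."""

    def __init__(self) -> None:
        self.mode = "record"  # switch to "replay" to serve recorded values
        self.calls: Dict[Tuple[str, str], List[Any]] = {}

    def patch(self, owner: Any, attr: str) -> None:
        original = getattr(owner, attr)
        name = f"{getattr(owner, '__name__', type(owner).__name__)}.{attr}"

        @functools.wraps(original)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Assumption: identical argument reprs identify "the same call".
            key = (name, repr((args, sorted(kwargs.items()))))
            if self.mode == "replay":
                return self.calls[key].pop(0)  # no real call is made
            result = original(*args, **kwargs)
            self.calls.setdefault(key, []).append(result)
            return result

        setattr(owner, attr, wrapper)


class PriceService:
    """Hypothetical downstream dependency whose answer changes on every call."""
    _ticks = itertools.count(1)

    @staticmethod
    def quote(symbol: str) -> Dict[str, Any]:
        return {"symbol": symbol, "price": 100 + next(PriceService._ticks)}


recorder = Recorder()
recorder.patch(PriceService, "quote")
live = PriceService.quote("ACME")      # real call, result is recorded
recorder.mode = "replay"
replayed = PriceService.quote("ACME")  # served from the recording
print(live == replayed)
```

The same wrapper works for a DB driver method or an HTTP client function, which is why languages that allow this kind of patching need no changes to user code.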

In practice, we support whatever is demanded by our customers. If people are ready to pay for something, we are happy to build support for it.

I love to see more activity in this area!

I'm the maintainer of GoReplay https://github.com/buger/goreplay and have worked in this area for the last 10 years.

It is quite a hard problem to solve, because you have to deal with the state difference between test and production environments. Love your approach of mocking dependencies and leveraging OpenTelemetry. It can potentially solve some of the state issues. But it still requires modifying user code. I wonder if it can be done purely using OpenTelemetry (e.g. you depend on a typical OTel setup), and then read the data directly from the OTel DB.

Cheers!

Thanks Leonid! Your vote of confidence means a lot.

OTel for Go requires user code changes. Languages that allow monkey-patching (Java, JS, Python, etc.) don't need them.

> I wonder if it can be done purely using OpenTelemetry (e.g. you depend on a typical OTel setup), and then read the data directly from the OTel DB.

OTel doesn't work out of the box. OTel usually doesn't collect request or response for any network or db call. 90% of my time is spent on extending the individual agents' code; so that they can collect additional required information, and perform "replay".

GoReplay has been one of our inspirations, Leonid, so glad you checked out CodeParrot :)

A typical OTel implementation doesn’t capture some request data, especially parameters, and the replay part is missing, among a few other issues, so we need to extend it.

Thank you, Leonid, for GoReplay. A great ecosystem of products will be built on top of it.
I hope so! But I also hope that I will be able to monetise some of this movement. GoReplay is dual licensed under AGPL and a commercial license. I also sell special appliance licenses.

If anyone in this thread wants to build a product based on GoReplay technology (capture network traffic directly, via AWS Traffic Mirroring or k8s), send me a message :)

Just this week I did something similar for a line-of-business app API I'm migrating. I added code to the API to record all unique requests to a .http/.rest format file (natively supported in the latest version of Visual Studio 2022 and via an extension in VS Code). I can then play those back manually or via automated integration tests that read the .http files. Yes, I'm hand-waving over authentication and downstream database and 3rd party APIs, but overall it's working well to quickly test against tons of production-like API calls.
Wow very interesting!

Did you manage to play it back manually and/or via tests? I ran into unexpected challenges while doing this.

Yes, I have it set up where I can play the tests manually via the .http file or automated via a test suite. In general, I parse the .http file: each ### represents a new test, the next line is METHOD URL, the next lines are headers, then an empty line, then the request body. It would vary based on the test framework, but for mstest, it looks like this:

  [DataTestMethod]
  [DynamicData(nameof(GetAllHttpTests), DynamicDataSourceType.Method, DynamicDataDisplayName = nameof(GetTestDisplayName))]
  public async Task RunHttpTests(string endpoint, string method, string data){...}

To be clear, this is integration testing, not unit. Testing that certain HTTP requests work, not necessarily that they are correct.
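The parsing step described above can be sketched in Python as follows. This is an illustration of the layout as described in the comment (a `###` line, then METHOD URL, then headers, a blank line, and the body), not the commenter's code; `HttpTest`, `parse_http_file`, and the sample URLs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HttpTest:
    name: str
    method: str
    url: str
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""


def parse_http_file(text: str) -> List[HttpTest]:
    """Split a .http file into tests: '###' starts a test, then METHOD URL,
    then headers, a blank line, and finally the request body."""
    blocks: List[dict] = []
    for line in text.splitlines():
        if line.startswith("###"):
            blocks.append({"name": line[3:].strip(), "lines": []})
        elif blocks:
            blocks[-1]["lines"].append(line)

    tests = []
    for block in blocks:
        lines = block["lines"]
        while lines and not lines[0].strip():  # skip blank lines after '###'
            lines.pop(0)
        if not lines:
            continue
        parts = lines[0].split()
        method, url = parts[0], parts[1]
        headers: Dict[str, str] = {}
        i = 1
        while i < len(lines) and lines[i].strip():  # headers run until a blank line
            key, _, value = lines[i].partition(":")
            headers[key.strip()] = value.strip()
            i += 1
        body = "\n".join(lines[i + 1:]).strip()
        tests.append(HttpTest(block["name"], method, url, headers, body))
    return tests


SAMPLE = """\
### list users
GET https://example.test/api/users
Accept: application/json

### create user
POST https://example.test/api/users
Content-Type: application/json

{"name": "Ada"}
"""

for test in parse_http_file(SAMPLE):
    print(test.method, test.url, test.headers, repr(test.body))
```

Each parsed `HttpTest` can then be fed to a data-driven test, which is the role `DynamicData` plays in the mstest signature above.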

I built something like this years ago. It's just on the edge of roll-your-own-solution.

I definitely agree with your approach of auto mocking the database and/or third party services too. That's what I did with my home rolled solution.

Happy to see more projects in the space. Generating tests by snapshot including the DB and service calls always seemed like the obvious way to go for me.

As a dev I'm not keen on using services which I could easily replicate so this would have to be both substantial and cheap/free for smaller teams.

I can relate to this perspective, however, some complexities we have come across in building this so far:

- Support for a high number of languages and downstream dependencies

- Intelligent sampling to choose requests with high coverage and auto update them over time

- Performance, safety and data compliance guarantees

Isn't your tech fragile? Feels like it'd break/provide false positives easily. Integration is also not easy, the matrix of possible tech stacks is big. It doesn't test new edge cases, so it's useful only for regression, which is important but if your testing hygiene is good, which means devs are writing tests, you wouldn't have the need for your product, am i wrong? Good luck either way. Seems like VCs are gambling at products that save costs.
Good observation, it's challenging to solve these problems, here's how we are going about it -

To reduce false positives - we run the same request twice to eliminate flaky fields in the response, like the current timestamp, mock the downstream dependencies as they behaved in the prod env, and provide options to ignore / modify the sampled requests.

To make integration easier - we are building on top of opentelemetry, which has seen a remarkable increase in support across languages / frameworks, which makes it easier for us to support different tech stacks.

Regression testing - our primary goal is to provide regression tests. We have come across two types of teams where this makes sense - companies with low test coverage and companies with a high number of microservices, as they find it hard to cover every production scenario in tests.
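The "run the same request twice" idea for false positives can be illustrated with a short Python sketch. It assumes responses are JSON-like dicts and lists; `flatten`, `differing_paths`, and `regressions` are hypothetical helpers, not part of any product.

```python
from typing import Any, Dict, Set

_MISSING = object()


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into {dotted.path: leaf value}."""
    if isinstance(value, dict) and value:
        items = value.items()
    elif isinstance(value, list) and value:
        items = enumerate(value)
    else:
        return {prefix: value}
    flat: Dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def differing_paths(a: Any, b: Any) -> Set[str]:
    """Paths whose values differ, or that exist in only one response."""
    fa, fb = flatten(a), flatten(b)
    return {p for p in fa.keys() | fb.keys()
            if fa.get(p, _MISSING) != fb.get(p, _MISSING)}


def regressions(expected: Any, actual: Any, ignore: Set[str]) -> Set[str]:
    """Differences against production, minus fields known to be flaky."""
    return differing_paths(expected, actual) - ignore


# Two runs of the same request against the same build differ only in flaky fields.
run_1 = {"id": 7, "total": 40, "meta": {"generated_at": "2023-03-01T10:00:00Z"}}
run_2 = {"id": 7, "total": 40, "meta": {"generated_at": "2023-03-01T10:00:05Z"}}
flaky = differing_paths(run_1, run_2)

production = {"id": 7, "total": 42, "meta": {"generated_at": "2023-02-20T08:30:00Z"}}
print(flaky)
print(regressions(production, run_1, flaky))
```

Here the timestamp is ignored because it changed between two runs of identical code, while the changed `total` is still reported.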

I swear I see exactly the same idea at least once a month on HN.
Are the requests being only replayed, or is there some amount of mutation going on too, to potentially reach buggy states?

If the latter, I'm wondering how it might compare to tooling with a similar intent, e.g. https://www.microsoft.com/en-us/research/publication/restler...?

As of now, there is no automated mutation, although we do give developers the option to modify the request.

In my experience, fuzz testing is more helpful from a DAST / security testing perspective, and we were thinking of adding it later.

Neat concept. In regulated environments, how would you propose implementing this to minimize the spread of live, regulated data into non-prod environments? think full PANs (PCI) etc.

I don't know that there's necessarily a wrong answer here (well, there probably are, but wrong only in the sense that a given solution might be prohibited by the regulation), just want to see how y'all have thought through the prompt.

We have an anonymiser which identifies common sensitive / personally identifiable data, like credit card numbers and zip codes, and replaces it with anonymised data.

We also provide a configuration option to specify additional fields that need to be anonymised.
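For readers wondering what such an anonymiser might look like, here is a minimal Python sketch for card numbers only. It is an assumption-laden illustration, not CodeParrot's anonymiser: `CARD_RE`, `luhn_valid`, `pseudonymize_digits`, and `anonymize` are hypothetical, and replacing values deterministically with a keyed HMAC is a design choice made here so the same card maps to the same fake value everywhere.

```python
import hashlib
import hmac
import re

# 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to tell card numbers from other long numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def pseudonymize_digits(digits: str, secret: bytes) -> str:
    """Map digits to a same-length digit string, stable for a given secret."""
    mac = hmac.new(secret, digits.encode(), hashlib.sha256).hexdigest()
    stream = str(int(mac, 16))
    return stream[: len(digits)].rjust(len(digits), "0")


def anonymize(text: str, secret: bytes) -> str:
    """Replace Luhn-valid card numbers, keeping separators and length."""
    def replace(match: "re.Match[str]") -> str:
        raw = match.group(0)
        digits = re.sub(r"\D", "", raw)
        if not luhn_valid(digits):
            return raw  # likely an order id or similar, leave untouched
        fake = iter(pseudonymize_digits(digits, secret))
        return "".join(next(fake) if ch.isdigit() else ch for ch in raw)

    return CARD_RE.sub(replace, text)


secret = b"rotate-me"  # hypothetical key; keep it out of the test environment
print(anonymize("card 4111 1111 1111 1111, order 1234567890123", secret))
```

Note that the replacement keeps the shape of the value but is not guaranteed to pass a Luhn check itself, so code that validates card numbers during replay would need a format-preserving scheme that fixes up the check digit.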

Do you tokenize that data so that it stays consistent through all the flows? Fits the same parameters etc?
This seems like the burning question, maybe that's the .ai...

HAR can already be recorded in middleware (e.g. loadmill/har-recorder) and replayed in multiple CI compatible ways.

How would that potentially require AI?
Unrelated to this but responding to your other question that was deleted but still valuable:

> What do you mean by "live, regulated data into non-prod environments" exactly? Could you provide some examples?

Credit card numbers, card verification codes, protected health data/electronic medical records... list goes on.

Every environment that has live data in it functionally increases the valuable attack surface for most adversaries. I.e., why bother attacking production when they can slurp up production data from test environments that are less likely to be well protected?

As for the "AI," I think the op was just commenting on the TLD used by the startup.

"All the downstream services are mocked, you don’t need to set up a test environment."

Could you elaborate on how this is achieved? Eg, say I have a lambda endpoint and the code in it is querying one or more databases. Are you somehow automatically hooking into those function calls, recording their return values, and then mocking the functions in the replays? Or are you doing something else entirely?

Are you somehow automatically hooking into those function calls, recording their return values, and then mocking the functions in the replays? - this is correct.
Congratulations on the launch!

Quick question, how do you (plan to) deal with schema/API change? Or is the tool intended more for regression testing?

New API tests can be generated by enabling the agent locally and/or on a staging env.

It's simply a matter of using a baseline of: `production` (usually for regression tests), `staging` or `local`.

Nice! How do you deal with:

1. Statefulness: simple example, api call that returns a query from the DB that has a high transaction load

2. Non determinism: for example returns a random number, guid or a time

3. Privacy requirements for certifications or legislation that mean private data cannot be used in a test environment.

4. GDPR laws

5. Authentication

I presume this is more microservice friendly and monolith unfriendly. But that is probably a reason on the pro side of smaller, bounded services.

> 1. Statefulness: simple example, api call that returns a query from the DB that has a high transaction load

We natively handle API and DB calls.

> 2. Non determinism: for example returns a random number, guid or a time

This is a bit tricky. We run the same test multiple times, and ignore the changing fields.

> 3. Privacy requirements for certifications or legislation that mean private data cannot be used in a test environment.

> 4. GDPR laws

We anonymise *any* private data before storing it for replay.

> 5. Authentication

We capture auth headers, and mock auth server responses.

> I presume this is more microservice friendly and monolith unfriendly. But that is probably a reason on the pro side of smaller, bounded services.

Currently yes. APIs can be easily tested. Tests for function calls will require user configuration.

Thanks, they are great answers. I think as always you can’t expect to magically throw a test tool at the application. You need some effort to make it testable. Moving out auth and anything else stateful to another service that can be mocked helps.
Does it work the other way too? It would be nice to also get the response from downstream services, and test if the app is still massaging the data correctly and generating the correct output.
We can do that too!
Oh wow, we're building something very similar to this. We were literally just filling up the YC form for S23 too haha. I may be a bit biased, but I think this is an excellent idea, and it can change the way testing works for backend services. We were inspired by Meticulous, too! Hope we can learn from y'all as well :)
Nice! Happy to share our experience if it helps :)
How is this different from Meticulous, another YC company?
It's similar in the sense that both rely on production traffic and user sessions to generate tests. However, we are focusing on API testing and I think Meticulous is building for UI testing.
Looks promising! Would love to know how this is different from creating postman collections, and running tests on those
- Postman collections have to be created manually by a developer (usually).

- Downstream calls (think DB, 3rd party API calls, kafka) are not handled.

Nice one guys! Congrats on the launch!