Looks well done (it uses unicode art, so it must be amazing), but I have a fundamental distrust/dislike of record/replay frameworks... it just seems like you're papering over an inherently bad testing approach.
E.g. sure, when replays work, they're great, but:
a) you have to do a manual recording to create them the first time (which means manually setting up test data in production that is just-right for your test case)
b) you have to manually re-record when they fail (and again means you have to manually go back and restore the test data in production that is just-right for your test case... and odds are you weren't the one who originally recorded this test, so good luck guessing exactly what that was).
In both cases, you've accepted that you can't easily set up specific, isolated test data in your upstream system, and are really just doing "slightly faster manual testing".
So, IMO, you should focus on solving the core issue: the uncontrollable upstream system.
Or, if you can't, decouple all of your automated tests from it fully, and just accept that cross-system tests against a datasource you can't control are not a fun/good way to write more than a handful of tests (e.g. ~5 smoke tests are fine, but ~100s of record/replay tests for minute boundary cases sound terrible).
> you have to do a manual recording to create them the first time (which means manually setting up test data in production that is just-right for your test case)
Your test invokes the recorder. There isn't anything manual outside of writing & running your test.
> you have to manually re-record when they fail
Again, nothing manual. If you want to "refresh" a recording with a newer set of responses, it only requires running your test again with Polly in record mode.
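For concreteness, here's roughly what that looks like with Polly's documented setup; `fetchUserProfile` and the endpoint are hypothetical stand-ins for your own code under test:

```js
const assert = require('assert');
const fetch = require('node-fetch'); // goes through Node's http module, which the adapter intercepts
const { Polly } = require('@pollyjs/core');
const NodeHttpAdapter = require('@pollyjs/adapter-node-http');
const FSPersister = require('@pollyjs/persister-fs');

Polly.register(NodeHttpAdapter);
Polly.register(FSPersister);

// Hypothetical code under test: fetches a profile from the upstream service.
async function fetchUserProfile(id) {
  const res = await fetch(`https://api.example.com/users/${id}`);
  return res.json();
}

it('fetches the user profile', async () => {
  // First run: no recording exists, so Polly passes the request through to
  // the real service and persists the HTTP exchange to disk. Subsequent
  // runs replay from that recording; no network involved.
  const polly = new Polly('user-profile', {
    adapters: ['node-http'],
    persister: 'fs',
    recordIfMissing: true, // or mode: 'record' to force a re-record
  });

  const profile = await fetchUserProfile('alice');
  assert.equal(profile.name, 'Alice');

  await polly.stop(); // flushes the recording to disk
});
```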
> In both cases, you've accepted that you can't easily setup specific, isolated test data in your upstream system, and are really just doing "slightly faster manual testing".
This is by no means a replacement for E2E testing. It is a form of acceptance/integration testing where you're testing your application against a point in time at which you verified all systems were talking correctly with your application. E2E tests are much slower and more difficult to debug, and they are the ones intended to capture those breakages in contracts.
It's a tool for your toolbox, reach for it when needed. We plan to release a tutorial/talk that should clear up any misconceptions. There are also other applications for Polly, such as building features offline, or using faker during a demo to easily hide any confidential data.
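On the confidential-data point, a minimal sketch using Polly's beforePersist hook; the header name, response field, and faker call are all illustrative:

```js
const faker = require('faker');

// Fired before each recording is written to disk, so secrets never land
// in version control.
polly.server.any().on('beforePersist', (req, recording) => {
  // Drop auth headers entirely (header name illustrative).
  recording.request.headers = recording.request.headers.filter(
    (header) => header.name !== 'authorization'
  );

  // Replace a confidential response field with plausible fake data.
  const body = JSON.parse(recording.response.content.text);
  body.email = faker.internet.email(); // field name illustrative
  recording.response.content.text = JSON.stringify(body);
});
```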
> It's a tool for your toolbox, reach for it when needed
Sure, apologies for being negative about a tool you've worked on and are rightly proud of. I'm sure you already have more users than any open source project I've ever written. :-)
I struggle a bit at this point in my career: I've made enough mistakes, and seen enough mistakes, that I generally have strong gut opinions on "yeah, that's probably not going to work/scale/etc."
So, when observing new developers/teams starting to "make a mistake" that I've seen before, my gut says "no! bad idea!"...but I know I could be wrong, so it's tempting to say "well, sure, that didn't work for us, but go ahead and try again".
Because, who knows, maybe eventually someone will figure out an innovation that makes a previously-bad approach now tenable, and even best-practice.
But, realistically, that rarely happens, and so teams, orgs, the industry as a whole stumbles around re-making the same mistakes, and codebases/teams/etc. pay the cost.
I've thought a lot about micro-service testing at scale:
Basically there are no easy answers, short of some sort of huge, magical, up-front investment in testing infra that only someone like a top-5/top-10 tech company has the eng resources to do.
So, definitely appreciate needing to do "something else" in the meantime... record/replay is just not a "something else" I would go with. :-)
Yes, sorry for being inexact/overusing the term--I understand the tests drive the recording.
What I meant by manual is getting the e2e system into your test's initial state.
E.g. tests are invariably "world looks like X", "system under test does Y", "world looks like Z".
In record/replay, "world looks like X" is not coded, isolated, or documented in your test; it's instead implicit in "whatever the upstream system looked like when I hit record".
Which is almost always "the developer manually clicked around a test account to make it look like X".
This is basically a giant global variable that will change, and come back to haunt you when recordings fail, b/c you have to a) re-divine what "world looks like X" was for this test, and then b) manually restore the upstream system to that state.
If no one has touched the upstream test data for this specific test case, you're good, but when you get into ~10s/100s of tests, it's tempting to share test accounts, someone accidentally changes one, or you're testing mutations and your test explicitly changes it (so you need to undo the mutation to re-record), or you wrote the test 2 years ago and the upstream system aged off your data.
All of these lead to manually clicking around to re-setup "world looks like X", so yes, that is what I should have limited the "manual" term to.
But in the case we're talking about, where you're reliant on an external service that can change underneath you, "world looks like X" is genuinely not under your control. It feels like pretending that it is will lead to just as many failures as acknowledging its inherent volatility.
Agreed! And, to me, record/replay is still pretending like it's controllable, b/c even if you decouple for replays, records will always be a PITA.
My depressing solution is to just not even try to automate tests against the upstream system, and instead invest in test builders/DSLs that make mocks/stubs on both sides as pleasant as possible (sketched below).
And when bugs slip through, make sure to update your stubs/mocks on both sides to prevent the regression.
To me this gets the most agility and reliability, and will be a test suite that developers don't hate 1-2-5 years down the road.
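To make the builder/DSL idea concrete, a sketch with an entirely hypothetical domain; the point is that "world looks like X" is declared inline in the test instead of living in a recording:

```js
const assert = require('assert');

// Minimal order builder; real ones would cover defaults, ids, relations, etc.
function anOrder() {
  const order = { customer: null, lineItems: [] };
  return {
    withCustomer(name) { order.customer = { name }; return this; },
    withLineItem(title, price) { order.lineItems.push({ title, price }); return this; },
    build() {
      order.total = order.lineItems.reduce((sum, item) => sum + item.price, 0);
      return order;
    },
  };
}

// The upstream stub is just as declarative, and lives on "both sides":
// the same canned response can seed the upstream team's contract tests.
const payments = { charge: async (amount) => ({ status: 'approved', amount }) };

const order = anOrder()
  .withCustomer('Alice')
  .withLineItem('Dune', 9.99)
  .build();

assert.equal(order.total, 9.99);
payments.charge(order.total).then((receipt) => assert.equal(receipt.status, 'approved'));
```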
I also don't like recording frameworks for TDD (and similarly dislike using fixtures). However, the place where a recording framework really pays for itself is in isolating changes in protocols. I've often had to interface with, as you put it, uncontrollable upstream systems. These are systems that are not mine -- they are upstream services from other companies that I have to interact with and I have no control over. Often these systems are badly built and they play fast and loose with the "protocols".
In these cases I like to have an adaptor layer and use a recording framework to "test" the adaptor. That way I can occasionally re-record my scenarios and be notified if something important has changed. Normally what happens is that my service stops working for some unknown reason; I re-record the adaptor scenarios and usually the reason pops out very quickly. All the rest of my code is coded against the adaptor, and I stub the adaptor out in those tests (which I can do reasonably well because I control it).
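A rough sketch of that layering; the endpoint, field names, and response shape are all made up for illustration:

```js
// The adaptor is the only code that knows the upstream's loose "protocol";
// it gets the recorded scenarios, everything else stubs it.
class FlightSearchAdaptor {
  constructor(baseUrl) {
    this.baseUrl = baseUrl;
  }

  async search(from, to, date) {
    const params = new URLSearchParams({ from, to, date });
    const res = await fetch(`${this.baseUrl}/v2/flights?${params}`); // Node 18+ global fetch
    const body = await res.json();
    // Normalize the upstream's shape into one we control.
    return body.results.map((r) => ({
      id: r.flight_id,
      price: Number(r.total_price),
      departsAt: new Date(r.departure_time),
    }));
  }
}

// Application tests stub the adaptor, not the network:
const stubAdaptor = {
  search: async () => [{ id: 'F1', price: 99, departsAt: new Date('2021-06-01') }],
};
```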
I've worked for a while with similar scenarios of needing to integrate with systems beyond my control (e.g. Stripe, Slack, Google), and though I still don't have a good setup for it, I've come to the conclusion that a two-pronged approach would be ideal: record and/or stub the calls and responses from the external services so your normal tests run entirely without external networking (like what Polly allows you to do), but also set up a server to periodically validate the responses from the external services against those that have been recorded, and alert you to any changes in their protocol/behavior. I've yet to see any middleware to tackle the latter (though granted, I haven't been looking too hard yet either).
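For what it's worth, that second prong doesn't need much infrastructure; a minimal sketch of a periodic check, where the fixture path, endpoint, and alertTeam hook are all assumptions:

```js
const fs = require('fs');

// Run nightly: re-fetch a known-stable resource and diff the live response's
// top-level shape against the checked-in recording.
async function checkContract() {
  const recorded = JSON.parse(fs.readFileSync('fixtures/customer.json', 'utf8'));
  const live = await fetch('https://api.example.com/v1/customers/cus_123', { // Node 18+ global fetch
    headers: { Authorization: `Bearer ${process.env.API_KEY}` },
  }).then((res) => res.json());

  // Only catches removed keys; status-code and type drift need more work.
  const missing = Object.keys(recorded).filter((key) => !(key in live));
  if (missing.length > 0) {
    await alertTeam(`Upstream shape drifted: missing ${missing.join(', ')}`); // hypothetical alerting hook
  }
}
```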
Ideally protocols are declarative/documented/typed, e.g. Swagger/gRPC, so you can be more trusting and not need these, but often in REST+JSON that doesn't happen.
> All the rest of my code is coded against the adaptor
Nice, I like (the majority?) of tests being isolated via that abstraction.
Although, if "the protocol" is basically "all of the JSON API calls my webapp makes to the backend REST services", at that point do you end up adapter scenarios (record/replay recordings) for basically every use case anyway? Or do you limit it to only a few endpoints or only a few primary operations?
Yes, I like to have scenarios for all of the endpoints. It definitely doesn't reduce the number of tests :-) The advantage is in isolating the protocol from the operation of the application.
The main complaint I've seen for this approach is, "We shouldn't be testing the other person's system". There is wisdom in that advice, but it really depends on how much you depend on the 3rd party service and how much downtime you can tolerate. For example, I'm working in the travel industry right now and we often rely on small services that nobody has ever heard of. If we can't use the service then we can't sell anything and our site is essentially down. If it happens frequently (and with a lot of these travel services, they often break things weekly if not daily), then your site is not viable. In that case I'll exercise the protocol as much as I can. However, we also talk to marketing services, etc. If that breaks, and it takes a day or two to get it back up, then it's not a major problem -- our marketing effort might be a day late, which is unfortunate, but not game breaking. In that case I'll usually have a smoke test or two.
Also, recording raw HTTP requests makes tests difficult to organize, especially if what you want to mock are requests to a REST server. In that case, all the recorded HTTP headers aren't significant, editing the recorded resources in the responses when the API changes is a pain, and testing scenarios where several REST requests are related (e.g. fetching posts, then comments) is also a pain.
A better alternative IMO is to craft a list of resources in JSON, then use this data in a fake REST server that takes over fetch and XHR in the browser.
Incidentally, that's the way [FakeRest](https://github.com/marmelab/FakeRest) has been working for years (disclaimer: I'm the author of this OSS package).
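For reference, a minimal sketch along the lines of the FakeRest README; the resource data is made up, and the exact import shape varies by FakeRest version:

```js
import { Server } from 'fakerest'; // older builds expose this as FakeRest.Server
import sinon from 'sinon';

// Resources are plain JSON; relations like post_id are just data, so
// "fetch posts, then comments" scenarios stay readable and editable.
const restServer = new Server();
restServer.init({
  posts: [{ id: 1, title: 'Hello world' }],
  comments: [{ id: 1, post_id: 1, body: 'First!' }],
});

// Monkey-patch XMLHttpRequest so the app under test hits the fake server.
const server = sinon.fakeServer.create();
server.respondWith(restServer.getHandler());
```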
Insignificant things like HTTP headers and extra fields are insignificant until they're not. In my experience, manually assembling what you expect an HTTP response to look like often leads to bugs when an "irrelevant" detail suddenly becomes relevant, like when status codes change, or fields are added in a way that breaks a client, etc.
I think recording can be a sharp tool and requires care, but (as a starting point) if you have an automated library that can generate recorded fixtures in a repeatable, automated fashion, you can eliminate a lot of the pain points while still reaping all of the benefits. That's how we set it up where I am: responses and fixtures are regenerated as part of a full suite execution, but persist across individual test runs.
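A sketch of that kind of setup with Polly, where one environment variable (name assumed) flips the whole suite between strict replay and a full re-record:

```js
const { Polly } = require('@pollyjs/core');
// Adapter/persister registration as in the earlier sketch.

function setupPolly(recordingName) {
  return new Polly(recordingName, {
    adapters: ['node-http'],
    persister: 'fs',
    // REFRESH_FIXTURES=1 npm test  -> regenerate fixtures against live services
    // npm test                     -> replay strictly from checked-in recordings
    mode: process.env.REFRESH_FIXTURES ? 'record' : 'replay',
  });
}
```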
Not sure I understand what you are proposing as an alternative. It seems that you can either test against a 'real' system on the other end, which probably means one project pulling, building, and standing up possibly dozens of other services just to run its test suite, or you mock it in some way. I prefer the recording approach as it mocks at the lowest level possible, giving you the most test coverage possible.