Really interesting going through your code, thanks for sharing.
Why did you opt for running the parallelization code in a separate Python process (s3op)?
You might want to update the section 'Caution: Overwriting data in S3' in the docs, since S3 has offered strong read-after-write consistency since December 2020.
re: separate process - fault tolerance is a key requirement. There are myriad ways in which highly parallelized, network- and data-intensive code can fail, so isolating it in a separate process is safer than trying to wrap everything in try-except and hoping it works.
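To illustrate the isolation point, here is a minimal sketch of the general pattern. It is not the actual s3op code: `run_isolated` and the two worker snippets are hypothetical names made up for this example. The first worker hard-exits, a failure mode that a try-except in the same process could not catch, and the parent still survives and gets a clean failure signal.

```python
import subprocess
import sys

# Hypothetical workers standing in for a parallel, network-heavy helper.
# WORKER_CRASH hard-exits the interpreter, which no try/except in the
# same process could intercept.
WORKER_CRASH = "import os; os._exit(3)"
WORKER_OK = "print('done')"


def run_isolated(code, timeout=30):
    """Run `code` in a fresh Python process and return (ok, detail)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hung child is killed by subprocess.run and reported as a failure.
        return False, "timed out"
    if proc.returncode != 0:
        # Any crash in the child shows up here as a non-zero exit code.
        return False, f"exit code {proc.returncode}"
    return True, proc.stdout.strip()


if __name__ == "__main__":
    print(run_isolated(WORKER_CRASH))
    print(run_isolated(WORKER_OK))
```

The parent only has to handle two outcomes (success or failure), so it can retry or report the error without caring how the child died.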
Good catch re: the warning about consistency! The docs were written before the change :)
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...