Escaping the Notebook Trap - Clean Architecture for Apache Spark

As software built with Apache Spark grows in complexity, the “notebook-first” approach usually leads to what software engineering calls a “Big Ball of Mud”. The notebook ends up holding a monolithic application whose design evolved piecemeal. Logic is often implemented directly in notebook cells, components communicate through DataFrames with implicit schemas, and testing becomes increasingly complex and slow.


How to build Python packages reproducibly with Poetry

Poetry is a strong choice for managing dependencies and building Python packages. Its build command generates both source and wheel distributions. A wheel is a pre-built distribution format: it contains files and metadata that only need to be unpacked onto the target system to be installed. A source distribution (sdist), by contrast, still requires a build step before it is usable. But can these formats be used directly in production?


Efficient dependency version management

When working with dependencies, a common question is how to specify them in package files (pyproject.toml, Gemfile, package.json, etc.) and why lock files (poetry.lock, Gemfile.lock, package-lock.json) are needed. In this article, we explore how to make dependency management easy and painless. Let’s dive in!
