A couple of weeks ago I had to create a method similar to ActiveRecord::Batches#find_in_batches. The goal was to have an abstraction over querying a large data set in chunks. The data needed to be queried in chunks for performance reasons (i.e. lower memory consumption and lower query latency).
An example of a hypothetical implementation of that method can be found in the code sample below.
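What follows is a minimal sketch of what such a method could look like, not the original implementation. To keep it runnable without a database, `records` and `query` are hypothetical stand-ins for a table and a LIMIT/OFFSET query, and `complex_operation` is a placeholder for the real processing.

```ruby
# Hypothetical stand-in for a large table.
def records
  (1..10).to_a
end

# Hypothetical stand-in for a LIMIT/OFFSET query against that table.
def query(offset:, limit:)
  records[offset, limit] || []
end

# Yields the data set to the given block, one batch at a time.
def find_in_batches(batch_size: 3)
  offset = 0
  loop do
    batch = query(offset: offset, limit: batch_size)
    break if batch.empty?

    yield batch
    offset += batch_size
  end
end

# Placeholder for the expensive work; here it just doubles each value.
def complex_operation(batch)
  batch.map { |record| record * 2 }
end

# Intended to run as a single background job. It queries in batches,
# but one job still walks through every batch.
def complex_operation_in_background
  results = []
  find_in_batches { |batch| results.concat(complex_operation(batch)) }
  results
end
```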
I needed find_in_batches so I could perform some operations on the resulting dataset, and eventually I realized that I should do those operations in a background job. The goal was to keep the querying logic and the processing of the data at different abstraction levels, hence the different methods.
The problem with the example above is that even though I am doing the queries in chunks, the background job (i.e. complex_operation_in_background) is still going to process the whole dataset. So the background job will take more time as the dataset grows, which could negatively impact other background jobs in the queue. Ideally, this complex processing operation would be split across multiple delayed jobs, each of which would not take much time to run.
The first solution I considered to address that problem was to use recursion. With recursion, each background job would query and process one batch, until we ran out of batches to process.
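A rough sketch of that recursive approach, under a few assumptions: the job queue is simulated with a plain array of lambdas (a real setup would use a job library), and `records`, `query` and `run_jobs` are hypothetical helpers introduced only for illustration.

```ruby
# Hypothetical stand-ins for a table and a LIMIT/OFFSET query.
def records
  (1..10).to_a
end

def query(offset:, limit:)
  records[offset, limit] || []
end

# Each job queries and processes a single batch, then enqueues the next job.
# Note that the offset bookkeeping lives inside the processing method.
def complex_operation_in_background_with_recursion(queue, results, offset: 0, batch_size: 3)
  batch = query(offset: offset, limit: batch_size)
  return if batch.empty? # no more batches, so the recursion stops

  results.concat(batch.map { |record| record * 2 })

  queue << lambda do
    complex_operation_in_background_with_recursion(
      queue, results, offset: offset + batch_size, batch_size: batch_size
    )
  end
end

# Simulated worker: runs queued jobs one by one and counts them.
def run_jobs(queue)
  jobs_run = 0
  until queue.empty?
    queue.shift.call
    jobs_run += 1
  end
  jobs_run
end
```

Each job here handles at most `batch_size` records, so no single job grows with the size of the dataset.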
My problem with the solution above was that it leaked the knowledge and complexity of querying in batches into the method that processes those batches (i.e. complex_operation_in_background_with_recursion).
Last week I stumbled into Fibers (Thank you RubyTapas!) and decided to give them a try.
A Fiber is a coroutine, a programming primitive I am still trying to wrap my head around. But for the purposes of this example, a Fiber is like a block whose execution can be suspended, passing control back to the caller, and later resumed from the point at which it was suspended.
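To make the suspend and resume behaviour concrete, here is a small illustration using only Ruby's core Fiber class. The variable names are arbitrary.

```ruby
fiber = Fiber.new do
  Fiber.yield 1 # suspend and hand 1 back to the caller
  Fiber.yield 2 # execution picks up here on the second resume
  3             # the block's final value is returned by the last resume
end

first  = fiber.resume # => 1
second = fiber.resume # => 2
third  = fiber.resume # => 3
```

Nothing inside the block runs until the first `resume`, and each `Fiber.yield` hands control back to whoever called `resume`.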
Coroutines (and by extension fibers) have many possible uses, but in this example a fiber is going to be used to create an iterator. The fiber allows us to add iterator-like behaviour to the find_in_batches method, by allowing us to ask for the following batch until there are no more batches. This allows us to have recursion in the background jobs while still encapsulating the complexity of querying in batches.
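One way this could look, again as a sketch rather than the original code: find_in_batches returns a Fiber, and each job simply resumes it to get the next batch. The `records` and `query` helpers and the array-based queue are assumptions made to keep the example self-contained, and returning nil to signal the end is a convention chosen for this sketch.

```ruby
# Hypothetical stand-ins for a table and a LIMIT/OFFSET query.
def records
  (1..10).to_a
end

def query(offset:, limit:)
  records[offset, limit] || []
end

# Returns a Fiber. Every resume hands back the next batch,
# and nil once there are no batches left.
def find_in_batches(batch_size: 3)
  Fiber.new do
    offset = 0
    loop do
      batch = query(offset: offset, limit: batch_size)
      break if batch.empty?

      Fiber.yield batch
      offset += batch_size
    end
    nil
  end
end

# The processing method no longer knows about offsets or batch sizes;
# it only asks for the next batch and enqueues itself again.
def complex_operation_in_background_with_recursion(batches, queue, results)
  batch = batches.resume
  return if batch.nil?

  results.concat(batch.map { |record| record * 2 })
  queue << -> { complex_operation_in_background_with_recursion(batches, queue, results) }
end

queue = []
results = []
batches = find_in_batches
queue << -> { complex_operation_in_background_with_recursion(batches, queue, results) }
queue.shift.call until queue.empty? # simulated worker loop
```

The simulated queue runs in the same process and thread that created the fiber. A Fiber cannot be resumed from another thread, so a job backend that serializes jobs or runs them elsewhere would need a different way to carry the iteration state.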
You can find all the code samples here.