Affects Version/s: 3.8
Fix Version/s: 4.0
When reclaiming mis-delegated work buckets (e.g. when the original worker fails and other workers are to reclaim its work), it is possible that a bucket is mistakenly reclaimed by multiple workers at once, i.e. it is processed more than once. See TestWorkDistribution.test230WorkerException.
The scenario is like this:
- Worker 1 starts reclaiming the mis-delegated bucket, which involves marking it as DELEGATED and updating worker 1's own work state.
- Before worker 1's own work state is updated, worker 2 checks for mis-delegated buckets. It finds one that is DELEGATED but that no worker has in its own work state (because the operation in worker 1 has not completed yet), so it allocates the bucket to itself.
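The interleaving above can be replayed deterministically in a small sketch. This is not midPoint code: the class `ReclaimRaceSketch`, its methods, and the in-memory collections are hypothetical stand-ins for the coordinator's and workers' work state, assumed here only to show why the check in worker 2 still succeeds.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReclaimRaceSketch {
    // Hypothetical simplification: bucket numbers the coordinator has marked DELEGATED.
    private final Set<Integer> delegatedInCoordinator = new HashSet<>();
    // Hypothetical simplification: worker name -> buckets recorded in its own work state.
    private final Map<String, Set<Integer>> ownWorkState = new HashMap<>();

    ReclaimRaceSketch(String... workers) {
        for (String worker : workers) {
            ownWorkState.put(worker, new HashSet<>());
        }
    }

    // A bucket looks mis-delegated if it is DELEGATED but no worker holds it.
    boolean looksMisDelegated(int bucket) {
        return delegatedInCoordinator.contains(bucket)
                && ownWorkState.values().stream().noneMatch(held -> held.contains(bucket));
    }

    void markDelegated(int bucket) {
        delegatedInCoordinator.add(bucket);
    }

    void recordInOwnWorkState(String worker, int bucket) {
        ownWorkState.get(worker).add(bucket);
    }

    long holders(int bucket) {
        return ownWorkState.values().stream().filter(held -> held.contains(bucket)).count();
    }

    // Replays the scenario step by step and returns how many workers end up with the bucket.
    static long replay() {
        ReclaimRaceSketch state = new ReclaimRaceSketch("worker1", "worker2");
        int bucket = 1;
        state.markDelegated(bucket); // left behind by the failed original worker

        // Worker 1 performs its check and the first half of the reclaim.
        boolean worker1Reclaims = state.looksMisDelegated(bucket);
        if (worker1Reclaims) {
            state.markDelegated(bucket);
        }

        // Worker 2 runs before worker 1 has updated its own work state.
        if (state.looksMisDelegated(bucket)) {
            state.markDelegated(bucket);
            state.recordInOwnWorkState("worker2", bucket);
        }

        // Worker 1 finishes the second half of the reclaim.
        if (worker1Reclaims) {
            state.recordInOwnWorkState("worker1", bucket);
        }
        return state.holders(bucket);
    }

    public static void main(String[] args) {
        System.out.println("workers holding bucket 1: " + replay());
    }
}
```

In this replay both workers end up holding bucket 1, because the DELEGATED marker alone does not say who the bucket was delegated to.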
Possible solutions:
- Add "delegated to" information to a work bucket. Worker 2 would then check whether the delegate exists and is not closed (checking the delegate's work state would not help, as it may not have been updated yet).
- Bind the changes of both tasks' work state (coordinator + worker 1) into a single DB transaction.
- Execute bucket reclaims within a separate task or thread, or in a single dedicated worker.
We would probably go with the first option (adding "delegated to" information to the bucket). It would also eliminate the need to fetch the whole task tree when trying to obtain work buckets to reclaim.
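A minimal sketch of the "delegated to" check follows. It is an illustration only: `DelegatedToSketch`, `Bucket`, `TaskStatus`, and `canReclaim` are hypothetical names, and the sketch assumes that a reclaiming worker writes itself into the bucket's "delegated to" field in the same update that marks the bucket as DELEGATED.

```java
import java.util.HashMap;
import java.util.Map;

public class DelegatedToSketch {
    // Hypothetical task statuses; only CLOSED matters for the check.
    enum TaskStatus { RUNNING, SUSPENDED, CLOSED }

    // Hypothetical bucket carrying the proposed "delegated to" information.
    static final class Bucket {
        final int number;
        final String delegatedTo; // name of the worker task, or null if not delegated

        Bucket(int number, String delegatedTo) {
            this.number = number;
            this.delegatedTo = delegatedTo;
        }
    }

    // A delegated bucket may be reclaimed only if its delegate no longer exists or is closed.
    // Only the delegate itself is looked up, not the work state of every worker.
    static boolean canReclaim(Bucket bucket, Map<String, TaskStatus> workerTasks) {
        if (bucket.delegatedTo == null) {
            return false; // not delegated, so there is nothing to reclaim
        }
        TaskStatus status = workerTasks.get(bucket.delegatedTo);
        return status == null || status == TaskStatus.CLOSED;
    }

    public static void main(String[] args) {
        Map<String, TaskStatus> workerTasks = new HashMap<>();
        workerTasks.put("worker0", TaskStatus.CLOSED);  // the failed original worker
        workerTasks.put("worker1", TaskStatus.RUNNING); // the worker that is reclaiming

        Bucket abandoned = new Bucket(1, "worker0");
        Bucket beingReclaimed = new Bucket(1, "worker1");

        System.out.println(canReclaim(abandoned, workerTasks));
        System.out.println(canReclaim(beingReclaimed, workerTasks));
    }
}
```

Under these assumptions, worker 2 would skip a bucket that worker 1 has already marked as delegated to itself, even though worker 1's own work state has not been updated yet.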