Bug
Too many simultaneous jobs cause system failure
Issue description
In the case at hand, a programming error scheduled 200 jobs to run at the same time. The job runner picked them up and started them one after another with a 100 ms delay between starts. Since a single job took roughly 30 seconds, all 200 jobs had been started (after about 20 seconds) before the first one finished. The system started failing because no additional MySQL connections could be established.
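The arithmetic behind the overload can be checked with a small sketch. The function name and the simplifications (every job takes exactly the same time, starts are evenly spaced, nothing limits concurrency) are assumptions made for illustration, not part of the topincs code.

```python
def peak_concurrency(n_jobs: int, start_interval_ms: int, duration_ms: int) -> int:
    """Highest number of jobs running at once when job i starts at
    i * start_interval_ms and every job runs for duration_ms (assumed model)."""
    # A running job blocks a "window" of duration_ms; at most
    # ceil(duration / interval) starts fit into one such window.
    starts_per_window = -(-duration_ms // start_interval_ms)  # integer ceiling
    return min(n_jobs, starts_per_window)


# Figures from the issue description: 200 jobs, 100 ms apart, ~30 s each.
print(peak_concurrency(200, 100, 30_000))  # 200: every job runs at the same time
```

Under these assumptions the 100 ms delay does not help: up to 300 jobs could pile up before the first one finishes, so all 200 ended up running concurrently.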
Developer comments
The system behavior was improved so that a simple programming bug cannot have such dramatic consequences. Instead of blindly starting every job that is thrown at it, the topincs service now keeps track of how many jobs are running. When the hardcoded limit of 10 is reached, it stops starting new jobs but keeps checking whether running jobs have finished in the meantime. It also rejects new jobs if the system load is above a computed limit. A regression test for this was implemented.
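The behavior described above could look roughly like the following sketch. It is not the actual topincs implementation: the class and method names, the `DemoJob` stand-in, and the choice of the CPU count as the load limit are all assumptions made for illustration (the report does not say how the limit is computed).

```python
import os

MAX_RUNNING_JOBS = 10  # the hardcoded limit mentioned in the report


class DemoJob:
    """Hypothetical stand-in for a real job; only tracks its own state."""

    def __init__(self):
        self.started = False
        self.done = False

    def start(self):
        self.started = True

    def is_done(self):
        return self.done


class ThrottledRunner:
    """Starts jobs only while below a job-count limit and a load limit."""

    def __init__(self, max_running=MAX_RUNNING_JOBS, load_limit=None, read_load=None):
        self.max_running = max_running
        # Assumption: use the number of CPUs as the load limit by default.
        self.load_limit = load_limit if load_limit is not None else float(os.cpu_count() or 1)
        # 1-minute load average; os.getloadavg() is only available on Unix.
        self.read_load = read_load or (lambda: os.getloadavg()[0])
        self.running = []

    def reap(self):
        """Forget jobs that have finished in the meantime."""
        self.running = [job for job in self.running if not job.is_done()]

    def try_start(self, job):
        """Return True if the job was started, False if it has to wait."""
        self.reap()
        if len(self.running) >= self.max_running:
            return False  # limit reached: do not start more jobs
        if self.read_load() > self.load_limit:
            return False  # system too busy: reject for now
        job.start()
        self.running.append(job)
        return True


# Usage: 200 jobs arrive at once, but only 10 are started.
runner = ThrottledRunner(load_limit=4.0, read_load=lambda: 0.5)
jobs = [DemoJob() for _ in range(200)]
started = [job for job in jobs if runner.try_start(job)]
print(len(started))  # 10
```

Jobs that are turned away are not lost in this sketch; the caller is expected to offer them again on a later pass, at which point `reap()` will have freed the slots of finished jobs.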
The current fix may turn out to have a shortcoming in the future: if there are 10 long-running jobs, e.g. data imports, all other jobs, e.g. sending emails, are stalled. For now, however, numerous simultaneous jobs are rare, and long-running imports even more so.
Work sessions (2)
Start | End | Participant
2023-02-20T14:00:00 | 2023-02-20T18:15:53 | Robert Cerny
2023-02-20T21:00:04 | 2023-02-20T22:00:08 | Robert Cerny