We have been having issues with regressions run under our DRM setup with vManager, due to an exclusive lock error.
This makes the whole regression get stuck waiting, completely breaking our setup:
This seems to be an intermittent issue: sometimes it works and sometimes it does not, and it can happen to any user running regressions. Does anyone know what this issue could be related to?
My initial guess would be that the intermittent nature is caused by the scheduling/timing of when jobs start on the cluster. It looks as though your run script for each test may be trying to recompile the design into the same location, so while one simulation is running, no other simulation can start. This is because compilation takes an exclusive lock on the compile directory, to ensure other processes aren't writing to the same compiled database. If possible, compile once (in a pre_session_script) and run all the tests with -R to avoid recompiling. If you must recompile for every test, make sure you're not using a switch like -xmlibdirname or -xmlibdirpath that points the compile directory somewhere outside the auto-generated run directory.
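As a rough sketch, the compile-once flow could look something like this in the VSIF. Everything here (session/group names, paths, the compile.f file list, the test names, and the $ATTR(test_name) reference) is an illustrative assumption, not taken from your setup, and the exact attribute syntax may need adjusting for your vManager version:

```
session my_regression {
    top_dir: /proj/regress/results;
    // compile and elaborate once, before any test starts
    pre_session_script: "xrun -elaborate -f compile.f";
};

group smoke_tests {
    // each test only re-runs the existing snapshot (-R), no recompile
    run_script: "xrun -R +UVM_TESTNAME=$ATTR(test_name)";
    test t1 { };
    test t2 { };
};
```

The key point is that only the pre_session_script invokes compilation; every per-test run script uses -R, so no test ever needs the exclusive lock on the compile directory.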
Another (less likely) cause that sometimes happens is that one or more compute nodes on the cluster have NFS or file-locking daemon problems which prevent any locks from being created. Resolving that would need help from your IT team to check the affected nodes. One way to diagnose it, and eventually work around it, is to open the run view for the session, filter the runs on status==failed with a first failure description containing "exclusive lock", and group the results by hostname. This will show you how many hosts are problematic, and which ones. If you see only a small number of groups (hosts), you could filter those hosts out using runs_dispatch_parameters in the VSIF, to tell LSF not to send jobs to them (check the "bsub" command syntax for details of how to exclude hosts). If there are many groups with only a small number of runs each, we can assume it's not an IT issue and it's more likely a scripting error in your run script.
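For example, a sketch of excluding two hypothetical bad hosts via an LSF resource requirement string (the hostnames node17 and node23 are made up; check your site's bsub documentation for the exact select[] syntax supported):

```
// in the VSIF; node17 and node23 are hypothetical problem hosts
runs_dispatch_parameters: -R "select[hname != 'node17' && hname != 'node23']";
```

This only works around the broken nodes; your IT team would still need to fix the underlying NFS/lock daemon issue on them.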
If these hints don't lead you to a solution, may I suggest you log in to support.cadence.com and file a support ticket, so that one of the hotline team can schedule a screen-sharing call with you and we can look at exactly what's happening?
Thank you very much for your help.
The same regression workflow works successfully for every other user. We only see these errors when running as one specific user (which, permissions-wise, should be the same as any other user), so the scheduling/timing of the jobs should not be the root cause.
Looking into the logs, I've found this in the local_log:
After the timeout, it tries to kill the processes associated with this job, but somehow it is unable to kill one of them, and the job gets stuck in a loop trying to kill it: the same five kill lines repeat endlessly. If I remove the job manually (with qdel), the regression resumes and everything goes back to working as intended.
Do you know where this issue might be coming from or how to fix it?
Interesting, I have never seen that before. It would be useful to know what the commands associated with process IDs 1370, 1056 and 1039 were; for example, are those xrun commands?
Maybe next time it happens you could find that out before removing the job.
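A quick way to do that, next time the kill loop appears, is to ask ps for the full command line of each stuck PID. The PIDs below are the ones from your log; substitute whichever ones are reported at the time:

```shell
# PIDs taken from the local_log; replace with the ones reported next time
pids="1370 1056 1039"

# Print pid, parent pid, state and full command line for each PID that
# still exists. A state of 'D' (uninterruptible sleep, often NFS/IO)
# would explain why the kill loop can never get rid of the process.
for p in $pids; do
  ps -o pid=,ppid=,stat=,args= -p "$p" || echo "PID $p no longer exists"
done
```

If ps reports that the PIDs no longer exist, the kill loop is retrying processes that are already gone, which would point at the job cleanup logic rather than a truly unkillable process.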