
ocnxlRun never returns as some splits are hung forever

Tajinder over 11 years ago

I am using an OCEAN script to launch ADE XL simulations on Netbatch (a pool of machines). After setting up the testbench, I call ocnxlRun to launch the simulation, and several hundred splits are launched in parallel. Occasionally, because of machine or other issues, one of the splits never returns, so the entire job waits forever and we never get the results back. As a workaround, I typically ssh to the machine running the hung split and kill the virtuoso process. The split then restarts from scratch and the job finishes. We lose significant time on this, and lately we are seeing more than one hung split. Do you have any suggestions for dealing with this situation? Would setting runtimeout to, say, 2 hours direct the master process to kill any split that has been running for 2 hours and restart it? Any other recommendations?

This is the job setup I am using: 

ocnxlJobSetup( '(
        "ADEXL_NB_POOL" "pool_name"
        "ADEXL_NB_QSLOT" "qslot_name"
        "blockemail" "1"
        "configuretimeout" "-1"
        "distributionmethod" "NB interface (free)"
        "lingertimeout" "30"
        "maxjobs" "240"
        "name" "Netbatch"
        "preemptivestart" "1"
        "reconfigureimmediately" "1"
        "runtimeout" "-1"
        "showerrorwhenretrying" "0"
        "showoutputlogerror" "1"
        "startmaxjobsimmed" "1"
        "starttimeout" "432000"
) )

 

  • Tom Volden over 11 years ago

    Hi Tajinder,

    Based on the description, it sounds like the simulator is dying for some unknown reason without returning an exit code to the ICRP process (the virtuoso process on the remote machine that you are manually killing in order to restart the point). Unfortunately, because the ICRP process never receives a return value from the simulator, it assumes everything is fine and keeps waiting for the simulator to report that it has finished simulating the point. It waits forever, since the simulator has apparently crashed.

    You have identified the workaround that I would suggest: setting the simulation timeout to 1.5 to 2 times the expected time for a simulation to complete. This gives some leeway for slower or heavily loaded machines, but should still catch situations like the one you describe, at which point the timeout should try to kill the (non-existent) simulation and restart it.

     Regards,

    TOM
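
As a rough illustration of the suggestion above, the original job setup could be changed so that only runtimeout differs. This is a sketch, not a verified fix: it assumes runtimeout is given in seconds (as the 432000 value used for starttimeout suggests), and 7200 is just the 2-hour figure from the question. Pick a value 1.5 to 2 times your own expected per-point simulation time.

```skill
; Sketch only: same job policy as in the original post, with a finite run timeout.
; Assumption: runtimeout is expressed in seconds, so "7200" means 2 hours.
ocnxlJobSetup( '(
        "ADEXL_NB_POOL" "pool_name"
        "ADEXL_NB_QSLOT" "qslot_name"
        "blockemail" "1"
        "configuretimeout" "-1"
        "distributionmethod" "NB interface (free)"
        "lingertimeout" "30"
        "maxjobs" "240"
        "name" "Netbatch"
        "preemptivestart" "1"
        "reconfigureimmediately" "1"
        "runtimeout" "7200"        ; was "-1" (no timeout)
        "showerrorwhenretrying" "0"
        "showoutputlogerror" "1"
        "startmaxjobsimmed" "1"
        "starttimeout" "432000"
) )
```

Whether the master process actually kills and restarts a timed-out split depends on the tool version and the distribution method, so it is worth trying this on a small run first.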

  • Tajinder over 11 years ago
    Thanks Tom for taking the time to reply. I am not sure if runtimeout would work, but let me give it a shot.
  • Tajinder over 11 years ago
    In the following scenario, runtimeout didn't help. I specified a runtimeout of 6 hours.

    The machine running split #1536 crashed (runICRIP325). Netbatch promptly restarted the job on a different machine.

    However, the restarted job failed right away because the dead Netbatch job had not removed the Job325.log.cdslck file.
     >>>Failed to lock log file: <blah><blah>/Job325.log

    The parent job has been waiting for the last 24 hours and is STILL waiting.

    Suggestions?
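
One way to at least spot this condition is to list leftover lock files in the job log directory before or during a run. The SKILL sketch below is illustrative only: the procedure name findLockFiles is hypothetical, the log directory must be supplied by the caller (the path in the post is elided), and it assumes lock files end in .cdslck as in the Job325.log.cdslck example. It only lists files and never deletes anything, because a lock may belong to a job that is still alive.

```skill
; Hypothetical helper: return the full paths of all *.cdslck files in logDir.
; It does not remove anything; confirm the owning job is dead before acting on a lock.
procedure( findLockFiles(logDir)
    let( (locks)
        when( isDir(logDir)
            foreach( f getDirFiles(logDir)
                ; keep only names ending in ".cdslck"
                when( rexMatchp("\\.cdslck$" f)
                    locks = cons(strcat(logDir "/" f) locks)
                )
            )
        )
        locks
    )
)
```

Calling findLockFiles with the directory that holds the JobNNN.log files returns a list of lock file paths, or nil if there are none or the directory does not exist.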
