Jump to content

Recommended Posts

When using an external program with CAESES, there is one major requirement to that program: Once the executable is finished, all result files need to exist. This is true no matter whether the external software is called directly as a local application or through the SshResourceManager as a remote application.

 

Why is that so? CAESES needs to know when the external program is done writing its results, in order to be able to read in those results. So, what it does is to assume that once the executable that was called returns, all results are available.

 

What does that mean for me? In general you can safely assume that a program's executable will not return until the actual process is finished. However, there are some exceptions to that rule. One example for such a behavior (although it is highly unlikely that it will be used from within CAESES) is Firefox. Only the first instance of Firefox will behave in the required way. When starting a second instance of Firefox, the process will return right away, as it uses the original Firefox process to run.

 

Another example that is a lot more likely to be used from within CAESES is when trying to execute an external program through a grid engine, for example the Oracle Grid Engine (formerly Sun Grid Engine). In general, submitting jobs (i.e. requests to run a certain program) to a grid engine through a program like qsub is a non-blocking process. The job is added to the waiting queue of the grid engine and the submitting program will return right away. Additional tools (e.g. qstat) can then be used to monitor the state of the job. However, when using qsub or a similar program as the executable in CAESES, it will be recognized that the executable has returned and assumed that all result files are present, which is not the case.

In the example of the Oracle Grid Engine there is a way to tell qsub that it should not return until the job has finished. That execution mode is done by providing the "sync y" option to qsub (either on the command line or in the configuration file). Note that the Sun Grid Engine must be configured to allow blocking jobs (i.e. must allow to use the "sync y" option).

 

How can I find out if my executable is affected? Well, if you never experience the problem that CAESES complains about result files not being there, although they are when you look in the file browser, or that a computation finished although you know that the external program is still running, you don't need to worry about it.

If, however, you do experience one of those two problems, or are planning to use a grid engine like the Oracle Grid Engine, you can do the following to ensure that the described problem does no affect you:

Run the program from a terminal (cmd.exe on Windows). If the terminal returns to the command prompt right away although the program continues to run, you have identified the problem. As the Firefox example shows, even if it does not return right away, you may have to double check by starting a second instance of the program through another terminal.

 

If you have identified, that your external program indeed shows this behavior and is the problem why your integration does not work, one way to solve the problem, is to wrap the call to the executable in a shell script that stores the process ID of the started exectuable in a variable and periodically looks whether the PID still exists. Once it does not exist anymore, the shell script should exit. An alternative to a shell script would be to write a simple program using your favorite programming or scripting language.

Share this post


Link to post
Share on other sites

Hi,

 

I have connected the software using the sun grid engine and I have also used the 'sync y' in the argument as suggested but an error occurs.

 

The monitor displays the following :

 

 *** INFO FResultsGenInt : File read error [Failed to open file.....]

 

It looks like the files (from each job) are being written to the parent working directory and not the sub-directory (e.g. 01_Sobol).

 

In the input java macro files, there are no absolute paths supplied - only file names. This works correctly on the PC, but not on the cluster.

 

Is this problem anything to do with what you suggested above? Can you please shed light on the matter?

 

Regards,

 

Kurt

Share this post


Link to post
Share on other sites

Dear Richard,

I think you have to tell qsub the working directory in which it is supposed to put the output files. If I am not mistaken this can be done with the "-d" switch of qsub, but I am not sure. Please consult the qsub/SGE documentation.

Best regards
Arne
 

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×
×
  • Create New...