Britton Ward

Advice on Running FineMarine/Hexpress via SSHResourceManager


Hi,

 

After a few days of trying, I have successfully set up the SSHResourceManager with a Postgres DB back end. CAESES can communicate with the server, but when I attempt to execute a Hexpress job it fails. Does anyone have suggestions on a) what the application executable calls for Hexpress and FineMarine should look like in the SSHResourceManager, and b) custom canceller classes for the SSHResourceManager for these tools? I am running the solvers/meshers on Linux (CentOS, x64).

 

Thanks for any advice,

 

Britton Ward


To add on: I have verified that the input files are successfully transmitted to the host. The file _runRemoteProcess_.bat is written to the job directory, but it does not appear to execute correctly. If I SSH to that node and run ./_runRemoteProcess_.bat by hand, it launches Hexpress and executes the script as designed. Why would the command not execute correctly when run from the resource manager?


I think I managed to get my test running with some tweaks to the application path. It would still be nice to see a code example for the custom cancellers if anyone has one for FineMarine...


Hi Britton,

 

I don't think you need to worry about the canceler class. It is not application dependent; it depends solely on the operating system.

 

What the default implementation for Linux does is execute "pgrep -n <processname> -u <username>" right after starting the remote process in order to find the process ID of the started process. Then, once the kill is requested, it issues a "kill -9" for all child processes of that process ID and finally a "kill -9" for the process ID itself. This should work on all Linux operating systems. You can take a look at the file "<CAESES install dir>/tools/sshResourceManager/doc/javadoc/FJobCancelerLinux.java" to see the current implementation.
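
The sequence described above can be sketched in plain shell as follows. This is only an illustration of the described behaviour, not the shipped Java implementation: the function names find_newest_pid and kill_with_children are made up for this example, and the sketch only handles direct children of the started process.

```shell
#!/bin/sh
# Sketch of the default Linux cancel sequence described above.
# Function names are hypothetical; the real logic lives in FJobCancelerLinux.java.

# Print the PID of the newest process called NAME that belongs to USER.
find_newest_pid() {
    name=$1
    user=$2
    pgrep -n -u "$user" "$name"
}

# Send SIGKILL to the direct children of PID, then to PID itself.
kill_with_children() {
    pid=$1
    for child in $(pgrep -P "$pid"); do
        kill -9 "$child" 2>/dev/null || true
    done
    kill -9 "$pid"
}
```

For example, pid=$(find_newest_pid hexpressx86_64 "$USER") followed by kill_with_children "$pid" would mimic a cancel request for a running mesher.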

 

Best regards,

Arne


Arne,
 
Thank you for the response.
 
The issue with the canceller classes for both Hexpress and FineMarine is that the programs are launched by scripts or batch files, and as such the process name differs from the one entered into the SSH resource manager.
 
For instance:  hexpress is launched by the call /apps/numeca/hexpressmarine30_1  but the process is hexpressx86_64  
 
For the Fine/Marine solver things are more confusing, and I haven't managed to get it running at all.
 
Executing the Fine/Marine process is typically done using a batch file like the attached rather than the “….executable.exe -batch ..” syntax that is used for the mesher etc. It seems the SSH resource manager requires an executable path, but this batch file will be written by the FW to the design directory, so I am confused about how to get SSHRM to launch the job correctly. What syntax should go in the FineMarine application field on the SSHRM so that this batch file can be launched?

 

This relates to the questions on the canceller, as the process name isiscfdd is not the same as the app name used to launch the computations.

Typically the solver is cancelled by editing the stop.now file with echo -1 > stop.now, which signals the solver to shut down correctly. I guess this could be implemented in another canceller class.

A typical Linux batch file is as follows.

 

#!/bin/sh
#############################################
##                 SETTINGS                  
#############################################
NI_VERSIONS_DIR=/apps/numeca/finemarine23_2
 
export NI_VERSIONS_DIR
 
LD_LIBRARY_PATH=$NI_VERSIONS_DIR/LINUX/_mpi/_ompi1.4/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
 
#############################################
##  WORKING DIRECTORY
#############################################
WDIR=/cfd_work/share/PCT38_M3_T00_v5/PCT38_M3_T00_v5_r5
cd $WDIR
#############################################
##  LAUNCH part
#############################################
$NI_VERSIONS_DIR/LINUX/isis/hexpress2isis_no_interactivex86_64 -p=/cfd_work/share/PCT38_M3_T00_v5/PCT38_M3_T00_v5_r5 > hexpress2isis.log
$NI_VERSIONS_DIR/LINUX/isis/premetisx86_64 -npar=24 -sim=/cfd_work/share/PCT38_M3_T00_v5/PCT38_M3_T00_v5_r5/PCT38_M3_T00_v5_r5.sim -auto > premetis.log
$NI_VERSIONS_DIR/LINUX/_mpi/_ompi1.4/bin/mpirun -np 24 -machinefile /cfd_work/share/PCT38_M3_T00_v5/PCT38_M3_T00_v5_r5/machines.txt $NI_VERSIONS_DIR/LINUX/isis/isiscfdmpi_openmpix86_64 > /cfd_work/share/PCT38_M3_T00_v5/PCT38_M3_T00_v5_r5/PCT38_M3_T00_v5_r5.std
$NI_VERSIONS_DIR/LINUX/isis/isis2cfview_no_interactivex86_64 -p=/cfd_work/share/PCT38_M3_T00_v5/PCT38_M3_T00_v5_r5 > isis2cfview.log

 

 

If you were doing a local execution, then things like the machinefile and the number of processes can be entered in the FW, but for a resource manager launch I would think it best if the SSHRM could populate these fields. Is that possible?
 


Hi,

 

In general it should not be a problem that the actual executable is triggered by a script, as the default canceler class kills all child processes of the executable started (see the implementation I pointed to earlier). So in your case hexpressx86_64 is a child of /apps/numeca/hexpressmarine30_1 and should therefore be killed by the default implementation.

I will look into whether it is possible to implement a custom canceler that creates a stop.now file. First try would be to change the "getKillCommand(int pid)" method of the CancelerClass. E.g.:

@Override
public String getKillCommand(int pid) {
  return "echo -1 > stop.now";
}

However, I expect that the file will not be written into the correct directory. Maybe it's possible to find out the working directory in the getPIDCommand and store it inside the JobCanceler instance. I will try to implement it tomorrow and see whether it works.

 

 

The other case, where the script file is generated by FW, is not really straightforward. For security reasons, we have intentionally not implemented restoring the file flags (in this case the execute flag) of files transferred to the remote computers. So in this case you'd have to enter the path to the shell executable (e.g. /bin/bash) into the SshResourceManager and pass the filename of the shell script as part of the arguments of the FW computation (for example, the arguments would be "myGeneratedShellscript.sh"). The shell script needs to be written to the "input" directory of the computation so that it is transferred to the executing host (if it is configured in the SoftwareConnector, that is already the case).

In the end, the shell script will be located in the working directory of the job and the SshResourceManager will execute "/bin/bash myGeneratedShellscript.sh".
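
To illustrate why the interpreter goes into the application field: a transferred script arrives without its execute flag, so it cannot be started directly, but a shell can still read and run it when the file is passed as an argument. A minimal sketch, in which run_job_script is a hypothetical helper and /bin/sh stands in for whichever shell is configured:

```shell
#!/bin/sh
# Run SCRIPT from inside JOBDIR by handing it to the shell as an argument.
# This works even when the file has no execute permission.
# run_job_script is a made-up name for illustration only.
run_job_script() {
    jobdir=$1
    script=$2
    ( cd "$jobdir" && /bin/sh "$script" )
}
```

Calling run_job_script /path/to/jobdir myGeneratedShellscript.sh corresponds to what the resource manager would do with the shell as the application and the script name as the argument.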

 

Currently the SshResMan cannot generate hostfiles or similar inputs as they are very different depending on the grid engine used. It might be added in future versions.
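
As a possible workaround until then, the generated shell script could write its own machinefile before calling mpirun. The sketch below assumes the Open MPI hostfile format ("host slots=N", one host per line); write_machinefile, the host names and the slot count are all hypothetical and would have to be supplied by the user, since the SshResMan does not provide them.

```shell
#!/bin/sh
# Write an Open MPI style hostfile: one "host slots=N" line per host.
# write_machinefile and its arguments are hypothetical, for illustration only.
write_machinefile() {
    outfile=$1
    slots=$2
    shift 2
    : > "$outfile"                      # create or truncate the file
    for host in "$@"; do
        echo "$host slots=$slots" >> "$outfile"
    done
}
```

For example, write_machinefile machines.txt 12 node01 node02 would produce a two-line file that could then be passed to mpirun via -machinefile.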


Hi Britton,

 

I have attached a canceler class that creates a "stop.now" file in the working directory of the process.

 

However, there are some problems with it, to be honest. With the way the canceler usually works (i.e. "kill -9", or taskkill on Windows), the SshRM assumes that the process is actually gone afterwards, and the working directory is erased, since it is assumed that the user is not interested in the canceled job.

In your case it is a controlled shutdown of the remote process. So if FineMarine does not pick up on the existence of the stop.now file fast enough, the file will be deleted before the solver has noticed it. I will think about a way to overcome this; however, it is quite a special case, and in most cases the current implementation is what the user wants.

 

What the new attached canceler implementation will do is execute a

echo -1 > `pwdx PID | cut -f 2 -d ' '`/stop.now

where PID is replaced with the process id of the started process.
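
For anyone who wants to try this by hand on the remote host first, the same idea can be wrapped in a small shell function. This is a sketch, not the attached canceler: request_stop is a made-up name, and the fallback to /proc is an addition that is not part of the command above.

```shell
#!/bin/sh
# Write "-1" into stop.now in the working directory of process PID.
# request_stop is a hypothetical name; the pwdx call mirrors the command above.
request_stop() {
    pid=$1
    # pwdx prints "PID: /working/dir"; keep everything after the first space.
    workdir=$(pwdx "$pid" 2>/dev/null | cut -d ' ' -f 2-)
    # Fallback (an assumption, not in the original command): read the
    # working directory from /proc if pwdx produced nothing.
    if [ -z "$workdir" ]; then
        workdir=$(readlink "/proc/$pid/cwd") || return 1
    fi
    [ -d "$workdir" ] || return 1
    echo -1 > "$workdir/stop.now"
}
```

Using "-f 2-" instead of "-f 2" keeps working directories that contain spaces intact.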

To install it, extract the attached file into the "WEB-INF/classes/com/friendshipsystems" subdirectory of the SshResourceManager and add the line

cancelerToLoad1=com.friendshipsystems.farrdesign.FJobCancelerFineMarine

to the SshResourceManager.properties file. Restart the ResourceManager and you will be able to assign the new Canceler to the operating system.

 

Are you sure that the default implementation does not kill the child processes of the started shell script?

farrdesign.zip


The next release of the SshResourceManager will not delete the remote files until the remote process has actually finished. First, this should make your scenario work. Second, it is also a lot cleaner for the "general case", as it makes sure that resources are not marked as "free" before they are actually released, which avoids overusing CPUs or licenses on the remote hosts.

 

You can grab a snapshot of the new SshResourceManager here: https://www.friendship-systems.de/fcloud5/public.php?service=files&t=4a0fa8c6447b33f86cd0c39c5df71dd1

To perform the update, it is easiest to use the administration area of the ResourceManager's web interface.

