Running DiFX at the VLBA correlator (VLBA Specific)

This section gives instructions for using the NRAO-adapted DiFX to correlate data that has already been prepared for correlation by the VLBA hardware correlator. Some aspects of this section may still apply to correlation of other data. Currently no graphical user interface exists, so these instructions are command line only; they will change when the operator interface is complete. This section assumes that the software is properly installed and that environment variables are set appropriately, which will be the case for correlator operators. While there are several steps in performing the software correlation, none is especially complicated.

Note that this section will see considerable enhancements as experience in running the correlator is gained.

Software correlation based on vex files

With VLBA DiFX version 1.5 comes correlation based on the .vex files rather than the hardware correlator job scripts. This new path frees operations from a host of difficult-to-maintain software, including cjobgen and its associated software. The vex-based correlation was first documented in a memo titled “VLBA-DIFX Operations Plan” [opsplan]. Step-by-step instructions describing the process are repeated here. The particular case used as an example here is based on a complicated pulsar astrometry project. Most real-life examples will be simpler, but some may be more complex. Note that these instructions represent the expected way to proceed, but changes to the software architecture may introduce changes to some of these steps.

It should be kept in mind that all actions performed by the analysts are pass based, meaning that they act on one or more jobs at a time. Rarely will analysts have to worry about individual jobs or FITS files. The correlator operators, on the other hand, work entirely on a per-job basis. Commands to be issued by the analysts are preceded by an arrow ( $\longrightarrow$ ). In general, all files written in the processes that follow are readable and writable by everyone in the vlba_difx group. The exception is data sent to the archive staging area, which is readable and writable only by e2emgr.

First change to the project directory. Assume that the project is called BX123 and that it was observed in December 2009.

$\longrightarrow$ cd /home/vlbiobs/astronomy/dec09/bx123
Extract from the monitor database the Mark5 module logs, clock offsets and rates, and EOPs, making a new vex file called bx123.skd.obs and a file called bx123.skd.shelf. The original (schedule) vex file that was used during observation is never to be modified. In extreme cases, the new vex file created in this step, bx123.skd.obs, can be hand edited to reflect what actually happened during observation, but doing this should be extremely rare. This step locks in the EOP values that will be used for each job made for this project.

$\longrightarrow$ db2vex bx123.skd
Next form the template input file for vex2difx from the .oms file written by sched . This creates bx123.v2d .

$\longrightarrow$ oms2v2d bx123.oms
For simple experiments it is likely that the .v2d file created in the previous step can be used unmodified. For this complicated experiment changes will need to be made. Since this project requires four correlator passes, this .v2d file will need to be copied four times and each one edited to reflect the purpose of the correlator pass. Sophisticated VLBA users may provide their own set of .v2d files that might need light editing before use.

$\longrightarrow$ cp bx123.v2d clock.v2d

$\longrightarrow$ emacs clock.v2d
VLBA-DiFX .input files are generated at this point using vex2difx. By design, vex2difx has no options associated with it – it is entirely configured through the .v2d files. In the case below, the files clock_1.input, clock_1.calc, and clock_1.flag will be created. This command will also make a file called clock.joblist that lists each job created for this correlator pass with a summary of the job properties, such as start and stop times and number of stations.

$\longrightarrow$ vex2difx clock.v2d
If the correlator jobs created above are deemed ready to run, they are sent to the correlator queue. In this process three things will occur: 1. CALC will be run to generate the correlator delay models needed for correlation, 2. the .input files generated by vex2difx will be copied to the software correlator run directory, and 3. the VLBA database will be told that the jobs are ready. At this time, a priority can be assigned to the jobs being sent to the correlator, making them appear at the top of the queue. Otherwise the jobs in the queue will appear in observe time order. In the example below, the option -p 1 indicates that these jobs should run with elevated priority. Supplying clock with no suffix implies queuing all the jobs in the clock pass. Individual jobs could be queued by specifying a list of .input files.

$\longrightarrow$ difxqueue -p 1 add clock
When the jobs are complete, which can be determined with difxqueue using the list action, the correlator output is converted to FITS format. Data “sniffing” happens automatically during this step. The command to do this will ensure that all of the jobs in the pass have been successfully correlated. Note that the number of FITS files created is not necessarily the same as the number of correlator jobs. A file called clock.fitslist will be generated in this step; it lists all of the FITS files that are part of this correlator pass and, for each FITS file, the jobs that contributed to it. The program makefits will use program difx2fits to do the actual conversion.

$\longrightarrow$ makefits clock.joblist
The sniffer output files are at this point inspected. Program difxsniff is run to produce plots which are identical to those produced by sniffer today. Multiple reference antennas (in this example, Los Alamos and Kitt Peak) can be provided at the same time. Sniffer plots and the data that is used to generate them will be placed in a sub-directory of the project directory called sniffer/clock for a pass called “clock”.

$\longrightarrow$ difxsniff LA KP clock.fitslist

$\longrightarrow$ gv sniffer/clock/apdfile.ps
If the FITS files are deemed acceptable, they are entered into the VLBA data archive.

$\longrightarrow$ difxarch clock.fitslist
Finally, once data has been archived, the intermediate files should be removed from the head node:

$\longrightarrow$ difxclean bx123

Note that this final step is needed only after the entire project is ready for releasing and should not be done every time between completion of job passes.
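
The per-pass command sequence above can be summarized with a small dry-run helper. The following is only a sketch in Python: pass_commands is a hypothetical helper, not part of DiFX, and it merely builds the command strings shown in the walkthrough without executing anything. The hand-editing step and the once-per-project difxclean step are deliberately left out, and omitting -p when no priority is requested is an assumption.

```python
def pass_commands(project, pass_name, ref_antennas, priority=None):
    """Return the analyst commands for one correlator pass, in order.

    Hypothetical helper: it only assembles the command strings used in the
    walkthrough and never runs them.
    """
    queue = ["difxqueue"]
    if priority is not None:
        # Assumption: -p is simply left out when no priority is requested.
        queue += ["-p", str(priority)]
    queue += ["add", pass_name]
    return [
        f"db2vex {project}.skd",              # once per project
        f"oms2v2d {project}.oms",             # once per project
        f"cp {project}.v2d {pass_name}.v2d",  # then edit the copy by hand
        f"vex2difx {pass_name}.v2d",
        " ".join(queue),
        f"makefits {pass_name}.joblist",
        f"difxsniff {' '.join(ref_antennas)} {pass_name}.fitslist",
        f"difxarch {pass_name}.fitslist",
    ]


for command in pass_commands("bx123", "clock", ["LA", "KP"], priority=1):
    print(command)
```

Printing the commands rather than running them keeps the inspection steps (checking the sniffer plots, deciding whether the FITS files are acceptable) in the analyst's hands.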

Directory contains undecoded scans

Occasionally a message of the form “Module module directory contains undecoded scans!” will appear. This means that one or more scans for the module named module were not properly read or decoded. This should first be verified by examining the module directory file, called $MARK5_DIR_PATH/module.dir . This examination is best done with program checkdir, which looks for a number of possible abnormalities. Rows where the eleventh column contains a negative number are scans that were not decoded properly. A known problem occasionally causes many consecutive scans at the end to be improperly read and thus undecoded. If this is the case, rename the existing directory file and try reacquiring the directory. Usually it will start working immediately.

If one or a small number of scans repeatedly cannot be decoded, the scans may be corrupted for some reason. In this case, simply delete the row(s) from the directory file and then decrement the number following the module name on the first line of the file by the number of scans deleted; this count of the number of scans listed in the file must remain accurate. This operation will cause the correlator to skip over the affected scans and data will be lost, so use appropriate judgement in these cases.
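
The hand edit described above can be sketched in Python. This is a sketch under stated assumptions, not a DiFX tool: drop_undecoded is a hypothetical helper, the file is assumed to be whitespace-separated with the scan count as the second field of the first line, and the sample rows are invented purely for illustration. Always keep a copy of the original directory file before applying anything like this.

```python
def drop_undecoded(lines):
    """Remove undecoded scan rows and correct the scan count in the header.

    Assumptions (from the text, not from a format specification): the first
    line starts with the module name followed by the number of scans, and a
    row is undecoded when its eleventh whitespace-separated column is
    negative. Note that the header is rejoined with single spaces.
    """
    header, rows = lines[0].split(), lines[1:]
    kept = []
    for row in rows:
        fields = row.split()
        undecoded = len(fields) >= 11 and float(fields[10]) < 0
        if not undecoded:
            kept.append(row)
    removed = len(rows) - len(kept)
    header[1] = str(int(header[1]) - removed)
    return [" ".join(header)] + kept


# Invented sample content; real directory files will look different.
sample = [
    "EXAMPLE1 3 A",
    "0 100 1 2 3 4 5 6 7 8 3 scanA",
    "100 100 1 2 3 4 5 6 7 8 -1 scanB",
    "200 100 1 2 3 4 5 6 7 8 3 scanC",
]
for line in drop_undecoded(sample):
    print(line)
```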

Directory read fails on partial module

Modules containing fewer than 8 working disks can be problematic. It is suggested that modules of this type have their directories read preemptively using a special command:

$\longrightarrow$ mk5control safedirA 12

which is the command to safely read the directory of the module in bank A of mark5fx12.

Mark5 unit hangs while reading directory

Typically the first thing one should do if a hang occurs is to try again. For directory reading this can be attempted with the mk5control program. For instance, if the module in bank A of mark5fx12 hangs during the directory read, stop the correlation process with the DOI or via stopmpifxcorr and then issue:

$\longrightarrow$ mk5control getdirA 12

If this also fails, or never starts, reboot the unit via the DOI or

$\longrightarrow$ mk5control reboot 12

or

$\longrightarrow$ ssh 12 /sbin/reboot

if it really refuses to reboot.

Once the unit comes back, try retrieving the directory again.

Mark5 directory reading fails partway through

When the GUI button GetDir fails, the program mk5dir can be used directly to read a module directory.

Things to try first:

Log into the fx unit and run vsn to look for obvious module problems
Move the module
Erase (or move/rename) the preexisting directory file
Reboot the correlator Mark5 unit

When GetDir fails or crashes the Mark5 unit, it is likely because there are one or more spots on the module that can’t be read. Using mk5dir, you can read most of the directory while skipping any problematic scans.

The mk5dir program will work on both Mark5A and Mark5C modules. It is best to put them, respectively, in SDK 8 and 9 units. As with most utilities, typing mk5dir by itself will print help information.

The output directory will be named the usual vsn.dir and will overwrite any existing file of that name. It will be written to $MARK5_DIR_PATH, which is the same place to which GetDir writes directories.

The relevant options in this case are:

-f (force a directory read even if a file already exists)
-v (be verbose)
-e scan number (stop reading the directory at a certain scan number)
-b scan number (begin reading the directory at a certain scan number)

The scan numbering is worth noting. The command line options -e and -b number the scans starting at 1 (the first scan is 1). But the on-screen output of mk5dir will begin with a scan numbered 0.

The first step is to read as much of the directory as possible before the first problem scan.

Log in to the fx unit
Run vsn bank to get an overview of the health of the module
Run mk5dir -f -v bank

As it reads each scan, it will print a line indicating its progress.

0/228 -> 3 Decoded

1/228 -> 3 Decoded

etc …

The first number is the scan it just decoded and the second number is the total number of scans on the module. The “3” is related to the data format and should be 3 for Mark5C at all VLBA sites. VLA modules will show “4”. Legacy VLBA and foreign stations may show other numbers. The “Decoded” indicates success, as opposed to something like “XLR Read Error”.

Presumably, it will fail at some point. When it fails it probably won’t write an output directory file. Note which scan it failed to read. Remember that scan $x/228$ is actually the $(x+1)^{\mathrm{th}}$ scan because it started counting at zero. You may have to reboot the fx unit at this point.

Now run mk5dir again, this time stopping one scan previous to the one where it died last time.

Run vsn bank (whether you had to reboot or not, checking the module with vsn is probably a good idea)
Run mk5dir -f -v -e $x$ bank (mk5dir will stop once it has read $x$ scans)
Rename the output file so it doesn’t get overwritten.

Now we can skip past the bad parts and read the rest of the directory.

Run mk5dir -f -v -b $x$+2 bank (mk5dir will start with the scan after the one it originally failed on)

If it fails, reboot as necessary, run vsn again, and try starting with scan $x$+3 instead. Keep incrementing the start scan until it works; sometimes it might be faster to try a bigger jump, and on success work backwards to find where failing starts. Eventually, you will have skipped past the bad scans and read the rest of the directory. There are likely to only be one or two bad scans, so this step should actually be fairly simple.
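
Because the on-screen output counts scans from 0 while -e and -b count from 1, the arithmetic is easy to get wrong. The Python sketch below works it through; first_failure and retry_arguments are hypothetical helpers, and the exact text of a failing progress line (here “XLR Read Error” in place of “Decoded”) is an assumption based on the description above. If mk5dir hangs without printing an error, the failed scan is the one after the last “Decoded” line.

```python
import re

# Matches progress lines such as "0/228 -> 3 Decoded".
PROGRESS = re.compile(r"^\s*(\d+)/(\d+)\s*->\s*(\d+)\s+(.*)$")


def first_failure(output_lines):
    """Return the 0-based number of the first scan not reported as Decoded."""
    for line in output_lines:
        match = PROGRESS.match(line)
        if match and match.group(4).strip() != "Decoded":
            return int(match.group(1))
    return None


def retry_arguments(failed_scan):
    """Map the 0-based failed scan x to the 1-based -e and -b values.

    -e x     stops after the scan just before the failed one
    -b x + 2 starts with the scan just after the failed one
    """
    return {"end": failed_scan, "begin": failed_scan + 2}


# Invented output; the wording of the failure line is an assumption.
output = [
    "0/228 -> 3 Decoded",
    "1/228 -> 3 Decoded",
    "2/228 -> 3 XLR Read Error",
]
x = first_failure(output)
print(x, retry_arguments(x))
```

If the read starting at $x$+2 also fails, increase the begin value as described above and leave the end value unchanged.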

Now you will have two directory files: the renamed file with the first $x$ scans, and the second file with all the scans after the bad scan(s). Each file will also have entries for the scans that weren’t read by that particular invocation of mk5dir, and these lines need to be deleted. Then, using your preferred Linux commands and/or text editor, combine the two files in the appropriate order. Make sure there is a single header line at the top which lists the correct number of scans.

If there are nonsequential bad scans you will have to concatenate three or more files, but the steps remain the same.

Mark5 unit hung

Unfortunately, there are still some instabilities with Mark5 units that result in various kinds of hangs; some units appear more sensitive than others. Often a failed Mark5 can be identified with the last few lines of error messages output from mpifxcorr. To verify, first attempt to ssh into that unit. If that is successful, try watching the output of cpumon and mk5mon (or the equivalent from the DOI). If no updates come from cpumon then it is likely that the computer has seized and requires a hard reboot. Otherwise if mk5mon shows no updates, the problem is likely with the Streamstor card and/or disk module. If logging into the Mark5 unit works, try resetting the Streamstor card with:

mk5control reset unitNumber

where unitNumber is, for example, 07 for mark5fx07 or 23 for mark5fx23. The Mark5 state shown in mk5mon should change to “Resetting”. If it does not, then it is likely a reboot is needed.

If none of the above works, try rebooting the particular Mark5 unit and starting over. Note: as currently configured, a Mark5 unit will restart the Mark5A program upon boot, so you will need to use mk5take to stop that before attempting software correlation on that unit again. Make sure to give the Mark5 unit enough time to initialize the Mark5A program before running mk5take (i.e., wait for module lights to cycle).

A possibly more reliable way to identify a hung Mark5 unit is to start a new instance of mk5mon (§[sec:mk5mon]) in a terminal and issue the following command:

mk5control getvsn mark5

A hung Mark5 will not show up in the list of units.

Module moved

If a required module has been removed or moved since genmachines was run, mpifxcorr will not be able to correlate. In this case DiFX will fail, spitting out a substantial amount of debug information. You can try again by running genmachines baseFilename.input to force the recreation of the .machines file. If this program fails, it will report an error that may aid in diagnostics. Note that this scenario will not happen if the DiFX Operator Interface or startdifx (§[sec:startdifx]) is used to run the correlator.

Correlating data files

The operating instructions up to this point have focused on correlation directly off Mark5 modules. Correlation off files is also supported, as is a mixed mode where files and modules are correlated together. The scripts described in this document don’t (to date) make correlation of files easy, but it is possible to do so by hand editing files. It is expected that enhancements to the scripts will make correlation from files much easier in future versions of DiFX. Two files will need manipulation: .input and .machines. In the .input file, every entry in the DATASTREAM table that corresponds to a disk file needs the DATA SOURCE value changed from MODULE to FILE. The .machines file will likely have to be constructed completely by hand. See §[sec:machines] for a detailed description of the format of that file. Note that it is no longer necessary for the data files to be visible to all cluster computers – they can reside on local drives that are not exported, including USB or Firewire drives, but this requires that the datastream nodes listed in the .machines file be in the order in which the antennas are listed in the .input file.

Note: you must use the -n option to startdifx when starting the correlation or the hand-edited .machines file will be overwritten.

The VLBA database

Many of the existing VLBA tools (such as the Observation Management System (OMS), mon2db, cjobgen, and others) make use of an Oracle database for persistent storage of various information related to projects that use either the VLBA antennas or correlator. Many aspects of VLBA-DiFX are not a good match for the existing database tables; adapting the existing tables to work nicely with VLBA-DiFX will be disruptive and have implications for much existing code, including software that will not be needed once FXCORR is shut down. The proposed solution to this dilemma is to use a parallel set of database tables for correlation and archiving when using VLBA-DiFX. The use of existing software for generation of FXCORR jobs will continue unchanged. For projects to be correlated using VLBA-DiFX, OMS will still be used for observation preparation tasks, but will not be used in preparation of correlation or anything that occurs beyond that in the project’s life cycle. Instead, vex2difx will be used to generate jobs, difxqueue will be used in lieu of OMS to stage correlator jobs, and difxarch will be used in the archiving of data. The queuing tool difxqueue will be used to display the state of the VLBA-DiFX job queue as well as populate it. The new tools will access two new database tables, DIFXQUEUE and DIFXLOG; the contents of these tables are shown in Tables [tab:difxqueue] and [tab:difxlog].

FXQUEUE table currently used by OMS. Entries to this table will be initially made by difxqueue. The STATUS field will be automatically updated as appropriate during correlation. [tab:difxqueue]

Column	Type	Comment
PROPOSAL	VARCHAR2(10)	The proposal code
SEGMENT	VARCHAR2(2)	Segment (epoch) of proposal, or blank
JOB_PASS	VARCHAR2(32)	Name of correlator pass (e.g. “geodesy”)
JOB_NUMBER	INT	Number of job in the pass
PRIORITY	INT	Number indicating the priority of the job in the queue; 1 is highest
JOB_START	DATE	Observe time of job start
JOB_STOP	DATE	Observe time of job stop
SPEEDUP	FLOAT	Estimated speed-up factor for job
INPUT_FILE	VARCHAR2(512)	Full path of the VLBA-DiFX input file
STATUS	VARCHAR2(32)	Status of the job, perhaps “QUEUED”, “KILLED”, “RUNNING”, “FAILED”, “UNKNOWN” or “COMPLETE”
NUM_ANT	INT	Number of antennas in the job
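
The queue ordering described for difxqueue (jobs with elevated priority first, otherwise observe-time order) can be illustrated with the PRIORITY and JOB_START columns. This is only a sketch in Python of one plausible reading: queue_order is a hypothetical helper, the actual ordering logic used by difxqueue is not documented here, and the default priority value of 2 and the job times in the sample rows are invented.

```python
from datetime import datetime


def queue_order(jobs):
    """Sort jobs by PRIORITY (1 is highest), then by observe start time."""
    return sorted(jobs, key=lambda job: (job["PRIORITY"], job["JOB_START"]))


# Invented rows using a subset of the DIFXQUEUE columns.
jobs = [
    {"JOB_PASS": "clock", "JOB_NUMBER": 1, "PRIORITY": 2,
     "JOB_START": datetime(2009, 12, 5, 4, 0)},
    {"JOB_PASS": "clock", "JOB_NUMBER": 2, "PRIORITY": 2,
     "JOB_START": datetime(2009, 12, 5, 2, 0)},
    {"JOB_PASS": "clock", "JOB_NUMBER": 3, "PRIORITY": 1,
     "JOB_START": datetime(2009, 12, 5, 6, 0)},
]
for job in queue_order(jobs):
    print(job["JOB_NUMBER"], job["PRIORITY"], job["JOB_START"])
```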

FXLOG table currently used by OMS. A row will be written to this table after each successful correlation by the DiFX Operator Interface. [tab:difxlog]

Column	Type	Comment
PROPOSAL	VARCHAR2(10)	The proposal code
SEGMENT	VARCHAR2(2)	Segment (epoch) of proposal, or blank
JOB_PASS	VARCHAR2(32)	Name of correlator pass (e.g. “geodesy”)
JOB_NUMBER	INT	Number of job in the pass
CORR_START	DATE	Start time/date of correlation
CORR_STOP	DATE	Stop time/date of correlation
SPEEDUP	FLOAT	Measured speed-up factor
INPUT_FILE	VARCHAR2(512)	File name of .input file
OUTPUT_FILE	VARCHAR2(512)	File name of correlator output
OUTPUT_SIZE	INT	Size (in $10^6$ bytes) of correlator output
CORR_STATUS	VARCHAR2(32)	Status of correlation, typically “COMPLETED”

Archiving

Archiving of VLBA-DiFX data will be done on a per-pass basis. All .FITS files associated with a single correlator pass will be archived together. A particular staging directory for VLBA-DiFX data has been set up. Populating the archive amounts to first copying the files to be archived to this directory making sure that the first character of the file name is “.”. Once the entire file is transferred this file is renamed without the leading period. This system is the standard way to populate the Next Generation Archive System (NGAS) [ngas] without potential for an incompletely copied file to be archived. The file names will be composed only of alpha-numeric characters and “.” and “_”. These characters have no special meaning in any relevant software, including http, XML, bash/Linux command lines, the Oracle database parser, etc. File names will have the following format:

VLBA_projectCode_passName_fileNum_corrDateTcorrTime.idifits

where the italicized fields, which themselves will be limited to alphanumeric characters, are as follows:

Field	Type	Comment
projectCode	string	Project code, including segment if appropriate
passName	string	Name of the pass, as set in the `.v2d` file
fileNum	integer	FITS file sequence number within pass
corrDate	date (yymmdd)	Date corresponding to correlation completion
corrTime	time (hhmmss)	Time corresponding to correlation completion

Parameter fileNum is the sequence number of the created .FITS file which may or may not have a direct correspondence with the job sequence number within the correlator pass. An example archive file name relevant to the sample project used in this memo may be:

VLBA_BX123_clock_1_091223T032133.idifits
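
The naming rule can be expressed compactly in Python. This is a minimal sketch, assuming the fields are exactly as listed in the table above; archive_name is a hypothetical helper and not the code used by difxarch.

```python
import re
from datetime import datetime

# Fields are limited to alphanumeric characters.
FIELD = re.compile(r"^[A-Za-z0-9]+$")


def archive_name(project_code, pass_name, file_num, corr_done):
    """Build VLBA_projectCode_passName_fileNum_corrDateTcorrTime.idifits."""
    for value in (project_code, pass_name):
        if not FIELD.match(value):
            raise ValueError(f"field must be alphanumeric: {value!r}")
    # corrDate is yymmdd and corrTime is hhmmss, joined by a literal "T".
    stamp = corr_done.strftime("%y%m%dT%H%M%S")
    return f"VLBA_{project_code}_{pass_name}_{int(file_num)}_{stamp}.idifits"


print(archive_name("BX123", "clock", 1, datetime(2009, 12, 23, 3, 21, 33)))
# prints VLBA_BX123_clock_1_091223T032133.idifits
```

Rejecting non-alphanumeric fields up front keeps “_” and “.” unambiguous as separators when the name is later split back into its parts.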

All files produced for a given pass will be placed in a single directory,

$DIFX_ARCHIVE_ROOT/projectCode/passName

where DIFX_ARCHIVE_ROOT is an environment variable pointing to the head of the archive staging area for VLBA-DiFX. During the transfer to the archive, the projectCode portion of the directory tree will begin with a period, which is removed by renaming once all files are completely copied. This will allow the archive loader to logically group together all the files of the pass. If needed, an index file listing the association of archive .FITS files and correlator jobs can also be placed in this directory; the .fitslist file produced by difx2fits would serve this purpose. In order to ensure the atomic nature of correlator passes in the archive, the renaming of the copied files from the temporary versions starting with “.” will not occur until all archive files are transferred. An archive loader will periodically (initially about every 30 minutes, but perhaps later with much shorter intervals) look for new files in the archive staging area to store. The archive data will be available moments later for users wanting to download the data.
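
The copy-then-rename pattern used for staging can be sketched as follows. This Python example is illustrative only: stage_pass is a hypothetical helper, it works at the file level within one pass directory, and it ignores the e2emgr ownership handling that the real staging process requires. The demonstration runs entirely inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def stage_pass(fits_files, staging_dir):
    """Copy every file under a hidden name, then rename them all at the end."""
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)

    hidden = []
    for source in map(Path, fits_files):
        temp = staging_dir / ("." + source.name)  # leading "." hides the file
        shutil.copyfile(source, temp)
        hidden.append(temp)

    # Only after every copy has finished are the files made visible.
    final = []
    for temp in hidden:
        target = temp.with_name(temp.name[1:])
        temp.rename(target)
        final.append(target)
    return final


with tempfile.TemporaryDirectory() as work:
    source = Path(work) / "VLBA_BX123_clock_1_091223T032133.idifits"
    source.write_bytes(b"example data")
    staged = stage_pass([source], Path(work) / "staging" / "BX123" / "clock")
    print([path.name for path in staged])
```

A rename within one filesystem does not copy data, so a loader scanning the directory should see either no visible file or a complete one.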

Ownership and permissions of files

This section describes how the different user accounts interact within the DiFX correlation process. This is VLBA operations specific. No fewer than 4 user accounts are used in the life cycle of a DiFX project:

analysts: The analysts account is used to prepare the .input and other files used to run DiFX. These files retain analysts ownership in the /home/vlbiobs/astronomy area and while queued.
root: The root account owns the mk5daemon processes that run on each of the Mark5 and software correlator computers.
difx: User account difx is used to actually run mpifxcorr (and difxlog) during correlation. This is important, as the difx account has all of the proper environment variables set up. The DiFX Operator Interface may also be run as user difx.
e2emgr: The data archive requires that the files staged for entry into the archive be given e2emgr ownership. This is accomplished with the program e2ecopy, which runs with root privileges and hence can change its userid to e2emgr as needed.

By default, all files will have owner and group read/write permission and global read permission.

Copying baseband data

Some users wish to perform their own analysis on baseband VLBI data. This section describes the procedure for copying data and ends with guidelines that should be sent to each user requesting data to be copied. The guidelines are there to streamline the process and to minimize the chance of problems.

Performing the data copy

FIXME : write me!

Guidelines to users

Users wishing to retain a copy of the baseband data should make sure to conform to the following guidelines. Please note that many of the instructions below are there to ensure that root access is not required in the copying of the data. If root access is needed as a result of failure to comply, a delay in the copying will be incurred.

Arrangement for data copy should be made prior to observing.
The external disk must have a working USB2 connector for data transfer. We specifically do not support Firewire at this time. We do not support power-over-USB drives.
Each disk should have a sticker label attached with the owner’s name, institution, phone number, shipping address and email. The project code should be clearly marked as well. If multiple disks are shipped, each should have a unique serial number clearly labeled on the disk.
The external disk must be preformatted with a standard Linux filesystem (ext2 or ext3). It is preferred that the options -m 0 -T largefile4 are used with mke2fs, and it is also suggested that the -L option is used to specify a volume name/number that matches the labelled serial number.
It is the user’s responsibility to ensure that sufficient post-formatted capacity is available on the disks.
A world-writable root-level directory with the name of each project to be copied should be made. The directory name should contain only uppercase letters and numbers.
The disk should be empty upon delivery. NRAO will not be responsible for data that is deleted. The root-level directory(s) mentioned above should be the only exception(s).
Include the power transformer/cable and USB cable with the disk. It is recommended that owner labels be attached to each of these.
Note to foreign users: please ensure that the power supply has either a NEMA 1-15 / Type A ungrounded power connector or a NEMA 5-15 / Type B grounded power connector and that the power supply works on 110V/60Hz. If this is accomplished using an adapter, it must be included in the box with the disk.
Foreign users will be responsible for customs charges and are encouraged to contact NRAO well in advance of any shipment to minimize cost.

Some comments on channels

This section discusses the accountability of channel identification through the entire DiFX system. While much of this discussion will not be of use outside NRAO, the terminology discussed here might help explain other portions of this document. The subject of this section is baseband channels, not individual frequency channels (spectral points) that the correlator produces from the baseband channels.

Baseband channels are individual digital data streams containing a time-series of sampled voltages representing data from a particular portion of the spectrum from one polarization. Each baseband channel is assigned a recorder channel number. For a given baseband data format (i.e., VLBA, Mark4, Mark5B, …) a particular recorder channel number is assigned to a fixed number of tracks or bitstreams. This mapping is contained in the track row in the format table of the .fx job script and can be different for each antenna. This mapping is also reflected in the .input file in the datastream table.

DiFX correlates baseband channels from multiple antennas to produce visibilities. For each correlated baseline, one or two basebands from one telescope will be correlated against one or two basebands of another, resulting in up to four products for a particular sub-band. This is to allow full polarization correlation. Each sub-band (called an IF in AIPS) is given a sub-band number; in general 1 or 2 recorder channels map to each sub-band. Note that an observation can simultaneously observe some sub-bands consisting of only one baseband and some with two basebands. In cases such as this the matrix containing the visibility products on a particular baseline will be large enough in each dimension (i.e., polarization product, sub-band) to contain all of the results, even if this consumes more storage than necessary; flags are written that invalidate portions of the visibility matrix that are not produced by the correlator.
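
The idea of a full-sized visibility matrix with flags for missing products can be illustrated with a short Python sketch. It is a conceptual illustration only: product_flags is a hypothetical helper, it assumes both antennas of the baseline record the same polarizations in each sub-band, and it does not reflect the actual storage layout used by DiFX or the FITS files.

```python
def product_flags(subbands):
    """Return, per sub-band, which polarization products are valid.

    subbands is a list with one entry per sub-band, each entry listing the
    polarizations recorded for that sub-band, e.g. [["R", "L"], ["R"]].
    Every sub-band gets the same set of product slots; slots the
    correlator cannot fill are marked False (that is, flagged).
    """
    polarizations = sorted({pol for subband in subbands for pol in subband})
    products = [a + b for a in polarizations for b in polarizations]
    return [
        {prod: (prod[0] in subband and prod[1] in subband) for prod in products}
        for subband in subbands
    ]


# One sub-band with two basebands (R and L) and one with a single baseband.
for number, flags in enumerate(product_flags([["R", "L"], ["R"]]), start=1):
    print(number, flags)
```

In this example the first sub-band yields all four products, while the second yields only RR and has its other three slots flagged.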