Running DiFX at the VLBA correlator (VLBA Specific)
This section gives instructions for using the NRAO-adapted DiFX to correlate data that has already been prepared for correlation by the VLBA hardware correlator. Some aspects of this section may still apply to correlation of other data. Currently no graphical user interface exists, so these instructions are command line only; they will change when the operator interface is complete. This section assumes that the software is properly installed and that environment variables are set appropriately, which will be the case for correlator operators. While software correlation involves several steps, none of them is complicated.
Note that this section will see considerable enhancements as experience in running the correlator is gained.
Software correlation based on vex files
With VLBA DiFX version 1.5 comes correlation based on the .vex files
rather than the hardware correlator job scripts. This new path frees
operations from a host of difficult-to-maintain software, including
cjobgen and its associated software. The vex-based correlation was
first documented in a memo titled “VLBA-DIFX Operations Plan” [opsplan].
Step-by-step instructions describing the process are repeated here. The
particular case being exemplified here is based on a complicated pulsar
astrometry project. Most real-life examples will be simpler, but some
may be more complex. Note that these instructions represent the expected
way to proceed, but changes to the software architecture may introduce
changes to some of these steps.
It should be kept in mind that all actions performed by the analysts
will be pass based, which means they act on one or more jobs at a time.
Rarely will analysts have to worry about individual jobs or FITS files.
The correlator operators, on the other hand, work entirely on a per-job basis.
Commands to be issued by the analysts are preceded by an arrow (
\(\longrightarrow\) ). In general, all files written in the
processes that follow are readable and writable by everyone in the
vlba_difx group. The exception is data sent to the archive staging
area, which is readable and writable only by e2emgr.
First change to the project directory. Assume that the project is called BX123 and that it was observed in December 2009.
\(\longrightarrow\) cd /home/vlbiobs/astronomy/dec09/bx123
Extract from the monitor database the Mark5 module logs, clock offsets and rates, and EOPs, making a new vex file called bx123.skd.obs and a file called bx123.skd.shelf. The original (schedule) vex file that was used during observation is never to be modified. In extreme cases, the new vex file being created in this step, bx123.skd.obs, can be hand edited to reflect what actually happened during observation, but doing this should be extremely rare. This step locks in the EOP values that will be used for each job made for this project.
\(\longrightarrow\) db2vex bx123.skd
Next form the template input file for vex2difx from the .oms file written by sched. This creates bx123.v2d.
\(\longrightarrow\) oms2v2d bx123.oms
For simple experiments it is likely that the .v2d file created in the previous step can be used unmodified. For this complicated experiment changes will need to be made. Since this project requires four correlator passes, this .v2d file will need to be copied four times and each one edited to reflect the purpose of the correlator pass. Sophisticated VLBA users may provide their own set of .v2d files that might need light editing before use.
\(\longrightarrow\) cp bx123.v2d clock.v2d
\(\longrightarrow\) emacs clock.v2d
VLBA-DiFX .input files are generated at this point using vex2difx. By design, vex2difx has no options associated with it – it is entirely configured through the .v2d files. In the case below, the files clock_1.input, clock_1.calc, and clock_1.flag will be created. This command will also make a file called clock.joblist that lists each job created for this correlator pass with a summary of the job properties, such as start and stop times and number of stations.
\(\longrightarrow\) vex2difx clock.v2d
If the correlator jobs created above are deemed ready to run, they are sent to the correlator queue. In this process three things will occur: 1. CALC will be run to generate the correlator delay models needed for correlation, 2. the .input files generated by vex2difx will be copied to the software correlator run directory, and 3. the VLBA database will be told that the jobs are ready. At this time, a priority can be set to the jobs being sent to the correlator, making them appear at the top of the queue. Otherwise the jobs in the queue will appear in observe time order. In the example below, the option -p 1 indicates that this job should run with elevated priority. Supplying clock with no prefix implies queuing all the jobs in the clock pass. Individual jobs could be queued by specifying a list of .input files.
\(\longrightarrow\) difxqueue -p 1 add clock
When the jobs are complete, which can be determined with difxqueue using the list action, the correlator output is converted to FITS format. Data “sniffing” happens automatically during this step. The command to do this will ensure that all of the jobs in the pass have been successfully correlated. Note that the number of FITS files created is not necessarily the same as the number of correlator jobs. A file called clock.fitslist will be generated in this step that lists all of the fits files that are part of this correlator pass including for each FITS file a list of the jobs that contributed to that FITS file. The program makefits will use program difx2fits to do the actual conversion.
\(\longrightarrow\) makefits clock.joblist
The sniffer output files are at this point inspected. Program difxsniff is run to produce plots which are identical to those produced by sniffer today. Multiple reference antennas (in this example, Los Alamos and Kitt Peak) can be provided at the same time. Sniffer plots and the data that is used to generate them will be placed in a sub-directory of the project directory called sniffer/clock for a pass called “clock”.
\(\longrightarrow\) difxsniff LA KP clock.fitslist
\(\longrightarrow\) gv sniffer/clock/apdfile.ps
If the FITS files are deemed acceptable, they are entered into the VLBA data archive.
\(\longrightarrow\) difxarch clock.fitslist
Finally, once data has been archived, the intermediate files should be removed from the head node:
\(\longrightarrow\) difxclean bx123
Note that this final step is needed only after the entire project is ready for releasing and should not be done every time between completion of job passes.
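The per-pass command sequence above can be summarized in a short script. The following is only a sketch: the function name pass_commands is hypothetical, the commands are merely printed rather than executed, and the manual steps (editing the .v2d file, waiting for jobs to finish, inspecting sniffer plots) are deliberately left out. The project-wide difxclean step is also omitted because it applies once per project, not once per pass.

```python
def pass_commands(project, pass_name, ref_antennas, priority=None):
    """Return the analyst commands for one correlator pass, in order.

    Hypothetical helper for illustration; it only assembles the command
    lines quoted in the text and does not run anything.
    """
    queue = ["difxqueue"]
    if priority is not None:
        queue += ["-p", str(priority)]
    queue += ["add", pass_name]
    return [
        ["db2vex", f"{project}.skd"],
        ["oms2v2d", f"{project}.oms"],
        ["cp", f"{project}.v2d", f"{pass_name}.v2d"],
        ["vex2difx", f"{pass_name}.v2d"],
        queue,
        ["makefits", f"{pass_name}.joblist"],
        ["difxsniff", *ref_antennas, f"{pass_name}.fitslist"],
        ["difxarch", f"{pass_name}.fitslist"],
    ]


# Dry run for the "clock" pass of project bx123 used in the text.
for cmd in pass_commands("bx123", "clock", ["LA", "KP"], priority=1):
    print(" ".join(cmd))
```

Printing the plan first makes it easy to check file names for each of the passes before anything is queued.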
Directory contains undecoded scans
Occasionally a message of the form “Module module directory contains
undecoded scans!” will appear, where module is the name of the module.
This means that one or more scans on that module were not properly read
or decoded. This should first be verified by examining the module directory
file, called $MARK5_DIR_PATH/module.dir. This examination
is best done with program checkdir which looks for a number of
possible abnormalities. Rows where the eleventh column contains a
negative number are scans that were not decoded properly. A known
problem occasionally causes many consecutive scans at the end to be
improperly read and thus undecoded. If this is the case, rename the
existing directory file and try reacquiring the directory. Usually it
will start working immediately.
If one or a small number of scans repeatedly cannot be decoded, the scans may be corrupted for some reason. In this case, simply delete the row(s) from the directory file and then decrement the number following the module name on the first line of the file by the number of scans deleted; this count of the number of scans listed in the file must remain accurate. This operation will cause the correlator to skip over the affected scans and data will be lost, so use appropriate judgement in these cases.
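The hand edit described above (drop the undecoded rows, then decrement the scan count) could be sketched as follows. This is an illustration under stated assumptions, not a supported tool: the function names are hypothetical, the header is assumed to be whitespace-separated with the scan count as its second token, and rows are assumed to be whitespace-separated so that the eleventh column is token index 10. Rewriting the header with single spaces is also an assumption about what the directory reader tolerates. Keep a copy of the original file before trying anything like this.

```python
def _is_undecoded(row):
    """True if the eleventh whitespace-separated column is a negative number."""
    cols = row.split()
    if len(cols) < 11:
        return False
    try:
        return float(cols[10]) < 0
    except ValueError:
        return False


def drop_undecoded(lines):
    """Return directory-file lines with undecoded scans removed.

    lines[0] is the header (module name, then scan count, then anything else);
    the count is decremented by the number of rows removed.
    """
    header = lines[0].split()
    if len(header) < 2:
        raise ValueError("header does not contain a scan count")
    rows = lines[1:]
    kept = [row for row in rows if not _is_undecoded(row)]
    header[1] = str(int(header[1]) - (len(rows) - len(kept)))
    return [" ".join(header)] + kept
```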
Directory read fails on partial module
Modules containing fewer than 8 working disks can be problematic. It is suggested that modules of this type have their directories read preemptively using a special command:
\(\longrightarrow\) mk5control safedirA 12
which is the command to safely read the directory of the module in
bank A of mark5fx12.
Mark5 unit hangs while reading directory
Typically the first thing one should do if a hang occurs is to try
again. For directory reading this can be attempted with the
mk5control program. For instance, if the module in bank A of
mark5fx12 hangs during the directory read, stop the correlation
process with the DOI or via stopmpifxcorr and then issue:
\(\longrightarrow\) mk5control getdirA 12
If this also fails, or never starts, reboot the unit via the DOI or
\(\longrightarrow\) mk5control reboot 12
or, if it really refuses to reboot,
\(\longrightarrow\) ssh 12 /sbin/reboot
Once the unit comes back, try retrieving the directory again.
Mark5 directory reading fails partway through
When the GUI button GetDir fails, the program mk5dir can be used
directly to read a module directory.
Things to try first:
Log into the fx unit and run vsn to look for obvious module problems
Move the module
Erase (or move/rename) the preexisting directory file
Reboot the correlator Mark5 unit
When GetDir fails or crashes the Mark5 unit, it is likely because there
are one or more spots on the module that can’t be read. Using
mk5dir, you can read most of the directory while skipping any
problematic scans.
The mk5dir program will work on both Mark5A and Mark5C modules. It
is best to put them, respectively, in SDK 8 and 9 units. As with most
utilities, typing mk5dir by itself will print help information.
The output directory will be named the usual vsn.dir and will
overwrite any existing file of that name. It will be written to
$MARK5_DIR_PATH, which is the same place to which GetDir writes
directories.
The relevant options in this case are:
-f (force a directory read even if a file already exists)
-v (be verbose)
-e scan number (stop reading the directory at a certain scan number)
-b scan number (begin reading the directory at a certain scan number)
The scan numbering is worth noting. The command line options -e and
-b number the scans starting at 1 (the first scan is 1). But the
on-screen output of mk5dir will begin with a scan numbered 0.
The first step is to read as much of the directory as possible before the first problem scan.
Log in to the fx unit
Run vsn bank to get an overview of the health of the module
Run mk5dir -f -v bank
As it reads each scan, it will print a line indicating its progress.
0/228 -> 3 Decoded
1/228 -> 3 Decoded
etc …
The first number is the scan it just decoded and the second number is the total number of scans on the module. The “3” is related to the data format and should be 3 for Mark5C at all VLBA sites. VLA modules will show “4”. Legacy VLBA and foreign stations may show other numbers. The “Decoded” indicates success, as opposed to something like “XLR Read Error”.
Presumably, it will fail at some point. When it fails it probably won’t write an output directory file. Note which scan it failed to read. Remember that scan \(x/228\) is actually the \((x+1)^{\mathrm{th}}\) scan because it started counting at zero. You may have to reboot the fx unit at this point.
Now run mk5dir again, this time stopping one scan previous to the
one where it died last time.
Run vsn bank (whether you had to reboot or not, checking the module with vsn is probably a good idea)
Run mk5dir -f -v -e \(x\) bank (mk5dir will stop once it has read \(x\) scans)
Rename the output file so it doesn’t get overwritten.
Now we can skip past the bad parts and read the rest of the directory.
Run mk5dir -f -v -b \(x+2\) bank (mk5dir will start with the scan after the one it originally failed on)
If it fails, reboot as necessary, run vsn again, and try starting with scan \(x\)+3 instead. Keep incrementing the start scan until it works; sometimes it might be faster to try a bigger jump, and on success work backwards to find where failing starts. Eventually, you will have skipped past the bad scans and read the rest of the directory. There are likely to only be one or two bad scans, so this step should actually be fairly simple.
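Because the on-screen progress counts scans from 0 while -e and -b count from 1, it is easy to be off by one. The small helper below is a sketch of the arithmetic only; the function name and the skip parameter are hypothetical and nothing here talks to mk5dir.

```python
def mk5dir_split_options(failed_index, skip=1):
    """Translate the on-screen index of a failed scan into -e / -b values.

    failed_index: the 0-based number x printed as "x/N" when mk5dir failed.
    skip: how many consecutive scans to skip, starting with the failed one.
    Returns (end_scan, begin_scan) in the 1-based numbering used by -e and -b.
    """
    if failed_index < 0 or skip < 1:
        raise ValueError("failed_index must be >= 0 and skip must be >= 1")
    end_scan = failed_index                # read the x scans before the failure
    begin_scan = failed_index + 1 + skip   # first scan after the skipped ones
    return end_scan, begin_scan


# A failure reported as 57/228 means the 58th scan is bad:
# the first pass uses -e 57 and the second pass uses -b 59.
print(mk5dir_split_options(57))
# If the scan after it is also bad, skip two scans and start at 60.
print(mk5dir_split_options(57, skip=2))
```

A failed_index of 0 yields an end scan of 0, meaning there is nothing to read before the bad scan and only the second pass is needed.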
Now you will have two directory files, the renamed file with the first
\(x\) scans, and the second file with all the scans after the bad
scan(s). Each file will also have entries for the scans that weren’t
read by that particular invocation of mk5dir, and these lines need
to be deleted. Then using your preference of linux commands and/or text
editor, combine the two files in the appropriate order. Make sure there
is a single header line at the top which lists the correct number of
scans.
If there are nonsequential bad scans you will have to concatenate three or more files, but the steps remain the same.
Mark5 unit hung
Unfortunately, there are still some instabilities with Mark5 units that
result in various kinds of hangs; some units appear more sensitive than
others. Often a failed Mark5 can be identified with the last few lines
of error messages output from mpifxcorr. To verify, first attempt to
ssh into that unit. If that is successful, try watching the output
of cpumon and mk5mon (or the equivalent from the DOI). If no
updates come from cpumon then it is likely that the computer has
seized and requires a hard reboot. Otherwise if mk5mon shows no
updates, the problem is likely with the Streamstor card and/or disk
module. If logging into the Mark5 unit works, try resetting the
Streamstor card with:
mk5control reset unitNumber
where unitNumber is, for example, 07 for mark5fx07 or 23 for
mark5fx23. The Mark5 state shown in mk5mon should change to
“Resetting”. If it does not, then it is likely a reboot is needed.
If none of the above works, try rebooting the particular Mark5 unit and
starting over. Note: as currently configured, a Mark5 unit will restart
the Mark5A upon boot, so you will need to use mk5take to stop
that before attempting software correlation on that unit again. Make
sure to give the Mark5 unit enough time to initialize the Mark5A
program before running mk5take (i.e., wait for module lights to
cycle).
A possibly more reliable way to identify a hung Mark5 unit is to start a
new instance of mk5mon (§[sec:mk5mon]) in a
terminal and issue the following command:
mk5control getvsn mark5
A hung Mark5 will not show up in the list of units.
Module moved
If a required module has been removed or moved since genmachines has
run, mpifxcorr will not be able to correlate. In this case DiFX will
fail, spitting out a substantial amount of debug information. You can
try again by running genmachines baseFilename.input to force
the recreation of the .machines file. If this program fails, it will
report an error that may aid in diagnostics. Note that this scenario
will not happen if the DiFX Operator Interface or startdifx
(§[sec:startdifx]) is used to run the correlator.
Correlating data files
The operating instructions up to this point have focused on correlation
directly off Mark5 modules. Correlation off files is also supported, as
is a mixed mode where files and modules are correlated together. The
scripts described in this document don’t (to date) make correlation of
files easy, but it is possible to do so by hand editing files. It is
expected that enhancements to the scripts will make correlation from
files much easier in future versions of DiFX. Two files will need
manipulation: .input and .machines. In the .input file,
every entry in the DATASTREAM table that corresponds to a disk file
needs the DATA SOURCE value changed from MODULE to FILE. The
.machines file will likely have to be constructed completely by
hand. See §[sec:machines] for a detailed
description of the format of that file. Note that it is no longer
necessary for the data files to be visible to all cluster computers –
they can reside on local drives that are not exported, including USB or
Firewire drives, but this requires that the datastream nodes listed in
the .machines file be in the order in which the antennas are listed
in the .input file.
Note: you must use the -n option to startdifx when starting
the correlation or the hand-edited .machines file will be
overwritten.
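The .input edit described above can be sketched in a few lines. This is a rough illustration under assumptions, not a description of an existing DiFX script: the function name is hypothetical, the file is assumed to consist of "KEY: value" lines, and each datastream is assumed to contribute exactly one DATA SOURCE line, in datastream order. The .machines file still has to be written by hand.

```python
def use_files_for(input_lines, datastream_indices):
    """Switch DATA SOURCE from MODULE to FILE for the chosen datastreams.

    input_lines: lines of a .input file (assumed "KEY: value" layout).
    datastream_indices: 0-based positions of the DATA SOURCE entries to change,
    counted in the order in which they appear in the file.
    """
    out = []
    seen = -1
    for line in input_lines:
        key, sep, value = line.partition(":")
        if sep and key.strip() == "DATA SOURCE":
            seen += 1
            if seen in datastream_indices and value.strip() == "MODULE":
                # Replace only the value, preserving the original padding.
                line = key + sep + value.replace("MODULE", "FILE", 1)
        out.append(line)
    return out
```

Only entries currently set to MODULE are touched, so running the function twice on the same lines gives the same result.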
The VLBA database
Many of the existing VLBA tools (such as the Observation Management
System (OMS), mon2db, cjobgen, and others) make use of an Oracle
database for persistent storage of various information related to
projects that use either the VLBA antennas or correlator. Many aspects
of VLBA-DiFX are not a good match for the existing database tables;
adapting the existing tables to work nicely with VLBA-DiFX will be
disruptive and have implications for much existing code, including
software that will not be needed once FXCORR is shut down. The proposed
solution to this dilemma is to use a parallel set of database tables for
correlation and archiving when using VLBA-DiFX. The use of existing
software for generation of FXCORR jobs will continue unchanged. For
projects to be correlated using VLBA-DiFX, OMS will still be used
for observation preparation tasks, but will not be used in preparation
of correlation or anything that occurs beyond that in the project’s life
cycle. Instead, vex2difx will be used to generate jobs,
difxqueue will be used in lieu of OMS to stage correlator jobs,
and difxarch will be used in the archiving of data. The queuing tool
difxqueue will be used to display the state of the VLBA-DiFX job
queue as well as populate it. The new tools will access two new
database tables: DIFXQUEUE and DIFXLOG; the contents of these tables are
shown in Tables [tab:difxqueue] and [tab:difxlog].
DIFXQUEUE database table (cf. the FXQUEUE table currently used by OMS). Entries to this table will be initially made by difxqueue. The STATUS field will be automatically updated as appropriate during correlation. [tab:difxqueue]
| Column | Type | Comments |
|---|---|---|
| PROPOSAL | VARCHAR2(10) | The proposal code |
| SEGMENT | VARCHAR2(2) | Segment (epoch) of proposal, or blank |
| JOB_PASS | VARCHAR2(32) | Name of correlator pass (e.g. “geodesy”) |
| JOB_NUMBER | INT | Number of job in the pass |
| PRIORITY | INT | Number indicating the priority of the job in the queue; 1 is highest |
| JOB_START | DATE | Observe time of job start |
| JOB_STOP | DATE | Observe time of job stop |
| SPEEDUP | FLOAT | Estimated speed-up factor for job |
| INPUT_FILE | VARCHAR2(512) | Full path of the VLBA-DiFX input file |
| STATUS | VARCHAR2(32) | Status of the job, perhaps “QUEUED”, “KILLED”, “RUNNING”, “FAILED”, “UNKNOWN” or “COMPLETE” |
| NUM_ANT | INT | Number of antennas in the job |
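To make the roles of the PRIORITY and JOB_START columns concrete, here is a sketch of how queued jobs might be ordered for display: lowest priority number first, then observe time. The QueueEntry class and queue_order function are hypothetical, only a subset of the columns is modelled, and the priority values given to non-elevated jobs are made up for the example; the actual ordering logic in difxqueue is not documented here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QueueEntry:
    """A few DIFXQUEUE columns, enough to illustrate ordering."""
    proposal: str
    job_pass: str
    job_number: int
    priority: int        # 1 is highest
    job_start: datetime  # observe time of job start


def queue_order(entries):
    """Sort by priority (1 first), then by observe start time."""
    return sorted(entries, key=lambda e: (e.priority, e.job_start))


jobs = [
    QueueEntry("BX123", "clock", 1, 2, datetime(2009, 12, 1, 0, 0)),
    QueueEntry("BX123", "clock", 2, 1, datetime(2009, 12, 1, 6, 0)),
]
for job in queue_order(jobs):
    print(job.proposal, job.job_pass, job.job_number, job.priority)
```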
DIFXLOG database table (cf. the FXLOG table currently used by OMS). A row will be written to this table after each successful correlation by the DiFX Operator Interface. [tab:difxlog]
| Column | Type | Comments |
|---|---|---|
| PROPOSAL | VARCHAR2(10) | The proposal code |
| SEGMENT | VARCHAR2(2) | Segment (epoch) of proposal, or blank |
| JOB_PASS | VARCHAR2(32) | Name of correlator pass (e.g. “geodesy”) |
| JOB_NUMBER | INT | Number of job in the pass |
| CORR_START | DATE | Start time/date of correlation |
| CORR_STOP | DATE | Stop time/date of correlation |
| SPEEDUP | FLOAT | Measured speed-up factor |
| INPUT_FILE | VARCHAR2(512) | File name of .input file |
| OUTPUT_FILE | VARCHAR2(512) | File name of correlator output |
| OUTPUT_SIZE | INT | Size (in \(10^6\) bytes) of correlator output |
| CORR_STATUS | VARCHAR2(32) | Status of correlation, typically “COMPLETED” |
Archiving
Archiving of VLBA-DiFX data will be done on a per-pass basis. All
.FITS files associated with a single correlator pass will be
archived together. A particular staging directory for VLBA-DiFX data has
been set up. Populating the archive amounts to first copying the files
to be archived to this directory, making sure that the first character of
each file name is “.”. Once an entire file is transferred, it is
renamed without the leading period. This system is the standard way to
populate the Next Generation Archive System (NGAS) [ngas] without
potential for an incompletely copied file to be archived. The file names
will be composed only of
alpha-numeric characters and “.” and “_”. These characters have no
special meaning in any relevant software, including http, XML,
bash/Linux command lines, the Oracle database parser, etc. File names
will have the following format:
VLBA_projectCode_passName_fileNum_corrDateTcorrTime.idifits
where the italicized fields, which themselves will be limited to alphanumeric characters, are as follows:
| Field | Type | Comment |
|---|---|---|
| projectCode | string | Project code, including segment if appropriate |
| passName | string | Name of the pass, as set in the |
| fileNum | integer | FITS file sequence number within pass |
| corrDate | date (yymmdd) | Date corresponding to correlation completion |
| corrTime | time (hhmmss) | Time corresponding to correlation completion |
Parameter fileNum is the sequence number of the created .FITS file
which may or may not have a direct correspondence with the job sequence
number within the correlator pass. An example archive file name relevant
to the sample project used in this memo may be:
VLBA_BX123_clock_1_091223T032133.idifits
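The naming rule can be expressed as a small function. This is a sketch only: the function name archive_file_name is hypothetical, and the check that the string fields are purely alphanumeric follows the restriction stated above.

```python
import re
from datetime import datetime

_ALNUM = re.compile(r"[A-Za-z0-9]+")


def archive_file_name(project_code, pass_name, file_num, corr_done):
    """Build VLBA_projectCode_passName_fileNum_corrDateTcorrTime.idifits.

    corr_done is the datetime of correlation completion; it is rendered
    as yymmdd and hhmmss separated by a literal "T".
    """
    for field in (project_code, pass_name):
        if not _ALNUM.fullmatch(field):
            raise ValueError(f"field must be alphanumeric: {field!r}")
    stamp = corr_done.strftime("%y%m%dT%H%M%S")
    return f"VLBA_{project_code}_{pass_name}_{int(file_num)}_{stamp}.idifits"


# Reproduces the example name given in the text.
print(archive_file_name("BX123", "clock", 1, datetime(2009, 12, 23, 3, 21, 33)))
# -> VLBA_BX123_clock_1_091223T032133.idifits
```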
All files produced for a given pass will be placed in a single directory,
$DIFX_ARCHIVE_ROOT/projectCode/passName
where DIFX_ARCHIVE_ROOT is an environment variable pointing to the
head of the archive staging area for VLBA-DiFX. During the transfer to
the archive, the projectCode portion of the directory tree will begin
with a period that is to be removed by renaming once all files are
completely copied. This will allow the archive loader to logically group
together all the files of the pass. If needed, an index file listing the
association of archive .FITS files and correlator jobs can also be
placed in this directory; the .fitslist file produced by difx2fits
would serve this purpose. In order to ensure the atomic nature of
correlator passes in the archive, the renaming of the copied files from
the temporary versions starting with “.” will not occur until all
archive files are transferred. An archive loader will
periodically (initially about every 30 minutes, but perhaps later with
much shorter intervals) look for new files in the archive staging area
to store. The archive data will be available moments later for users
wanting to download the data.
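The copy-then-rename idea behind the staging area can be illustrated as follows. This is a minimal sketch, not the archiving tool itself: the function name stage_for_archive is hypothetical, it implements the per-file variant in which every file is first copied under a name beginning with “.” and renamed only after all copies are complete, and it does not deal with file ownership.

```python
import os
import shutil


def stage_for_archive(src_paths, staging_dir):
    """Copy files into staging_dir under hidden names, then reveal them all.

    Nothing becomes visible under its final name until every copy has
    finished, so a loader scanning for non-hidden files never sees a
    partially copied pass.
    """
    os.makedirs(staging_dir, exist_ok=True)
    pending = []
    for src in src_paths:
        name = os.path.basename(src)
        hidden = os.path.join(staging_dir, "." + name)
        shutil.copyfile(src, hidden)
        pending.append((hidden, os.path.join(staging_dir, name)))
    # Renaming within one directory is a metadata operation and is quick
    # compared with the copies above.
    for hidden, final in pending:
        os.rename(hidden, final)
    return [final for _, final in pending]
```

Renaming stays within the staging directory on purpose: a rename across file systems would turn back into a copy and lose the benefit.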
Ownership and permissions of files
This section describes how the different user accounts interact within the DiFX correlation process. This is VLBA operations specific. No fewer than 4 user accounts are used in the life cycle of a DiFX project:
analysts: The analysts account is used to prepare the .input and other files used to run DiFX. These files retain analysts ownership in the /home/vlbiobs/astronomy area and while queued.
root: The root account owns the mk5daemon processes that run on each of the Mark5 and software correlator computers.
difx: User account difx is used to actually run mpifxcorr (and difxlog) during correlation. This is important, as the difx account has all of the proper environment variables set up. The DiFX Operator Interface may also be run as user difx.
e2emgr: The data archive requires that the files staged for entry into the archive be given e2emgr ownership. This is accomplished with the program e2ecopy, which runs with root privileges and hence can change its userid to e2emgr as needed.
By default, all files will have owner and group read/write permission and global read permission.
Copying baseband data
Some users wish to perform their own analysis on baseband VLBI data. This section describes the procedure for copying data and ends with guidelines that should be sent to each user requesting data to be copied. The guidelines are there to streamline the process and to minimize the chance of problems.
Performing the data copy
FIXME : write me!
Guidelines to users
Users wishing to retain a copy of the baseband data should make sure to conform to the following guidelines. Please note that many of the instructions below are there to ensure that root access is not required in the copying of the data. If root access is needed as a result of failure to comply, a delay in the copying will be incurred.
Arrangement for data copy should be made prior to observing.
The external disk must have a working USB2 connector for data transfer. We specifically do not support Firewire at this time. We do not support power-over-USB drives.
Each disk should have a sticker label attached with the owner’s name, institution, phone number, shipping address and email. The project code should be clearly marked as well. If multiple disks are shipped, each should have a unique serial number clearly labeled on the disk.
The external disk must be preformatted with a standard Linux filesystem (ext2 or ext3). It is preferred that the options -m 0 -T largefile4 are used with mke2fs, and it is also suggested that the -L option is used to specify a volume name/number that matches the labelled serial number.
It is the user’s responsibility to ensure that sufficient post-formatted capacity is available on the disks.
A world-writable root-level directory with the name of each project to be copied should be made. The directory name should contain only uppercase letters and numbers.
The disk should be empty upon delivery. NRAO will not be responsible for data that is deleted. The root-level directory(s) mentioned above should be the only exception(s).
Include the power transformer/cable and USB cable with the disk. It is recommended that owner labels be attached to each of these.
Note to foreign users: please ensure that the power supply has either a NEMA 1-15 / Type A ungrounded power connector or a NEMA 5-15 / Type B grounded power connector and that the power supply works on 110V/60Hz. If this is accomplished using an adapter, it must be included in the box with the disk.
Foreign users will be responsible for customs charges and are encouraged to contact NRAO well in advance of any shipment to minimize cost.
Some comments on channels
This section discusses the accountability of channel identification through the entire DiFX system. While much of this discussion will not be of use outside NRAO, the terminology discussed here might help explain other portions of this document. The subject of this section is baseband channels, not individual frequency channels (spectral points) that the correlator produces from the baseband channels.
Baseband channels are individual digital data streams containing a
time-series of sampled voltages representing data from a particular
portion of the spectrum from one polarization. Each baseband channel is
assigned a recorder channel number. For a given baseband data format
(i.e., VLBA, Mark4, Mark5B, …) a particular recorder channel number is
assigned to a fixed number of tracks or bitstreams. This mapping is
contained in the track row in the format table of the .fx
job script and can be different for each antenna. This mapping is also
reflected in the .input file in the datastream table.
DiFX correlates baseband channels from multiple antennas to produce visibilities. For each correlated baseline, one or two basebands from one telescope will be correlated against one or two basebands of another, resulting in up to four products for a particular sub-band. This is to allow full polarization correlation. Each sub-band (called an IF in AIPS) is given a sub-band number; in general 1 or 2 recorder channels map to each sub-band. Note that an observation can simultaneously observe some sub-bands consisting of only one baseband and some with two basebands. In cases such as this the matrix containing the visibility products on a particular baseline will be large enough in each dimension (i.e., polarization product, sub-band) to contain all of the results, even if this consumes more storage than necessary; flags are written that invalidate portions of the visibility matrix that are not produced by the correlator.
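The count of "up to four products" follows from pairing each baseband of one telescope with each baseband of the other within a sub-band. The sketch below enumerates those pairings; the function name is hypothetical and the polarization labels R and L are used purely for illustration.

```python
from itertools import product


def baseline_products(pols_a, pols_b):
    """List the polarization products for one sub-band on one baseline.

    pols_a, pols_b: polarizations of the one or two basebands recorded
    for this sub-band at each of the two telescopes.
    """
    return [a + b for a, b in product(pols_a, pols_b)]


# Two basebands at each telescope: full polarization, four products.
print(baseline_products(["R", "L"], ["R", "L"]))  # ['RR', 'RL', 'LR', 'LL']
# One baseband at each telescope: a single product.
print(baseline_products(["R"], ["R"]))            # ['RR']
```

When sub-bands with one and two basebands are mixed, the sub-band with fewer basebands yields fewer entries than the matrix has room for, which is the case the flags described above are meant to cover.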