Running the DAQ

Emanuele Leonardi edited this page Nov 22, 2022 · 39 revisions

Logging in

To run the PADME DAQ system, the shifter must log in to l0padme1 as user daq. The password can be obtained from the Run Coordinator (or from most members of the collaboration). After logging in, cd to the DAQ directory.

[padme@padmecr4 ~]$ ssh -Y daq@l0padme1
daq@l0padme1's password: 
Last login: Mon Oct  8 09:39:46 2018 from padmecr4.lnf.infn.it
[daq@l0padme1 ~]$ cd DAQ
[daq@l0padme1 DAQ]$ 

N.B. if you are doing a remote shift (i.e. from outside the laboratory), you must first connect to the INFN-LNF VPN following the procedure described in the Guide for remote shifters.

Starting the RunControl server

All DAQ procedures are handled through the RunControl server. This is a daemon process running on l0padme1.

To verify if the process is running:

[daq@l0padme1 DAQ]$ ps -fu daq | grep RunControl
UID         PID   PPID  C STIME TTY          TIME CMD
...
daq      177988      1  0 10:05 ?        00:00:00 /usr/bin/python ./RunControl --server
...

If it is not running, please restart it:

[daq@l0padme1 DAQ]$ ./RunControl --server
Starting RunControlServer in background

All output from the RunControl server process is written to the log/RunControlServer.log file in the DAQ directory. Looking into this file can help troubleshoot DAQ problems.
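
The process check described above can also be scripted. The following is a minimal sketch, assuming Python 3.7 or later is available on the node; the function names find_server_pids and server_pids_now are illustrative and are not part of the DAQ software. It parses the same ps -fu daq output shown above and keeps only the lines whose command contains both RunControl and --server, so the grep process itself and any RunControl client are ignored.

```python
import subprocess


def find_server_pids(ps_output):
    """Return the PIDs of RunControl server processes found in `ps -fu <user>` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -f columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header, "..." or otherwise unparsable line
        command = fields[7:]
        is_runcontrol = any(arg.endswith("RunControl") for arg in command)
        if is_runcontrol and "--server" in command:
            pids.append(int(fields[1]))
    return pids


def server_pids_now(user="daq"):
    """Run ps for the given user (hypothetical helper) and return the server PIDs."""
    result = subprocess.run(["ps", "-fu", user], capture_output=True, text=True, check=False)
    return find_server_pids(result.stdout)
```

An empty list from server_pids_now() would mean the server is not running and has to be restarted as described above.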

Starting the RunControl client

The RunControl client is used to issue commands to the RunControl server (start a new run, stop the run, ...). To start the client:

[daq@l0padme1 DAQ]$ ./RunControl
Connecting to RunControl server on host localhost port 10000
SEND (q or Q to Quit):

This will start the RunControl client in text mode. All commands will be given from this terminal. The help command can be used to get a list of available commands at any point of the RunControl procedure.

SEND (q or Q to Quit): help
Sending help
Available commands:
help                        Show this help
get_state                   Show current state of RunControl
get_setup                   Show current setup name
get_setup_list              Show list of available setups
get_board_list              Show list of boards in use with current setup
get_board_config_daq <b>    Show current configuration of board DAQ process <b>
get_board_config_zsup <b>   Show current configuration of board ZSUP process <b>
get_trig_config             Show current configuration of trigger process
get_run_number              Return last run number in DB
change_setup <setup>        Change run setup to <setup>
new_run                     Initialize system for a new run
shutdown                    Tell RunControl server to exit (use with extreme care!)
SEND (q or Q to Quit):

Verifying and changing the setup

Before starting a new run it is wise to verify which setup is currently loaded and change it if needed. For the time being, unless told otherwise by the Run Coordinator, the correct setup is full2022 which will enable all ADC boards and acquire data from all PADME detectors. Before starting any run, please make sure that the setup is correct:

SEND (q or Q to Quit): get_setup
Sending get_setup
full2022

If asked by the Run Coordinator, you can change the setup, e.g.

SEND (q or Q to Quit): get_setup_list
Sending get_setup_list
['full2022', 'ecal_sac_cosmics', 'ecal_sac_cosmics_nozsup', 'target_sac_201907', 'test201907', 'test2020_nozsup']
SEND (q or Q to Quit): change_setup ecal_sac_cosmics
Sending change_setup ecal_sac_cosmics
ecal_sac_cosmics

The change_setup command is also used to reload a setup if any of its files have changed (WARNING: only the Run Coordinator is allowed to edit the setup files):

SEND (q or Q to Quit): get_setup
Sending get_setup
full2022
SEND (q or Q to Quit): change_setup full2022
Sending change_setup full2022
full2022
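
The setup check above can be expressed as a small decision helper. This is only a sketch: it assumes the get_setup_list reply is printed as a Python list literal, as in the transcript above, and the names parse_setup_list and setup_command are hypothetical. It does not talk to the RunControl server; it only works on text copied from the client.

```python
import ast


def parse_setup_list(reply):
    """Turn the text printed by get_setup_list into a list of setup names."""
    setups = ast.literal_eval(reply.strip())
    if not isinstance(setups, list) or not all(isinstance(s, str) for s in setups):
        raise ValueError("unexpected get_setup_list reply: %r" % reply)
    return setups


def setup_command(current, wanted, available):
    """Return the client command needed to load `wanted`, or None if it is already loaded."""
    if wanted not in available:
        raise ValueError("setup %r is not in the list of available setups" % wanted)
    if current == wanted:
        return None
    return "change_setup " + wanted
```

Note that reloading an already loaded setup after its files changed still requires sending change_setup explicitly, which this helper deliberately does not suggest.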

Initializing a new run

Before initializing a new run it is good practice to verify if the DAQ system is clean and the ADC boards are accessible. The procedure for this is described in the Verifying DAQ status section below.

SEND (q or Q to Quit): new_run
Sending new_run
Current setup is full202007
Available run types: CALIBRATION,COSMICS,DAQ,FAKE,OTHER,RANDOM,TEST,TESTBEAM
Run type: DAQ
Sending DAQ
New run will be of type DAQ
New run will have run number 30046

Note: supported run types are TEST, DAQ, CALIBRATION, COSMICS, RANDOM, OTHER. The system also supports the FAKE and TESTBEAM run types but these are only to be used by experts.

N.B. Uppercase is mandatory.
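
As an illustration of the two rules above (restricted run types and mandatory uppercase), here is a minimal sketch of a check that could be applied to the text before typing it at the "Run type:" prompt. The function name check_run_type and its return strings are hypothetical; only the run type names come from this page.

```python
# Run types listed in the note above.
SHIFTER_RUN_TYPES = ("TEST", "DAQ", "CALIBRATION", "COSMICS", "RANDOM", "OTHER")
EXPERT_RUN_TYPES = ("FAKE", "TESTBEAM")


def check_run_type(text):
    """Classify a run type string as typed by the shifter."""
    if text in SHIFTER_RUN_TYPES:
        return "ok"
    if text in EXPERT_RUN_TYPES:
        return "expert-only"
    if text.upper() in SHIFTER_RUN_TYPES + EXPERT_RUN_TYPES:
        return "wrong-case"  # right name, but uppercase is mandatory
    return "unknown"
```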

WARNING: in some (hopefully rare) cases the new_run command gets stuck, i.e. no response is shown. This is due to a glitch in the database connection. To restore normal operation, kill the RunControl server, as explained in the Exiting from the system section below, and restart it, as explained in the Starting the RunControl server section above. Then start a new RunControl client and try again to initialize the run (N.B. the Clean-up Procedure is not needed here).

Shift crew: Emanuele
Sending Emanuele
Start of run comment: My first test run
Sending My first test run

Both "Shift crew" and "Start of run comment" accept free-format text of (almost) unlimited length. Try to be as detailed as possible in describing the run conditions (beam status, HV status, ADC boards included, special conditions, etc.).

Now the run initialization procedure can start. Expect a delay of several seconds before the first message is shown.

New run initialization start
level1 0 ready
level1 1 ready
...
merger ready
trigger init
adc 0 zsup_init
adc 1 zsup_init
...
adc 28 zsup_init
adc 0 daq_init
adc 1 daq_init
...
adc 28 daq_init
trigger ready
adc 0 ready
adc 1 ready
...
adc 28 ready
adc all ready
New run initialization completed correctly
init_ready

The initialization procedure for the full experiment (29 ADC boards) takes up to 2 minutes, so wait patiently.
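
While waiting, the messages printed during initialization can be summarized to see how far the procedure has progressed. The sketch below assumes the message format shown in the transcript above ("adc N ready", "init_ready"); the function name init_progress and the expected_boards parameter are illustrative. It only counts boards and makes no assumption about how they are numbered.

```python
import re

# Matches "adc 12 ready" but not "adc all ready" or "adc 12 daq_init".
ADC_READY = re.compile(r"^adc (\d+) ready$")


def init_progress(lines, expected_boards=29):
    """Summarize initialization messages copied from the RunControl client."""
    ready = set()
    completed = False
    for line in lines:
        line = line.strip()
        match = ADC_READY.match(line)
        if match:
            ready.add(int(match.group(1)))
        elif line == "init_ready":
            completed = True
    return {
        "boards_ready": len(ready),
        "boards_expected": expected_boards,
        "completed": completed,
    }
```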

On some occasions the initialization procedure can time out, fail or get stuck. If any of this happens, please check the DAQ Troubleshooting page for the correct recovery procedure. If all recovery procedures fail, it is time to call an expert.

Starting a new run after initialization

SEND (q or Q to Quit): start_run
Sending start_run
Run started correctly
run_started

Stopping a run

SEND (q or Q to Quit): stop_run
Sending stop_run
End of run comment: My end of run
Sending My end of run
adc 0 daq_terminate_ok
adc 0 zsup_terminate_ok
...
adc 28 daq_terminate_ok
adc 28 zsup_terminate_ok
trigger terminate_ok
merger terminate_ok
level1 0 terminate_ok
level1 1 terminate_ok
...
Run terminated correctly
terminate_ok

Moving the DAQ client window to another terminal

Only a single RunControl client can connect to the RunControl server at any given time. If you want to move the client from one terminal to another, issue the Q command on the original client and then start the new one with the usual command: this procedure will not affect the RunControl server in any way (e.g. if a run is in progress it will keep taking data).

Please note that the client MUST NOT be stopped while the new_run procedure is in progress: this would leave the RunControl server in an undefined state and would require stopping and restarting it.

If you are leaving after your shift and no one is coming after you, please close the client window: in this way, anybody will be able to take over and manage the RunControl even if they are not physically at the laboratory.

Exiting from the system

Use this procedure only if you want to stop the main RunControl server. This should be done only if the initialization or stop_run procedures fail and/or the system gets into a pathological state (no response to the client).

SEND (q or Q to Quit): shutdown
Sending shutdown
exiting
Server's gone. I'll take my leave as well...
Closing socket

If the server is stuck and does not respond to user commands (this can happen in rare cases, e.g. during the procedure to connect to the database), it can be killed with the Unix kill (or possibly kill -9) command:

[daq@l0padme1 DAQ]$ ps -fu daq | grep RunControl
UID         PID   PPID  C STIME TTY          TIME CMD
...
daq      177988      1  0 10:05 ?        00:00:00 /usr/bin/python ./RunControl --server
...
[daq@l0padme1 DAQ]$ kill 177988

Log files

All active processes created during the DAQ produce individual log files, which can be very useful to verify that the DAQ is running smoothly. All log files for a given run are stored in the log subdirectory of a directory named after the run (e.g. run_0000000_20181005_094240/log). The run directory is created inside the DAQ/runs subdirectory (i.e. DAQ/runs/run_0000000_20181005_094240/log for the previous example).
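
The naming convention above can be turned into a path with a few lines of code. This sketch assumes that run names always follow the run_<number>_<date>_<time> pattern of the example and that log files are named <run name>_<process>.log, as for the trigger and merger logs on this page; the function name run_log_file is hypothetical, and process names other than trigger and merger are not confirmed here.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the example run name run_0000000_20181005_094240.
RUN_NAME = re.compile(r"^run_\d+_\d{8}_\d{6}$")


def run_log_file(run_name, process, daq_dir="DAQ"):
    """Build the path of the log file of one process for the given run."""
    if not RUN_NAME.match(run_name):
        raise ValueError("unexpected run name: %r" % run_name)
    log_dir = PurePosixPath(daq_dir) / "runs" / run_name / "log"
    return log_dir / ("%s_%s.log" % (run_name, process))
```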

To check if the trigger board is correctly receiving the trigger from the BTF:

[daq@l0padme1 log]$ tail -f run_0000000_20181005_094240_trigger.log 
... Some setup messages ...
2020/06/13 13:31:56 - Starting trigger generation
- Setting process status to RUNNING (5)
DBINFO - 2020/06/13 13:31:56 - process_set_status 5
DBINFO - 2020/06/13 13:31:56 - process_set_time_start 2020/06/13 13:31:56
- Enabling requested triggers.
trig_set_register cmd = 0x010203841002
Current trigger mask: 0x02
- Trigger         0 0x028e01061ebf9fcc   26285678540 0x01  142 0 1
- Trigger       100 0x02f2010627b72903   26436118787 0x01  242 0 1    1880.503ms    1671.407ms 53.18Hz
- TrigMsk 1605591101 0(93,92,48.92Hz) 1(4,4,2.13Hz) 3(2,2,1.06Hz) 7(2,2,1.06Hz)
- Trigger       200 0x0256010630964964   26584959332 0x01   86 0 1    1860.507ms    1850.141ms 53.75Hz
- TrigMsk 1605591102 0(184,91,48.91Hz) 1(10,6,3.22Hz) 3(4,2,1.07Hz) 7(3,1,0.54Hz)
... Trigger number keeps growing steadily ...
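
To follow the trigger counter without reading every line, the "- Trigger" lines can be reduced to a trigger number and the value printed with the Hz suffix. This is a sketch under assumptions: the value ending in Hz is taken to be a trigger rate, the meaning of the other columns is not assumed, and the function name trigger_rates is illustrative.

```python
import re

# "- Trigger <number> ... <value>Hz"; the first trigger line carries no Hz value and is skipped.
TRIGGER_LINE = re.compile(r"^- Trigger\s+(\d+)\s.*\s(\d+(?:\.\d+)?)Hz\s*$")


def trigger_rates(lines):
    """Return (trigger number, value in Hz) pairs found in trigger log lines."""
    rates = []
    for line in lines:
        match = TRIGGER_LINE.match(line.strip())
        if match:
            rates.append((int(match.group(1)), float(match.group(2))))
    return rates


def counter_is_growing(rates):
    """True if the trigger numbers are strictly increasing."""
    numbers = [number for number, _ in rates]
    return all(a < b for a, b in zip(numbers, numbers[1:]))
```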

To check if the event merger is receiving data from all boards with the correct synchronization:

[daq@l0padme1 log]$ tail -f run_0000000_20181005_094240_merger.log 
... Some setup messages ...
Board 19 has id 23 and SN 223
Board 20 has id 27 and SN 182
- Written 100 events
  Event     100 size   43825 time 1592055124.879055715s clock  620781087 status 0001 trigger mask 02 fifo 00 auto 01 missing boards 00000000
- Written 200 events
  Event     200 size   43825 time 1592055133.315418273s clock 1296144461 status 0001 trigger mask 02 fifo 00 auto 01 missing boards 00000000
- Written 300 events
  Event     300 size   43825 time 1592055140.345031194s clock 1858490107 status 0001 trigger mask 02 fifo 00 auto 01 missing boards 00000000
... Number of written events keeps growing steadily ...
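
The "Event" lines of the merger log can be screened automatically for the "missing boards" field. The sketch below assumes that this field is a hexadecimal mask and that any non-zero value means at least one board did not contribute to the event; both points are assumptions drawn from the sample output above, and the function name events_with_missing_boards is illustrative.

```python
import re

EVENT_LINE = re.compile(
    r"Event\s+(\d+)\s+size\s+(\d+)\s.*missing boards\s+([0-9A-Fa-f]+)\s*$"
)


def events_with_missing_boards(lines):
    """Return (event number, mask) for events whose missing-boards mask is not zero."""
    flagged = []
    for line in lines:
        match = EVENT_LINE.search(line)
        if match and int(match.group(3), 16) != 0:
            flagged.append((int(match.group(1)), match.group(3)))
    return flagged
```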

Any problem in the DAQ will immediately show up in the event merger log file. Since problems with the merger process are most often caused by problems with the ADC boards, the fastest recovery procedure is to reset all VME and NIM crates (see the Reset VME crates and Reset NIM Crates and Vetos procedures) and then run the Clean-up Procedure.

An example of a problem caused by the trigger board losing packets looks like this:

... all good up to now ...
- Written 7000 events
- Written 7100 events
*** Board  0 - Board time 357818173696 less than Trigger time 394468561288: skip event and try to recover
*** Board  0 - Board time 357838171793 less than Trigger time 394468561288: skip event and try to recover
*** Board  0 - Board time 357878168337 less than Trigger time 394468561288: skip event and try to recover
... problem messages keep repeating over and over ...

N.B. on some occasions a single trigger event is not reported to the DAQ system by the Trigger Board. In this case the system will report the standard error messages but will recover automatically: stop the run only if you see the error messages repeating over and over.
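
The distinction between an isolated error and a persistent one can be sketched as a count of consecutive "skip event and try to recover" messages, reset each time the merger reports newly written events. The threshold max_repeats and the function name merger_error_streak are arbitrary illustrative choices, not values used by the DAQ software.

```python
RECOVER_MESSAGE = "skip event and try to recover"


def merger_error_streak(lines, max_repeats=3):
    """Return the longest run of consecutive recovery messages and whether it reaches the threshold."""
    streak = 0
    longest = 0
    for line in lines:
        if RECOVER_MESSAGE in line:
            streak += 1
            longest = max(longest, streak)
        elif line.strip().startswith("- Written"):
            # New events were written: the merger recovered.
            streak = 0
    return {"longest_streak": longest, "persistent": longest >= max_repeats}
```

With this sketch, a single error line followed by further "- Written" lines is reported as not persistent, which matches the guidance above to stop the run only when the messages keep repeating.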

Verifying DAQ status

Before starting a new run, it is good practice to verify the overall status of the DAQ system, of the online VME/NIM crates and of the ADC boards. This is particularly important after a run was stopped because of a DAQ problem.

The scripts to be used for this are:

  • CleanupCheck to check the software status of the DAQ system
  • OnlineCrateCheck to check if all VME and NIM crates are up
  • ADCBoardCheck to check if all ADC boards are correctly accessible and ready for DAQ

The three scripts must be run on the l0padme1 node as user daq from the DAQ directory (i.e. from the same place where you run the RunControl). The expected outputs when the system is in the correct state are:

[daq@l0padme1 DAQ]$ ./CleanupCheck

=== Checking Merger/Level1 node l1padme3 ===
Node l1padme3 cleanup status OK

=== Checking Merger/Level1 node l1padme4 ===
Node l1padme4 cleanup status OK

=== Checking DAQ/ZSUP node l0padme4 ===
Node l0padme4 cleanup status OK

=== Checking DAQ/ZSUP node l0padme5 ===
Node l0padme5 cleanup status OK

=== Checking Trigger node l0padme1 ===
Node l0padme1 cleanup status OK

CleanupCheck OK - All DAQ nodes are in a clean state: you can restart the run

[daq@l0padme1 DAQ]$ ./OnlineCrateCheck
NIM Crate right READY
NIM Crate left  READY
VME Crate right READY
VME Crate left  READY

[daq@l0padme1 DAQ]$ ./ADCBoardCheck
Board 0 READY
Board 1 READY
Board 2 READY
Board 3 READY
Board 4 READY
Board 5 READY
Board 6 READY
Board 7 READY
Board 8 READY
Board 9 READY
Board 10 READY
Board 11 READY
Board 13 READY
Board 14 READY
Board 15 READY
Board 16 READY
Board 17 READY
Board 18 READY
Board 19 READY
Board 20 READY
Board 21 READY
Board 22 READY
Board 23 READY
Board 24 READY
Board 25 READY
Board 28 READY

Should any problem be found, the scripts report it together with instructions on how to solve it.

N.B. to avoid any possible interference with an ongoing run, no corrective action is taken by the scripts themselves.
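
Since the ADCBoardCheck output is long, it can be convenient to filter it for boards that are not READY. The sketch below assumes every board is reported on a line of the form "Board N STATUS", as in the output above; the wording used for a board that is not ready is not shown on this page, so anything other than READY is simply reported as is. The function name boards_not_ready is illustrative.

```python
import re

BOARD_LINE = re.compile(r"^Board (\d+) (\S.*)$")


def boards_not_ready(output):
    """Map board number to reported status for every board that is not READY."""
    problems = {}
    for line in output.splitlines():
        match = BOARD_LINE.match(line.strip())
        if match and match.group(2) != "READY":
            problems[int(match.group(1))] = match.group(2)
    return problems
```

Like the scripts themselves, this only reads the output and takes no corrective action.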