
what's the proper way to read MeerKAT visibility out? #218

Open
astroJyWang opened this issue Feb 5, 2019 · 5 comments

astroJyWang commented Feb 5, 2019

@ludwigschwardt
If I write

vis = data.vis

it's quick, but if I write

vis = data.vis[:, :, :128]

it takes a long, long time...

The slowness seems to be caused either by my limited network speed (10 MBps in my case) or by an inefficient readout method (e.g. perhaps contiguous reads would be much faster?). So what is the proper way to read out a segment of the visibility data? In my case I'm interested in the auto-correlation data.

@ludwigschwardt (Contributor)

There is a big difference between data.vis and data.vis[:]. The first one is a lazy representation of your data, but not the data itself. The second one fetches the actual data. That's why the latter is slow :-)
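For example (a minimal sketch; the filename is just an illustration, borrowed from later in this thread):

import katdal

data = katdal.open('1599998888_sdp_l0.full.rdb')

vis = data.vis     # lazy: a description of the data, nothing is downloaded
vis = data.vis[:]  # eager: fetches all the currently selected data over the network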

The main issue with downloads is the chunking scheme. The data is split into smallish chunks (about 10 MB is a typical size). It is first split in time (i.e. every dump has a different set of chunks) and then in frequency (into contiguous parts of the band). The baselines are not split.

This means that it is very inefficient to get the data for a single antenna (or just the autos, for example). You always get all the baselines with every chunk. This is optimised for imaging but maybe not for what you are doing.
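As a rough illustration, assuming the full 64-antenna array and 4 polarisation products per baseline: each dump contains 64 × 65 / 2 × 4 = 8320 correlation products, of which only 64 × 4 = 256 are autos, so an autos-only reduction still downloads roughly 30 times more data than it actually uses.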

One tip is not to loop over the antennas when downloading data; that could speed things up by a factor of 64. In your case it seems you are already avoiding that, based on the [..., :128] part. Also, do as much selection as possible up front (especially in time) to speed things up: throw out slews and the parts of the band you don't need.
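In code, that could look something like this (a sketch only, assuming the scans/channels/corrprods keywords of katdal's select; the filename and channel range are illustrative):

import katdal

data = katdal.open('1599998888_sdp_l0.full.rdb')

# Select before touching .vis, so only the needed chunks are fetched:
# keep tracking scans, a sub-band of channels and the autos only.
data.select(scans='track', channels=slice(0, 128), corrprods='auto')

vis = data.vis[:]  # downloads just the selected portion in one go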

For high-speed processing you really need to be running your reductions at the CHPC, though.

astroJyWang (Author) commented Feb 6, 2019

So, if we move the raw data to IDIA and do the analysis there, can the chunk problem be avoided?
And why did MeerKAT choose RDB instead of HDF5? Thanks.

@ludwigschwardt (Contributor)

Possibly, if you have fast access to the distributed filesystem where the data will be stored.

We had to move away from a single file because it would be too large for the datasets we expect, hence the chunks. We then chose RDB for metadata because it is closer to our online representation (a Redis database).

richarms commented Mar 5, 2019

Slightly tangentially: how might one write out a .ms file** from a katdal.VisibilityDataV4 object that has undergone a select operation? For example:

d = katdal.open('1599998888_sdp_l0.full.rdb')
d.select(timerange=('2019-03-05 22:00:00', '2019-03-05 23:00:00'), freqrange=(1400e6, 1420e6), scans='track')
d.toms('msname.ms')  # !??

** Why would I want to do this? ILUFU pipelines currently need .ms file inputs, and I imagine other things might too.

*** I know that mvftoms.py exists and I can use it. However, this route is not ideal, since it does not offer the full range of select operations and, I imagine, is similarly unoptimised.

@ludwigschwardt (Contributor)

That is a long-standing dream of mine :-) The short-term solution is to hack select into your own copy of mvftoms.py.

There is some thought required on how to handle calibration and averaging (pre- or post-selection), but I think this is a worthy pursuit. Maybe make a ticket :-)
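In the meantime, the hack could look roughly like this (a sketch only; the select call is copied from your example above, and where exactly it slots into mvftoms.py is left open):

import katdal

# Inside your own copy of mvftoms.py, apply the selection to the opened
# dataset before the conversion loop runs, so only that data is exported.
data = katdal.open('1599998888_sdp_l0.full.rdb')
data.select(timerange=('2019-03-05 22:00:00', '2019-03-05 23:00:00'),
            freqrange=(1400e6, 1420e6), scans='track')
# ... then hand `data` to the existing MS-writing code in mvftoms.py ...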
