3GCII Software Discussion
Packages:
- CASA (used with KAT, EVLA, ALMA, ATCA)
- MIRIAD
- NEWSTAR (WSRT)
- AIPS (VLA)
- BBS and related software (LOFAR)
- ASKAPsoft (ASKAP)
- Difmap
Tensions:
- Reduce on laptop ←→ supercomputer-scale data processing
- In-house development ←→ fully productized software
Peter's Summary of Discussion Topics, Based on Manu's Notes
Most packages don't have very good facilities for making and manipulating sky models, while this is really the most important part of the calibration process.
Everyone tries to write their own flagging package, even though this always turns out to be more work than you think it should be. The lack of sharing is not at all unusual, but still not good.
Large data volumes require realtime processing and remote-friendly data reduction, e.g., log in to a server and reduce from afar, rather than download data to your laptop and work there. Makes life easier in several ways, though at the moment GUIs are usually annoyingly sluggish over most network connections.
Shared usage of casacore generally considered a good thing. Instrument-specific tools built on top of a common layer with a common data format.
Professional software developers needed, but transparent communication between them and scientists very important.
Data formats very important. Probably not possible / desirable to have a single software package that does everything that everyone could ever want, but plausible to have a single data format that every package can build on. If the data are well-defined, you can always swap out software if you need to; if a data format can only reasonably be used by one package, you can never leave it.
Large data volumes imply that shared data formats should ideally be not only disk-based, but stream-compatible as well; the most expensive operation is to iterate over your dataset. (This fact also has scary implications for modern imaging algorithms that need to move back and forth between image and visibility domains.) It will be much faster if multiple packages can be strung together in a pipeline, rather than invoked sequentially on a bunch of mutated datasets. (And it will also require a lot less disk space.)
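A rough illustration of the "strung together in a pipeline" idea, not taken from any of the packages above: a minimal Python sketch in which each reduction step is a generator, so the data are iterated over once and no intermediate dataset is written to disk. The Vis record, the step names (flag_outliers, apply_gain, mean_unflagged) and the threshold and gain values are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class Vis(NamedTuple):
    """Hypothetical visibility record; not the MS schema."""
    time: float
    baseline: Tuple[int, int]
    value: complex
    flagged: bool = False


def flag_outliers(stream: Iterable[Vis], threshold: float) -> Iterator[Vis]:
    """Flag any visibility whose amplitude exceeds the threshold."""
    for v in stream:
        yield v._replace(flagged=v.flagged or abs(v.value) > threshold)


def apply_gain(stream: Iterable[Vis], gain: complex) -> Iterator[Vis]:
    """Multiply every visibility by a single (assumed constant) gain."""
    for v in stream:
        yield v._replace(value=v.value * gain)


def mean_unflagged(stream: Iterable[Vis]) -> complex:
    """Consume the stream and average the unflagged visibilities."""
    total, count = 0j, 0
    for v in stream:
        if not v.flagged:
            total += v.value
            count += 1
    return total / count if count else 0j


# Stand-in for a reader that would yield records from disk or a socket.
sample = [
    Vis(0.0, (0, 1), 1 + 0j),
    Vis(1.0, (0, 1), 100 + 0j),  # outlier, gets flagged
    Vis(2.0, (0, 1), 3 + 0j),
]

# The steps are chained lazily: the data are traversed exactly once.
result = mean_unflagged(apply_gain(flag_outliers(iter(sample), 10.0), 0.5))
print(result)
```

Each step sees one record at a time, so memory use stays flat however large the input is. Real packages would need an agreed record layout to pass between them, which is the role a stream-compatible shared format would play.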
The Measurement Set specification needs some love. People seem to be converging on this as a data format for interferometry data (despite some limitations), so getting MS to be well-specified is important.
In retrospect, maybe using an industry-standard container format like HDF5 would have been a better choice, but the hard part is specifying the semantics; using HDF5 to contain MS would really only save you writing some I/O code and a few low-level manipulation routines.
Duplication of effort sucks, but so does seeing something weird happening and having no other package available to check whether the problem is a bug or something new in the data.
Good practice for observatories to have a site-specific archive format, and then a "filler" that translates into a standard format such as MS, preferably on-the-fly in a streaming mode, for some use cases. You'll always be thinking of ways to improve your export into the reduction-friendly format. Think of correlator output as one of many telemetry streams, and the reducible dataset is built from these telemetry streams.
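To make the "telemetry streams" picture concrete, here is a minimal sketch of a streaming filler, assuming only two streams: correlator dumps and a pointing log. The function name fill, the tuple layouts and the output keys are hypothetical; a real filler would write rows in the standard format rather than yield dictionaries.

```python
from bisect import bisect_right
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def fill(
    correlator_dumps: Iterable[Tuple[float, complex]],
    pointing_log: List[Tuple[float, Tuple[float, float]]],
) -> Iterator[Dict[str, Any]]:
    """Attach the most recent pointing sample to each correlator dump.

    correlator_dumps: (time, visibility) pairs, consumed lazily.
    pointing_log: (time, (az, el)) pairs, sorted by time.
    """
    log_times = [t for t, _ in pointing_log]
    for t, vis in correlator_dumps:
        # Index of the last pointing sample taken at or before time t.
        i = bisect_right(log_times, t) - 1
        pointing = pointing_log[i][1] if i >= 0 else None
        yield {"time": t, "data": vis, "pointing": pointing}


dumps = [(0.5, 1j), (2.0, 2j)]
log = [(1.0, (10.0, 45.0)), (1.5, (11.0, 45.0))]
rows = list(fill(dumps, log))
for row in rows:
    print(row)
```

Because the filler is a generator, it can feed a reduction step directly instead of first writing a complete converted copy. Improving the export later means changing this one translation step, not the archive format.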
Important to be able to easily obtain laptop-sized datasets from an observation, even if such datasets represent only a small fraction of the total observation. Something that you can download relatively quickly and process relatively quickly, on your laptop, so you get a good sense as to what the data are like and how well the observation went.
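One possible way to cut an observation down to laptop size, sketched under simple assumptions: keep every Nth integration and average adjacent frequency channels. The function thin and the (time, spectrum) layout are hypothetical and stand in for whatever selection tools a given package provides.

```python
from typing import Iterable, Iterator, List, Tuple


def thin(
    integrations: Iterable[Tuple[float, List[float]]],
    time_step: int,
    chan_avg: int,
) -> Iterator[Tuple[float, List[float]]]:
    """Keep every time_step-th integration and average chan_avg channels."""
    for i, (t, spectrum) in enumerate(integrations):
        if i % time_step:
            continue  # skip integrations between the kept ones
        averaged = []
        for c in range(0, len(spectrum), chan_avg):
            chunk = spectrum[c:c + chan_avg]
            averaged.append(sum(chunk) / len(chunk))
        yield t, averaged


# Six integrations of four channels each (made-up numbers).
full = [(float(i), [1.0, 3.0, 5.0, 7.0]) for i in range(6)]
small = list(thin(full, time_step=3, chan_avg=2))
print(small)
```

Here the subset is a factor of time_step times chan_avg smaller than the input (six in this toy case), which is the kind of reduction that turns a long download into a quick one while still showing what the data look like.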
Peter's Takeaway
I think the most important idea to spread is that building the software ecosystem is much, much easier if everyone agrees on a data format. MS isn't perfect but it has basically become the standard. Given that, it's important to make the written MS specification as excellent as possible.
With datasets getting bigger and bigger, processing time becomes dominated by I/O on the data. So if we want to be able to process data using multiple tools, it's important to be able to stream data between them. At the least, if your reduction steps involve taking an input dataset and writing a slightly-mutated copy of it, you'll run out of disk space quickly. The alternative, mutating a single dataset in place step after step, is dangerous for reproducibility.
With the growth of datasets and reduction packages typically being a pain to install, it's less and less feasible to reduce data on one's laptop using locally-installed software. Good remote access is the key -- currently not too much of a challenge for text console interactions, but graphical tools fall down. Maybe it's not so far-fetched to build a web interface to one of these tools?
Brad's Takeaway
Data Format Standards
The MS is the latest unofficial “standard” data format that telescopes deliver to astronomers. However, it still takes a lot of work to get different packages to read and write MSs produced by different telescopes. NRAO has a very detailed set of MS definitions (http://casa.nrao.edu/Memos/229.html). I think that there should be a centralized group (probably coordinated by SPDO?) that works towards more universal definitions of the MS and coordinates efforts to develop per-telescope fillers where required.
Different Astronomer Users
In the next 5-10 years, radio telescope projects are going to include small projects (e.g. <24hrs) as well as large surveys (>100hrs). Can we use the same software to calibrate these observations, or are their needs different? Different telescopes are going to develop different pipelines; should telescopes be pipeline driven, or can we provide large datasets that people reduce themselves with CASA? There should be a clearer definition of the various types of users at telescopes, and we should start thinking about whether we can provide a consolidated system of software for all types of users.
Web-based interfaces for calibrations
Motivated by Mike Sipior. Perhaps software and reduction should be provided via a web-based interface, so that astronomers don’t have to carry around their datasets and heavy, hard-to-install software. I personally think that this is a great idea for observatories to implement.
