LSST data science overview
This page provides a broad overview of data science challenges arising from analysis of the Vera C. Rubin Observatory’s planned Legacy Survey of Space and Time (LSST). This information is intended to be accessible to data scientists outside of astronomy (and to be useful to astronomers new to LSST).
Key resources at LSST.org
Basic survey information
Key Numbers | Rubin Observatory — A collection of basic observatory and survey statistics, including observatory, telescope, camera, and filter bandpass specifications, and the sizes and schedule for LSST datasets. An excerpt:
Dataset properties:
- Nightly data size: 20 TB/night
- Final database size (DR11): 15 PB
- Real-time alert latency: 60 seconds
Data Releases:
- Survey duration: 10 years
- Number of Data Releases (DRs): 11
- Number of objects (full survey, DR11):
- 20 billion galaxies
- 17 billion resolved stars
- 6 million orbits of solar system bodies
- Average number of alerts per night: about 10 million
The Rubin Observatory science community
Two organizations provide broad support for the Rubin science community (see our Proposal Support page for potential sources of financial support for individual investigators and small teams):
- The Rubin Observatory Project (aka “the project”) is the organization responsible for construction of the observatory, the raw data processing pipeline (“the DM [Data Management] pipeline”), and the community-facing data access and analysis environment, the Rubin Science Platform (RSP). It is also responsible for ongoing operation and maintenance of the observatory and computing infrastructure (“operations”). The web home of the project is at LSST.org and RubinObservatory.org (an alias for the LSST.org site). The project is mainly funded via the National Science Foundation (NSF) and the Department of Energy (DOE) in the US. The observatory’s main telescope is named the Simonyi Survey Telescope, honoring philanthropic funding from the Charles and Lisa Simonyi Fund.
- The LSST Corporation (LSSTC) “is an independent non-profit that is transforming the long-term scientific and societal impact of Rubin LSST, through innovative programs that emphasize networks, students, and community.” LSSTC played a key role in initiating the Large Synoptic Survey Telescope project (which eventually became the Rubin Observatory and the cleverly-backronymed LSST survey). Its activities are funded by membership fees from academic and research institution members, and by philanthropic donations.
The Rubin Observatory is located on Cerro Pachón in Chile, and Chile contributes significantly to the project. Wikipedia has an entry on the Vera C. Rubin Observatory with a helpful overview of the project’s history and funding.
Data rights: Professional astronomers in the US and Chile automatically have rights to access officially released LSST data. Many other scientists have obtained data rights via in-kind contributions to the LSST community (negotiated with the Rubin Observatory project). For more information about data access, see RDO-013 — Rubin Data Policy (also at the shortcut URL, LS.ST/RDO-013; see LSST URL Shortener for instructions on accessing LSST official documents via LS.ST, including past versions).
Science collaborations: Hundreds of scientists pursuing LSST-related research have self-organized into eight Science Collaborations (SCs), including the ISSC. The SCs are officially recognized by the project and by LSSTC. The SC scientific areas are:
- Galaxies
- Stars, Milky Way, and Local Volume
- Solar System
- Dark Energy (including dark matter and cosmology more generally)
- Active Galactic Nuclei
- Transients & variable stars
- Strong Gravitational Lensing
- Informatics and Statistics
Some SCs require data rights for membership; others are open to scientists who may not have data rights (as is the case for the ISSC).
Community: The project maintains an online forum for discussion of Rubin-related topics—the Rubin Observatory LSST Community forum, known colloquially as “Community.” Everyone is welcome to obtain a Community account to access Rubin news and participate in discussion and Q&A; data rights are not required.
Glossaries of technical terms and acronyms
The Rubin Project maintains a helpful glossary of technical terms and acronyms associated with Rubin/LSST: Glossary & Acronyms. Rubin’s Education and Public Outreach (EPO) site hosts a simplified glossary: Rubin Education Glossary.
As examples of what you’ll find in the glossary, note the specific ways that “object” and “source” are used in LSST publications:
- Object: Refers to an astronomical object, such as a star, galaxy, asteroid, or other physical entity. Objects can be static, or change brightness or position with time. Generally an object will be associated with more than one instance of a source detection.
- Source: A single detection of an astrophysical object in an image, the characteristics for which are stored in the Source Catalog of the science database. The Data Management System attempts to associate multiple source detections to single objects, which may vary in brightness or position over time.
LSST data challenges in brief
LSST data products
LSST data will be shared via two mechanisms, operating on different time scales/schedules:
- Nightly alert streams, providing data on transient, variable, and moving objects whose properties have been observed to change on each night. The alerts will be sent in near real time to a number of officially recognized third-party event brokers, which will further process and annotate the alerts in various ways and serve them to various communities (a packet-reading sketch appears below).
- Annual data releases (with an extra release halfway through the first year), with two main data products:
- Processed images;
- Catalogs of properties of objects, including time series data comprising measurements of the sources associated with a particular object.
For more details about the data products, see: Data Products | Rubin Observatory.
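LSST alert packets are serialized in Apache Avro. As a minimal sketch of what downstream or broker-side processing looks like, the following reads packets that have been saved (with their schema) to an Avro file using the fastavro package; the field names shown are based on the public LSST alert schema and are assumptions to be checked against the current schema version.

```python
import fastavro

# Minimal sketch: iterate over alert packets previously saved (with their
# schema) to an Avro file. The field names below (alertId, diaSource, ra,
# decl) follow the public LSST alert schema but should be verified against
# the current schema version.
with open("alerts.avro", "rb") as f:
    for alert in fastavro.reader(f):
        src = alert["diaSource"]  # the triggering difference-image detection
        print(alert["alertId"], src["ra"], src["decl"])
```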
Colocated processing with the RSP: With petabytes of image data, and catalogs with data from tens of billions of objects, LSST is clearly a “big data” enterprise. Due to the scale of the data releases, it is not practical for scientists to download full copies of image or catalog databases for local analysis. The project maintains the RSP to enable data analysis colocated with the data. The RSP provides three aspects for interacting with LSST data:
- A web-based interface for querying image and catalog data.
- A Jupyter notebook environment for colocated data access and processing using project-developed Python packages (and user-specific code).
- A web API for remote programmatic access to LSST data.
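As an illustrative sketch of the API aspect: the RSP exposes catalogs via the IVOA Table Access Protocol (TAP), queryable in ADQL from Python with the pyvo package. The endpoint and table names below are those used for the DP0 data previews and are assumptions for later releases; in practice an RSP authentication token is also required.

```python
import pyvo

# Connect to an RSP TAP endpoint (URL from the DP0 previews; adjust for
# current data releases, and supply an RSP authentication token).
service = pyvo.dal.TAPService("https://data.lsst.cloud/api/tap")

# A "low-volume" ADQL cone search. The table and column names follow the
# DP0.2 schema and may differ in future releases.
query = """
SELECT coord_ra, coord_dec, g_cModelFlux
FROM dp02_dc2_catalogs.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
               CIRCLE('ICRS', 62.0, -37.0, 0.05)) = 1
"""
results = service.search(query).to_table()  # returns an astropy Table
```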
A spectrum of data science scales
Not all of the data challenges astronomers will face when using LSST will be big data problems. LSST science will require statistical and informatics innovation across a wide variety of scales.
The ISSC’s founding proposal summarized some problems astronomers will face, distinguishing three scales for the number of objects or sources (“entities”) being analyzed (each possibly with many attributes). The three scales differ by factors of ~$10^3$, roughly delineating regimes where computational resource constraints lead to fundamental shifts in how one thinks about data analysis problems.
- Kiloscale (up to ~$10^4$ entities) — A fundamental kiloscale LSST problem is analysis of multiband, multi-epoch photometry (brightness measurements) for a single object in the Object Catalog, where the entity is a source measurement. For a single object, these data comprise a multivariate time series with ~100 observations per year (spread across 6 bands, asynchronously and irregularly sampled in time), with ~1000 measurements per object by DR11. Such data will be used to study source variability in several bands, which falls under the large rubric of multivariate time series analysis. A second class of kiloscale problem arises in demographic analysis of catalog data for modest-sized populations; e.g., trans-Neptunian objects (TNOs; ~20k expected), gamma-ray burst (GRB) hosts (potentially ~100 per year), microlensing events, and relatively rare classes of stars, galaxies, or clusters. Kiloscale problems are generally not challenging in data processing or storage, but will benefit from a variety of statistical techniques such as outlier and change point detection, survival analysis (treatment of upper limits), periodicity searches, and statistical inference involving heteroscedastic (different for each sample) measurement errors. A periodogram sketch appears after this list.
- Megascale ($10^5$ to $10^7$ entities) — Megascale problems include demographic analysis of larger populations (e.g., quasars; variable Galactic stars; low redshift galaxies); population-level re-analysis of previous survey catalogs (e.g., from SDSS) supplemented by LSST follow-up; and multicolor, multi-epoch image analysis for extended objects, where the entity is a pixel. Megascale problems require processing datasets occupying several megabytes to gigabytes of storage, corresponding to “low-volume” queries against LSST data products. In this regime, statistical methods must be computationally efficient, and advanced data visualization techniques are needed. A streaming-statistics sketch appears after this list.
- Gigascale ($10^8$ to $10^{10}$ entities) — One class of gigascale problem using LSST image data is analysis of calibrated image data for a single LSST field or a modest number of fields. A second class, using catalog data, includes demographic analysis of very large samples of stars (~17 billion total expected) or galaxies (~20 billion total expected). A third class includes the search for rare or serendipitous objects, e.g., via anomaly detection approaches. Gigascale problems require very efficient (and thus limited) processing of data streams from datasets occupying terabytes of storage, corresponding to high-volume queries against LSST data products. Several nightly DM pipeline tasks also fall into this category. An anomaly detection sketch appears after this list.
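To make the kiloscale regime concrete, here is a minimal sketch of a single-band periodicity scan using astropy’s Lomb-Scargle periodogram, which accepts per-point (heteroscedastic) measurement errors. The light curve here is a synthetic placeholder standing in for one object’s source measurements.

```python
import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(0)

# Synthetic placeholder for one object's single-band light curve:
# irregular observation times (days), magnitudes, and heteroscedastic
# (per-point) measurement errors.
t = np.sort(rng.uniform(0, 3650, 1000))          # ~10 years of visits
dy = rng.uniform(0.01, 0.1, t.size)              # varying error bars
y = 18.0 + 0.3 * np.sin(2 * np.pi * t / 2.7) + rng.normal(0, dy)

# Lomb-Scargle periodogram for unevenly sampled data; passing dy weights
# each point by its measurement error.
frequency, power = LombScargle(t, y, dy).autopower()
best_period = 1 / frequency[np.argmax(power)]    # ~2.7 d for this fake data
```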
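For the megascale regime, where computational efficiency matters, here is a sketch of single-pass (“streaming”) computation of a mean and variance over a catalog column delivered in chunks, such as successive pages of a low-volume query, so the full column never needs to fit in memory. The update rules are the standard parallel-variance combining formulas; the chunks here are synthetic placeholders.

```python
import numpy as np

def streaming_mean_var(chunks):
    """Single-pass mean and (population) variance over an iterable of 1-D
    arrays, using the standard parallel-variance combining formulas, so a
    megascale catalog column never needs to fit in memory at once."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunks:
        k = x.size
        delta = x.mean() - mean
        m2 += x.var() * k + delta**2 * n * k / (n + k)
        mean += delta * k / (n + k)
        n += k
    return mean, m2 / n

# The chunks could be successive pages of a low-volume catalog query;
# here they are synthetic placeholders.
chunks = (np.random.default_rng(i).normal(size=100_000) for i in range(50))
mu, var = streaming_mean_var(chunks)
```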
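And for the gigascale rare-object search, a sketch of anomaly detection with scikit-learn’s IsolationForest. The feature matrix is a placeholder; at true gigascale the fit and scoring would run on subsamples or streamed chunks of the catalog.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Placeholder feature matrix: rows are objects, columns are catalog
# features (e.g., colors, variability indices). At true gigascale the
# fit and scoring would run on subsamples or streamed chunks.
X = rng.normal(size=(200_000, 5))

# Fit on a manageable subsample, then score everything; the most negative
# scores flag candidate anomalies for vetting and follow-up.
clf = IsolationForest(n_estimators=200, random_state=0).fit(X[:50_000])
scores = clf.score_samples(X)
candidates = np.argsort(scores)[:1000]
```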
Roughly speaking, kiloscale problems will raise challenges that are essentially statistical: devising methods for handling novel types of data and models that optimally extract information from LSST data and provide careful uncertainty quantification. Gigascale problems will raise challenges that are essentially computational, in the realm of informatics and machine learning: finding algorithms that make it possible to perform specific, relatively straightforward tasks across enormous databases. At this scale, sophisticated statistical modeling may not be possible without significant approximations, though simulation-based inference and emulator technology offer promise for making such modeling feasible. Megascale problems occupy a middle ground, where significant innovation may be required on both the statistical and computational fronts.
Many important problems have a hierarchy of scales. A key example is nightly pipeline processing: localized image processing is needed to identify sources and associate them with previously detected objects; but this apparently kiloscale problem must be executed for billions of sources per night, making the overall problem gigascale and constraining the level of statistical sophistication. Similar challenges arise in nightly alert processing by event brokers, and several brokers provide colocated processing so that scientists can develop customized classifiers and other algorithms.
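The nightly association step illustrates this hierarchy: the per-source operation is a simple nearest-neighbor lookup, but it must run billions of times per night. A minimal flat-sky sketch using scipy’s k-d tree follows; a real pipeline would work on the sphere and model astrometric uncertainty.

```python
import numpy as np
from scipy.spatial import cKDTree

def associate(obj_ra, obj_dec, src_ra, src_dec, radius_arcsec=1.0):
    """Associate new source detections with known objects by nearest
    neighbor. Flat-sky (small-angle) approximation, adequate for a sketch;
    a real pipeline works on the sphere and models astrometric errors."""
    cosd = np.cos(np.radians(obj_dec.mean()))
    tree = cKDTree(np.column_stack([obj_ra * cosd, obj_dec]))
    dist, idx = tree.query(np.column_stack([src_ra * cosd, src_dec]))
    matched = dist * 3600.0 <= radius_arcsec   # degrees -> arcseconds
    return idx, matched   # object index per source, and a match flag
```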
Representative analysis tasks
Here we identify some representative data analysis tasks, many occurring across a range of the scales delineated above.
Discovery — These tasks address detecting and identifying astronomical objects, structures, and events:
- Adjusting thresholds to control the number of false detections or associations, taking into account the huge multiplicity of tests that a large survey performs. This is a fundamental DM pipeline task, determining the behavior of nightly detection pipeline processing. Control of the False Discovery Rate (FDR) under multiple testing, and classification within a hierarchical Bayesian framework, are two examples of approaches potentially relevant to optimizing detection thresholds. A Benjamini-Hochberg sketch appears after this list.
- Faint source detection in multicolor/multi-epoch data cubes. The DM pipeline will address this task for Data Releases (DRs). It will also arise in re-analyses of image data products that attempt to push LSST to its dimmest detectable sources, e.g., for TNO, faint star (e.g., white dwarf), and faint (high redshift) galaxy and AGN studies, particularly using “deep drilling” data (from fields on the sky where LSST will observe with increased frequency). While in many cases the faintest sources will be found in images merged from many epochs, in cases where observing conditions change or transients are present, the faintest sources may instead be found in localized portions of the data cubes. Finding these sources is both a statistical and computational challenge. A stack-and-threshold sketch appears after this list.
- Classification of objects within a population, including both supervised classification (assigning objects to previously identified classes) and unsupervised classification (where the number and characteristics of classes are derived from the data), arising in the study of all sizable populations, from minor planets to distant galaxies. A wide variety of multivariate classification tools are available. A scikit-learn sketch appears after this list.
- Flexible and adaptive study of variability and transients in time series. The detection of short-lived transients or marginal variability in sources is a problem in statistical inference where the heteroscedastic measurement errors due to changing observing conditions will play an important role. Once variability is established, both time domain and frequency domain time series techniques are relevant. A number of algorithms for establishing autocorrelated and/or periodic variations can be applied to the time series in a single photometric band or in all bands simultaneously. Simple nonparametric measures like the partial autocorrelation function or the minimal string length measure may be appropriate for scanning for variability and periodicity in gigascale LSST catalogs. A string-length sketch appears after this list.
- Cross-matching within and between large catalogs, with accurate accounting for astrometric uncertainties. This task arises in all application areas involving correlative analysis, and becomes particularly challenging if the target catalog has significant direction uncertainties. Current astrostatistics research efforts are addressing this task by marrying directional statistics, product partition models, and Monte Carlo methods for searching the space of likely associations and accounting for multiple comparisons. These procedures can take into account knowledge of the local densities of objects in the catalogs. An astropy cross-matching sketch appears after this list.
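Minimal sketches of several of these tasks follow; all use synthetic placeholder data. First, threshold adjustment under multiplicity: given per-candidate p-values, the Benjamini-Hochberg step-up procedure controls the FDR at a chosen level.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Benjamini-Hochberg step-up procedure: returns a boolean mask of
    detections with the false discovery rate controlled at level alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # largest i with p_(i) <= alpha*i/m
        keep[order[: k + 1]] = True
    return keep
```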
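For faint source detection, the simplest multi-epoch strategy is to stack aligned epochs so that noise averages down, then threshold the coadd; real pipelines are far more sophisticated, but this conveys the idea.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(2)

# Synthetic placeholder: 50 aligned epochs of the same 64x64 patch of sky,
# pure unit-variance noise plus one faint source at pixel (32, 32) that is
# only ~1 sigma per epoch, i.e., invisible in any single image.
cube = rng.normal(0.0, 1.0, size=(50, 64, 64))
cube[:, 32, 32] += 1.0

coadd = cube.mean(axis=0)                # noise drops as 1/sqrt(50)
snr = coadd / (1.0 / np.sqrt(50))        # per-pixel significance
labels, nsrc = ndimage.label(snr > 5.0)  # 5-sigma detections: the source
```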
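For classification, a sketch of both modes on a placeholder feature matrix with scikit-learn: a random forest for the supervised case and a Gaussian mixture for the unsupervised case.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 8))       # placeholder catalog features
y = rng.integers(0, 3, size=5000)    # placeholder labels for 3 known classes

# Supervised: learn previously identified classes from a labeled subset.
clf = RandomForestClassifier(n_estimators=200).fit(X[:4000], y[:4000])
pred = clf.predict(X[4000:])

# Unsupervised: a mixture model proposes classes from the data alone;
# the number of components could be chosen by BIC in practice.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
clusters = gmm.predict(X)
```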
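For cheap variability and periodicity scans, a sketch of the minimal string length measure mentioned above: fold the light curve at trial periods and sum the absolute magnitude differences in phase order; minima mark candidate periods.

```python
import numpy as np

def string_length(t, y, periods):
    """Lafler-Kinman-style string length: fold the light curve at each
    trial period and sum |successive magnitude differences| in phase
    order. Minima mark candidate periods; the cost per trial period is
    a sort, cheap enough for large-scale scans."""
    lengths = np.empty(periods.size)
    for i, p in enumerate(periods):
        phase = (t / p) % 1.0
        yy = y[np.argsort(phase)]
        lengths[i] = np.abs(np.diff(yy)).sum()
    return lengths

# usage: best = periods[np.argmin(string_length(t, y, periods))]
```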
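For cross-matching, astropy provides nearest-neighbor matching on the sphere; the hard 1-arcsecond radius below stands in for the careful treatment of astrometric uncertainty and match ambiguity described above.

```python
import astropy.units as u
from astropy.coordinates import SkyCoord

# Placeholder coordinates (degrees) for two catalogs.
cat1 = SkyCoord(ra=[10.001, 45.0] * u.deg, dec=[-5.0, 20.0] * u.deg)
cat2 = SkyCoord(ra=[10.0, 44.999, 180.0] * u.deg,
                dec=[-5.0005, 20.001, 0.0] * u.deg)

# Nearest neighbor in cat2 for each cat1 entry; the hard 1" radius stands
# in for proper modeling of astrometric uncertainty and match ambiguity.
idx, sep2d, _ = cat1.match_to_catalog_sky(cat2)
matched = sep2d < 1.0 * u.arcsec
```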
Modeling & Analysis — These tasks involve analysis of images of known sources, or of catalogs of classified objects or events:
- Image analysis, including multi-frame super-resolution and multiscale deconvolution with uncertainty quantification. When instrumental or atmospheric distortions cause blurring, likelihood-based deconvolution based on the known point spread function can sharpen the image. Large volumes of image segments can be reduced and characterized using sparse representations such as wavelets, curvelets, or compressive sensing. Density estimation techniques like locally weighted regression can reconstruct image regions with reduced noise and with variance images to help adjudicate the statistical significance of features. A Richardson-Lucy sketch appears after this list.
- Design and comparison of photometric redshift (photo-z) algorithms, including calibration of the redshift uncertainties needed for a wide variety of extragalactic research, ranging from luminosity function estimation to dark energy studies using Type Ia supernovae (SNe Ia) and weak lensing. Relevant information science developments include multivariate density estimation, Bayesian and neural network clustering, and nonparametric regression, all areas with many recent advances in machine learning and statistics. Statistical study can also provide guidance for spectroscopic measurements to improve predictions. A regression-based photo-z sketch appears after this list.
- Flexible modeling of multivariate distributions for populations, accounting for selection effects including truncation in a single survey, censoring in a follow-up survey, and measurement error in all surveys. This class of task arises in nearly all population modeling applications; e.g., orbit/size/color distributions of minor planets; color-magnitude diagrams of stars; distance estimator calibration for galaxies (Tully-Fisher and fundamental plane relations); and luminosity functions of galaxies, AGN, and transient populations. Relevant, active research frontiers in the information sciences include semiparametric and nonparametric inference with heteroscedastic measurement error, and nonparametric and parametric survival analysis. A truncated-likelihood sketch appears after this list.
- Characterizing multicolor variability. A wide variety of temporal behavior will be discovered and studied with LSST. Examples include: periodic variability of minor planets (to study rotation, geometry, and composition); periodic variations of stars due to rotation and pulsation, including Cepheid and RR Lyrae period-luminosity-color correlations for use as low-z distance indicators; characterizing the smooth multicolor light curves of SNe Ia for use as distance indicators in dark energy studies; and characterizing the wide range of smooth and chaotic variability across AGN populations. Astronomers have already developed a variety of methods for periodicity searches in unevenly spaced data, although research is needed to incorporate measurement errors. Relevant frontier information science studies include: nonparametric harmonic analysis; multivariate nonparametric regression with sparse Gaussian processes and PCA-based dimension reduction; and space-time modeling (translated to wavelength-time) with multivariate stochastic processes built over overcomplete bases (for sparse representation). A Gaussian process sketch appears after this list.
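Again, a few minimal sketches follow. For likelihood-based deconvolution with a known point spread function, here is a bare Richardson-Lucy iteration (Poisson maximum likelihood), with no regularization or uncertainty quantification, which real applications would need.

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(image, psf, n_iter=30):
    """Minimal Richardson-Lucy deconvolution (Poisson maximum likelihood
    with a known PSF, assumed normalized to sum to 1). No regularization
    or uncertainty quantification, which real applications require."""
    est = np.full_like(image, image.mean())
    psf_flip = psf[::-1, ::-1]
    for _ in range(n_iter):
        conv = fftconvolve(est, psf, mode="same")
        ratio = image / np.clip(conv, 1e-12, None)
        est *= fftconvolve(ratio, psf_flip, mode="same")
    return est
```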
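For photo-z estimation viewed as nonparametric regression, a sketch mapping colors to redshift with a random forest; the training set is a synthetic placeholder. Point estimates alone are insufficient for cosmology; per-object redshift PDFs and calibration of their coverage are the harder part of the task.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Placeholder training set: galaxies with spectroscopic redshifts (rows)
# and colors (columns, e.g., u-g, g-r, r-i, i-z, z-y).
colors = rng.normal(size=(20_000, 5))
z_spec = np.abs(rng.normal(0.8, 0.4, size=20_000))

model = RandomForestRegressor(n_estimators=300).fit(colors, z_spec)
z_phot = model.predict(colors)  # applied to photometry-only galaxies in practice
```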
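For selection effects, a minimal example of handling truncation: maximum likelihood estimation of a Gaussian population from which only values above a detection limit are observed, with each point’s likelihood renormalized by the selection probability.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(5)

# Schematic magnitude-limited sample: a Gaussian population of which only
# values above a detection limit are observed (truncation).
limit = 0.0
pop = rng.normal(1.0, 2.0, size=20_000)
sample = pop[pop > limit]

def negloglike(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)   # parametrize log-sigma to keep sigma > 0
    # Each point's log density, renormalized by the selection probability
    # P(x > limit | mu, sigma): the truncation correction.
    return -(stats.norm.logpdf(sample, mu, sigma)
             - stats.norm.logsf(limit, mu, sigma)).sum()

fit = optimize.minimize(negloglike, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
```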
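And for smooth light curves, a sketch of Gaussian process regression with heteroscedastic errors entering through the per-point noise term, using scikit-learn’s GP regressor with a simple RBF kernel; real light curves may need richer kernels, and the data here are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(6)

# Placeholder single-band light curve with heteroscedastic errors.
t = np.sort(rng.uniform(0, 100, 80))[:, None]
dy = rng.uniform(0.02, 0.1, 80)
y = np.sin(t[:, 0] / 5.0) + rng.normal(0, dy)

# Per-point measurement variance enters through alpha = dy**2; the RBF
# kernel encodes smoothness (real light curves may need richer kernels).
kernel = ConstantKernel(1.0) * RBF(length_scale=10.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=dy**2).fit(t, y)
t_grid = np.linspace(0, 100, 500)[:, None]
mean, std = gp.predict(t_grid, return_std=True)  # smooth curve + error band
```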
Data visualization — All areas of LSST research will also need advanced data visualization techniques. For kiloscale problems, “Grand Tour” movies of rotating datacubes with interactive brushing and classification are useful. For megascale and gigascale problems, quantile contour maps and shaded density maps must be rapidly produced from portions of the data. Shaded parallel coordinate maps with brushing may also be powerful interactive visualization tools for multivariate data with more than three dimensions.
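As a sketch of the shaded density map idea: bin the catalog columns once with a 2-D histogram, then display the counts on a logarithmic color scale; the cost is a single pass over the data, unlike point-by-point scatter plotting. The columns here are synthetic placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

rng = np.random.default_rng(7)

# Placeholder: two catalog columns (e.g., a color and a magnitude) with
# far too many points for an ordinary scatter plot.
x = rng.normal(0.8, 0.3, size=2_000_000)
y = rng.normal(20.0, 1.5, size=2_000_000)

# Shaded density map: bin once, then display the counts with a log color
# scale; one O(N) pass over the data instead of rendering every point.
h, xe, ye = np.histogram2d(x, y, bins=512)
plt.imshow(h.T, origin="lower", norm=LogNorm(),
           extent=[xe[0], xe[-1], ye[0], ye[-1]], aspect="auto")
plt.colorbar(label="objects per bin")
plt.show()
```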
The list above is hardly exhaustive, neither in the tasks listed nor in the applications cited. Incomplete though it may be, it already makes clear that LSST poses an almost dizzying variety of research-level data and science analysis challenges. It also makes clear that there are many opportunities to share expertise and research resources across applications, science collaborations, and disciplines.