LSST data science overview

This page provides a broad overview of data science challenges arising from analysis of the Vera C. Rubin Observatory’s planned Legacy Survey of Space and Time (LSST). This information is intended to be accessible to data scientists outside of astronomy (and to be useful to astronomers new to LSST).

Key resources at LSST.org

Basic survey information

Key Numbers | Rubin Observatory — A collection of basic observatory and survey statistics, including observatory, telescope, camera, and filter bandpass specifications, and the sizes and schedule for LSST datasets. An excerpt:

Dataset properties:

Data Releases:

The Rubin Observatory science community

Two organizations provide broad support for the Rubin science community (see our Proposal Support page for potential sources of financial support for individual investigators and small teams):

The Rubin Observatory is located on Cerro Pachón in the Chilean Andes, and Chile contributes significantly to the project. Wikipedia has an entry on the Vera C. Rubin Observatory with a helpful overview of the project’s history and funding.

Data rights: Professional astronomers in the US and Chile automatically have rights to access officially released LSST data. Many other scientists have obtained data rights via in-kind contributions to the LSST community (negotiated with the Rubin Observatory project). For more information about data access, see the Rubin Data Policy, RDO-013 (also at the shortcut URL LS.ST/RDO-013; see LSST URL Shortener for instructions on accessing official LSST documents via LS.ST, including past versions).

Science collaborations: Hundreds of scientists pursuing LSST-related research have self-organized into eight Science Collaborations (SCs), including the Informatics and Statistics Science Collaboration (ISSC). The SCs are officially recognized by the project and by LSSTC. The SC scientific areas are:

Some SCs require data rights for membership; others are open to scientists who may not have data rights (the case for the ISSC).

Community: The project maintains an online forum for discussion of Rubin-related topics—the Rubin Observatory LSST Community forum, known colloquially as “Community.” Everyone is welcome to obtain a Community account to access Rubin news and participate in discussion and Q&A; data rights are not required.

Glossaries of technical terms and acronyms

The Rubin Project maintains a helpful glossary of technical terms and acronyms associated with Rubin/LSST: Glossary & Acronyms. Rubin’s Education and Public Outreach (EPO) site hosts a simplified glossary: Rubin Education Glossary.

As examples of what you’ll find in the glossary, note the specific ways that “object” and “source” are used in LSST publications:

LSST data challenges in brief

LSST data products

LSST data will be shared via two mechanisms, operating on different time scales/schedules:

For more details about the data products, see: Data Products | Rubin Observatory.

Colocated processing with the RSP: With petabytes of image data, and catalogs describing tens of billions of objects, LSST is clearly a “big data” enterprise. At this scale it is not practical for scientists to download full copies of the image or catalog databases for local analysis. Instead, the project maintains the Rubin Science Platform (RSP) so that analysis can run where the data reside. The RSP provides three aspects for interacting with LSST data:

A spectrum of data science scales

Not every data challenge that astronomers face with LSST will be a big data problem. LSST science will require statistical and informatics innovation across a wide range of scales.

The ISSC’s founding proposal summarized some problems astronomers will face, distinguishing three scales for the number of objects or sources (“entities”) being analyzed (each possibly with many attributes). The three scales differ by factors of ~$10^3$, roughly delineating regimes where computational resource constraints lead to fundamental shifts in how one thinks about data analysis problems.

Roughly speaking, kiloscale problems will raise challenges that are essentially statistical: devising methods for novel types of data and models that optimally extract information from LSST data and provide careful uncertainty quantification. Gigascale problems will raise challenges that are more computational, in the realm of informatics and machine learning: finding algorithms that make it possible to do specific, relatively straightforward tasks across enormous databases. At this scale, sophisticated statistical modeling may not be possible without significant approximations, though simulation-based inference and emulator technology offer promise for making it feasible. Megascale problems occupy a middle ground, where significant innovation may be required on both the statistical and computational fronts.
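To make the shift between regimes concrete, here is a rough arithmetic sketch. It assumes the three scales correspond to about $10^3$, $10^6$, and $10^9$ entities, and it uses a dense matrix of pairwise quantities (one float64 per pair) as a stand-in for any method whose cost grows as $N^2$. The function names are illustrative only.

```python
# Rough arithmetic: memory for a dense N x N matrix of float64 pairwise
# quantities (e.g., distances) at the three entity-count scales.
# The mapping of scale names to 1e3 / 1e6 / 1e9 entities is an assumption.
BYTES_PER_FLOAT64 = 8

SCALES = {"kiloscale": 10**3, "megascale": 10**6, "gigascale": 10**9}


def dense_pairwise_bytes(n_entities: int) -> int:
    """Bytes needed to store one float64 per ordered pair of entities."""
    return n_entities * n_entities * BYTES_PER_FLOAT64


def human_readable(n_bytes: int) -> str:
    """Format a byte count using decimal (SI) prefixes."""
    units = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1000


for name, n in SCALES.items():
    size = human_readable(dense_pairwise_bytes(n))
    print(f"{name:>9}: N = {n:.0e}, dense pairwise matrix ~ {size}")
```

Under these assumptions the matrix grows from about 8 MB (kiloscale) to 8 TB (megascale) to 8 EB (gigascale). A method that is routine on a laptop at the first scale needs a different algorithm, not just more hardware, at the last.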

Many important problems have a hierarchy of scales. A key example is nightly pipeline processing: localized image processing is needed to identify sources and associate them with previously detected objects; but this apparently kiloscale problem must be executed for billions of sources per night, making the overall problem gigascale and constraining the level of statistical sophistication. Similar challenges arise in nightly alert processing by event brokers, and several brokers provide colocated processing for scientists to develop customized classifiers and other algorithms.
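As an illustration of the localized step, the sketch below associates newly detected sources with known objects by nearest position within a matching radius. It is a toy under stated assumptions (a small flat patch of sky, brute-force separations, hypothetical function and variable names) and is not the algorithm used by the Rubin pipelines.

```python
import numpy as np


def associate_sources(src_xy, obj_xy, max_sep):
    """Match each new source to the nearest known object within max_sep.

    Returns an integer array with the matched object index per source,
    or -1 where no object lies within max_sep (a candidate new object).
    Positions are (x, y) in a small flat patch; units are arbitrary but
    must match those of max_sep.
    """
    src_xy = np.asarray(src_xy, dtype=float)
    obj_xy = np.asarray(obj_xy, dtype=float)
    if len(obj_xy) == 0:
        return np.full(len(src_xy), -1, dtype=int)
    # Brute-force (n_src, n_obj) separation matrix: fine for one small patch.
    diff = src_xy[:, None, :] - obj_xy[None, :, :]
    sep = np.hypot(diff[..., 0], diff[..., 1])
    nearest = sep.argmin(axis=1)
    matched = sep[np.arange(len(src_xy)), nearest] <= max_sep
    return np.where(matched, nearest, -1)


# Three known objects and three new sources; the last source has no match.
objects = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
sources = np.array([[0.1, -0.1], [9.8, 0.2], [5.0, 5.0]])
match = associate_sources(sources, objects, max_sep=0.5)
print(match)
```

The brute-force separation matrix is acceptable for a handful of entities in one patch. Repeating the step for billions of sources is what pushes the overall problem toward spatial indexing and simpler per-source logic.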

Representative analysis tasks

Here we identify some representative data analysis tasks, many occurring across a range of the scales delineated above.

Discovery — These tasks address detecting and identifying astronomical objects, structures, and events:

Modeling & Analysis — These tasks involve analysis of images of known sources, or of catalogs of classified objects or events:

Data visualization — All areas of LSST research will also need advanced data visualization techniques. For kiloscale problems, “Grand Tour” movies of rotating datacubes with interactive brushing and classification are useful. For megascale and gigascale problems, quantile contour maps and shaded density maps must be rapidly produced from portions of the data. Shaded parallel coordinate maps with brushing may also be powerful interactive visualization tools for multivariate data with more than three dimensions.
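For the megascale and gigascale case, one common pattern is to accumulate a binned density map over chunks of the data, so that the full catalog never has to fit in memory, and then derive contour thresholds from the binned counts. The sketch below assumes NumPy and uses synthetic Gaussian points as a stand-in for a chunked catalog read; the function names and bin choices are illustrative.

```python
import numpy as np


def accumulate_density(chunks, x_edges, y_edges):
    """Sum 2D histogram counts over an iterable of (x, y) array chunks."""
    counts = np.zeros((len(x_edges) - 1, len(y_edges) - 1), dtype=np.int64)
    for x, y in chunks:
        h, _, _ = np.histogram2d(x, y, bins=(x_edges, y_edges))
        counts += h.astype(np.int64)
    return counts


def quantile_levels(counts, fractions=(0.5, 0.9)):
    """Count thresholds whose highest-density bins enclose each fraction.

    Assumes counts contains at least one nonzero bin.
    """
    flat = np.sort(counts.ravel())[::-1]
    cumulative = np.cumsum(flat) / flat.sum()
    return [int(flat[np.searchsorted(cumulative, f)]) for f in fractions]


rng = np.random.default_rng(0)
edges = np.linspace(-4.0, 4.0, 65)
# Stand-in for a chunked catalog read: ten chunks of 100,000 points each.
chunks = (
    (rng.normal(size=100_000), rng.normal(size=100_000)) for _ in range(10)
)
density = accumulate_density(chunks, edges, edges)
levels = quantile_levels(density)
print(density.shape, levels)
```

The resulting `density` array can be passed to an image or contour plotting routine, with `levels` as the contour thresholds. Points outside the bin edges are dropped by the histogram, so the enclosed fractions refer to the binned points only.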

The list above is far from exhaustive, in both the tasks listed and the applications cited. Incomplete though it may be, it already makes clear that LSST poses an almost dizzying variety of research-level data and science analysis challenges. It also makes clear that there are many opportunities to share expertise and research resources across applications, science collaborations, and disciplines.