Getting Started with pm: A Complete Workflow Example
Alon Alexander
2026-01-21
Introduction
The pm package provides a structured approach to
managing research projects in R. This vignette demonstrates a complete
workflow:
- Creating and configuring a new project
- Creating and preprocessing data
- Saving intermediate results
- Loading data from another analysis
- Using the artifact discovery features
TL;DR
# Create a new project (one-time)
pm <- pm_create_project(project_dir)
# Load an existing project
pm <- pm_project(project_dir)
# Look at available inputs
inputs <- pm$parse_inputs()
# Create a new analysis
prep_analysis <- pm$create_analysis("data_preprocessing")
# Create an analysis object from a path (e.g., "." from within the analysis's own code)
analysis <- pm::PMAnalysis$new(path = ".")
# Get output path for storing artifact
cleaned_data_path <- prep_analysis$get_output_path("cleaned_data", type = "table")
# Read/write data (generic for many file types)
cleaned_data_path$write(my_table)
read_table <- cleaned_data_path$read()
# Find artifact of other analysis
cleaned_artifact <- another_analysis$get_artifact(
"cleaned_data",
analysis_name = "data_preprocessing" # Can also pass NULL to search all analylses
)
# Get intermediate path for caching (local to an analysis)
normalized_path <- prep_analysis$get_intermediate_artifact("normalized_data")
Creating a Project
First, let’s create a new project. We’ll use a temporary directory for this example:
# Create a temporary directory for our project
project_dir <- file.path(tempdir(), "my_research_project")
# Create the project structure
pm <- pm_create_project(project_dir)
pm
#> PMProject:
#> Path: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/my_research_project
#> Analyses: 0
The project has been created with the standard structure:
- project.yaml - Project configuration file, including portable input definitions
- inputs.local.yaml - local path mappings (machine-specific)
- analyses/ - directory for all analyses
- README.md - project documentation
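To sanity-check the result, you can inspect the skeleton with base R (a minimal sketch; the exact listing depends on your pm version):
# List everything pm_create_project() generated in the new project directory
list.files(project_dir, recursive = TRUE, include.dirs = TRUE)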
Creating Data
For this example, we’ll create some sample data that simulates a research dataset. In a real scenario, this might come from external sources or experiments.
# Create a temporary location for our "raw" data
data_dir <- file.path(tempdir(), "raw_data")
dir.create(data_dir, showWarnings = FALSE)
# Generate sample data: a simple dataset with measurements
set.seed(42)
raw_data <- data.frame(
sample_id = paste0("S", sprintf("%03d", 1:100)),
treatment = rep(c("Control", "Treatment"), each = 50),
measurement_1 = rnorm(100, mean = 10, sd = 2),
measurement_2 = rnorm(100, mean = 15, sd = 3),
measurement_3 = rnorm(100, mean = 20, sd = 4)
)
# Save the raw data
raw_data_path <- file.path(data_dir, "raw_measurements.csv")
write.csv(raw_data, raw_data_path, row.names = FALSE)
cat("Created raw data file:", raw_data_path, "\n")
#> Created raw data file: /var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T//Rtmpo71dkG/raw_data/raw_measurements.csv
cat("Data dimensions:", nrow(raw_data), "rows,", ncol(raw_data), "columns\n")
#> Data dimensions: 100 rows, 5 columns
Setting Up Project Inputs
Now we need to configure the project to recognize our raw data file. We’ll edit the input configuration files directly in a text editor.
Note: For more information on the different ways to
define project.yaml, see the Input Definitions vignette.
Editing project.yaml
Open project.yaml in your project directory and add the
input definition. The file should look like this:
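A minimal reconstruction in the simple array form (the exact keys here are assumptions; the Input Definitions vignette documents the real schema):
# Sketch only -- key names are assumptions; see the Input Definitions vignette
inputs:
  - raw_measurements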
This file defines the portable input definitions that can be shared across different machines. It describes what the input is, but not where it’s located.
NOTE: This is the simplest format of project.yaml; you can also, for example, provide a description for each input.
Editing inputs.local.yaml
Open inputs.local.yaml in your project directory and add
the path mapping. The file should look like this (with your actual file
path):
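A plausible reconstruction (again, the exact layout is an assumption):
# Sketch only -- maps each input ID to a machine-specific path; layout assumed
raw_measurements: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/raw_data/raw_measurements.csv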
Note: Replace the path above with the actual path to your data file. In this example, we’re using a temporary path, but in practice you would use the real path to your data file.
This file maps the input ID (raw_measurements) to the actual file path on your machine. It is typically gitignored, since paths differ between machines.
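Because of that, a typical .gitignore entry (an assumption about your repository setup) is simply:
inputs.local.yaml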
Verifying the Configuration
After editing both files, we can verify the configuration works:
# Reload the project to pick up the changes
pm <- pm_project(project_dir)
pm
#> PMProject:
#> Path: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/my_research_project
#> Analyses: 0
#> Inputs: 1
# Parse inputs to verify they're configured correctly
inputs <- pm$parse_inputs()
raw_data_input <- inputs$raw_measurements
# Display the input object
raw_data_input
#> PMData:
#> ID: raw_measurements
#> Path: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/raw_data/raw_measurements.csv
Analysis 1: Data Preprocessing
Let’s create our first analysis for data preprocessing. This analysis will:
- Load the raw data
- Perform some transformations
- Save intermediate results
- Create final preprocessed outputs
# Create the preprocessing analysis
prep_analysis <- pm$create_analysis("data_preprocessing")
prep_analysis
#> PMAnalysis:
#> Name: data_preprocessing
#> Path: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/my_research_project/analyses/data_preprocessing
#> Project: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/my_research_project
Loading Input Data
We can load the input data using the project’s
parse_inputs() method:
# Parse and load inputs
inputs <- pm$parse_inputs()
raw_data_input <- inputs$raw_measurements
# Display the input object
raw_data_input
#> PMData:
#> ID: raw_measurements
#> Path: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/raw_data/raw_measurements.csv
# Read the data (as simple as that!)
raw_data_loaded <- raw_data_input$read()
head(raw_data_loaded)
#> sample_id treatment measurement_1 measurement_2 measurement_3
#> 1 S001 Control 12.741917 18.60290 11.99628
#> 2 S002 Control 8.870604 18.13425 21.33511
#> 3 S003 Control 10.726257 11.99037 24.68530
#> 4 S004 Control 11.265725 20.54545 28.23816
#> 5 S005 Control 10.808537 12.99968 14.49255
#> 6      S006   Control      9.787751      15.31654      15.39658
Data Preprocessing Steps
Now let’s perform some preprocessing. We’ll:
- Calculate summary statistics (save as intermediate)
- Normalize measurements (save as intermediate)
- Create a final cleaned dataset (save as output)
Step 1: Calculate summary statistics
summary_stats <- data.frame(
variable = c("measurement_1", "measurement_2", "measurement_3"),
mean = c(
mean(raw_data_loaded$measurement_1),
mean(raw_data_loaded$measurement_2),
mean(raw_data_loaded$measurement_3)
),
sd = c(
sd(raw_data_loaded$measurement_1),
sd(raw_data_loaded$measurement_2),
sd(raw_data_loaded$measurement_3)
)
)
# Save summary stats as an intermediate result
summary_stats_path <- prep_analysis$get_output_path(
"summary_stats",
type = "table",
intermediate = TRUE
)
summary_stats_path$write(summary_stats)
cat("Saved intermediate: summary_stats\n")
#> Saved intermediate: summary_stats
Step 2: Normalize measurements (z-scores)
normalized_data <- raw_data_loaded
normalized_data$measurement_1_norm <- scale(normalized_data$measurement_1)[, 1]
normalized_data$measurement_2_norm <- scale(normalized_data$measurement_2)[, 1]
normalized_data$measurement_3_norm <- scale(normalized_data$measurement_3)[, 1]
# Save normalized data as an intermediate result
prep_analysis$get_output_path(
"normalized_data",
type = "table",
intermediate = TRUE
)$write(normalized_data)
cat("Saved intermediate: normalized_data\n")
#> Saved intermediate: normalized_data
Step 3: Create final cleaned dataset (remove outliers, add derived columns)
cleaned_data <- normalized_data
cleaned_data$total_measurement <-
cleaned_data$measurement_1 +
cleaned_data$measurement_2 +
cleaned_data$measurement_3
# Remove outliers (beyond 3 standard deviations)
cleaned_data <- cleaned_data[
abs(cleaned_data$measurement_1_norm) < 3 &
abs(cleaned_data$measurement_2_norm) < 3 &
abs(cleaned_data$measurement_3_norm) < 3,
]
# Save as final output
cleaned_data_path <- prep_analysis$get_output_path(
"cleaned_data",
type = "table"
)
cleaned_data_path$write(cleaned_data)
cat("Saved output: cleaned_data\n")
#> Saved output: cleaned_data
cat("Final dataset:", nrow(cleaned_data), "rows\n")
#> Final dataset: 99 rows
Listing Outputs
We can list all outputs from this analysis:
# List final outputs
outputs <- prep_analysis$list_outputs()
cat("Final outputs:\n")
#> Final outputs:
for (output in outputs) {
cat(" -", output$id, ":", basename(output$path), "\n")
}
#> - cleaned_data : cleaned_data.parquet
# List intermediate outputs
intermediates <- prep_analysis$list_outputs(intermediate = TRUE)
cat("\nIntermediate outputs:\n")
#>
#> Intermediate outputs:
for (intermediate in intermediates) {
cat(" -", intermediate$id, ":", basename(intermediate$path), "\n")
}
#> - normalized_data : normalized_data.parquet
#> - summary_stats : summary_stats.parquet
Analysis 2: Statistical Analysis
Now let’s create a second analysis that uses the preprocessed data from the first analysis. This demonstrates how analyses can depend on each other.
# Create the statistical analysis
stats_analysis <- pm$create_analysis("statistical_analysis")
stats_analysis
#> PMAnalysis:
#> Name: statistical_analysis
#> Path: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/my_research_project/analyses/statistical_analysis
#> Project: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/my_research_project
Using get_artifact to Find Data
The get_artifact() method allows us to find outputs from
other analyses without needing to know the exact file path or format.
This is one of the key features of the package.
Method 1: Searching With Explicit Analysis Name
If we know which analysis produced the artifact, we can specify it:
# Get the cleaned data from the preprocessing analysis
cleaned_artifact <- stats_analysis$get_artifact(
"cleaned_data",
analysis_name = "data_preprocessing"
)
cleaned_artifact
#> PMData:
#> ID: cleaned_data
#> Path: /private/var/folders/0t/mvk3x4hx0pl31l5lcl11krcc0000gn/T/Rtmpo71dkG/my_research_project/analyses/data_preprocessing/outputs/cleaned_data.parquet
Method 2: Search All Analyses (When Unique)
If the artifact ID is unique across all analyses (highly recommended in any case!), we can search all analyses by explicitly passing NULL:
# Since "cleaned_data" only exists in one analysis, we can search automatically
# Note: must explicitly pass NULL to search all analyses; default is current analysis
cleaned_artifact_auto <- stats_analysis$get_artifact("cleaned_data", analysis_name = NULL)
# Both methods return the same artifact
identical(cleaned_artifact$path, cleaned_artifact_auto$path)
#> [1] TRUE
Method 3: Get from Current Analysis (Default)
By default, get_artifact() searches in the current
analysis. This is useful when you want to get an artifact from the same
analysis:
# Get artifact from current analysis (default behavior)
# This would look for "cleaned_data" in "statistical_analysis"
# (In this example, it won't find it since it's in "data_preprocessing")
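For instance, a quick sketch of that default lookup (the exact failure mode is an assumption; pm might error or return nothing):
# Default scope is the current analysis ("statistical_analysis"), so this
# lookup of "cleaned_data" is expected to fail; try() keeps the error non-fatal
try(stats_analysis$get_artifact("cleaned_data"))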
Using get_artifact() on the Project Object
You can also call get_artifact() directly on the project
object (pm). The result is identical to calling it on an
analysis object:
# Use get_artifact() on the project object to find "cleaned_data"
cleaned_artifact_from_project <- pm$get_artifact(
"cleaned_data",
analysis_name = "data_preprocessing"
)
# Check that the artifact is the same as before
identical(cleaned_artifact$path, cleaned_artifact_from_project$path)
#> [1] TRUE
This is convenient if you want to access artifacts globally, without referencing a specific analysis instance.
Using the read() Method
The read() method automatically detects the file format
and reads it appropriately. You don’t need to know whether it’s a CSV,
Parquet, RDS, or any other supported format.
Note: For detailed information on supported file formats and type specifications, see the File Formats vignette.
# Read the artifact - no need to know the file format!
cleaned_data_loaded <- cleaned_artifact$read()
# The data is automatically loaded in the correct format
head(cleaned_data_loaded)
#> sample_id treatment measurement_1 measurement_2 measurement_3
#> 1 S001 Control 12.741917 18.60290 11.99628
#> 2 S002 Control 8.870604 18.13425 21.33511
#> 3 S003 Control 10.726257 11.99037 24.68530
#> 4 S004 Control 11.265725 20.54545 28.23816
#> 5 S005 Control 10.808537 12.99968 14.49255
#> 6 S006 Control 9.787751 15.31654 15.39658
#> measurement_1_norm measurement_2_norm measurement_3_norm total_measurement
#> 1 1.2852880 1.4250020 -1.9572635 43.34110
#> 2 -0.5734950 1.2522317 0.3383886 48.33997
#> 3 0.3174834 -1.0127756 1.1619263 47.40193
#> 4 0.5765053 2.1411438 2.0352826 60.04933
#> 5 0.3569895 -0.6406842 -1.3436351 38.30077
#> 6 -0.1331333 0.2134519 -1.1214097 40.50087
cat("Data loaded successfully with", nrow(cleaned_data_loaded), "rows\n")
#> Data loaded successfully with 99 rows
This is particularly powerful because:
- You don’t need to remember file paths
- You don’t need to know the file format
- You don’t need to look up how to read each file type every time
- If the format changes (e.g., from CSV to Parquet), your code still works, as the sketch below shows
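To make the last point concrete, here is a sketch reusing a lookup that appears later in this vignette; the call stays the same no matter which format is on disk:
# The same line works whether cleaned_data is stored as CSV, Parquet, or RDS:
# read() inspects the actual file, so callers never hardcode a format
cleaned <- stats_analysis$get_artifact("cleaned_data", analysis_name = NULL)$read()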
Performing Statistical Analysis
Now let’s perform some statistical analysis on the cleaned data:
# Perform t-test comparing treatments
control_data <- cleaned_data_loaded[
cleaned_data_loaded$treatment == "Control",
"total_measurement"
]
treatment_data <- cleaned_data_loaded[
cleaned_data_loaded$treatment == "Treatment",
"total_measurement"
]
t_test_result <- t.test(control_data, treatment_data)
# Create results summary
results_summary <- data.frame(
test = "t-test: Control vs Treatment",
statistic = t_test_result$statistic,
p_value = t_test_result$p.value,
control_mean = mean(control_data),
treatment_mean = mean(treatment_data),
control_sd = sd(control_data),
treatment_sd = sd(treatment_data)
)
# Save results
stats_analysis$get_output_path("statistical_results.csv")$write(results_summary)
print(results_summary)
#> test statistic p_value control_mean treatment_mean
#> t t-test: Control vs Treatment -0.6619564 0.5095986 44.31905 45.01555
#> control_sd treatment_sd
#> t    5.530649     4.913533
Understanding Intermediate vs Output Files
It’s important to understand the distinction between intermediate and output files:
- Intermediate files (intermediate/ folder): Temporary results used within an analysis, typically for caching expensive computations. These can be accessed from the current analysis using get_intermediate_artifact(id), but are not searchable from other analyses.
- Output files (outputs/ folder): Final results meant to be shared across analyses. These are findable via get_artifact() from any analysis.
Let’s see what’s available in each:
# List what's in outputs/ (searchable via get_artifact)
prep_outputs <- prep_analysis$list_outputs(intermediate = FALSE)
for (out in prep_outputs) {
cat(" -", out$id, ":", basename(out$path), "\n")
}
#> - cleaned_data : cleaned_data.parquet
# List what's in intermediate/ (NOT searchable via get_artifact)
prep_intermediates <- prep_analysis$list_outputs(intermediate = TRUE)
for (int in prep_intermediates) {
cat(" -", int$id, ":", basename(int$path), "\n")
}
#> - normalized_data : normalized_data.parquet
#> - summary_stats : summary_stats.parquet
How to Use Intermediate Results: Example
Intermediate results are useful for caching expensive or time-consuming steps, so you don’t need to recompute them every time. You can check whether an intermediate output already exists, and if so, load it; otherwise, compute it and save it for next time. Here’s how:
# Define the ID for your intermediate result
intermediate_id <- "filtered_data"
# Get the artifact (will find existing file or return path for new file)
filtered_artifact <- prep_analysis$get_intermediate_artifact(intermediate_id)
# If the intermediate result exists, load it; otherwise, compute and save it
if (filtered_artifact$exists()) {
# Load cached version
filtered_data <- filtered_artifact$read()
} else {
# Compute your intermediate result here
filtered_data <- subset(cleaned_data_loaded, total_measurement > 15)
# Save as intermediate result
filtered_artifact$write(filtered_data)
}
By using this pattern, you can make your analyses faster and more robust to change, reusing cached results whenever they are available.
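If you use this pattern in several places, it can be factored into a small helper. The wrapper below is a hypothetical convenience function, not part of pm; it composes only the methods already shown above:
# Hypothetical cache-or-compute helper (not part of pm); built from the
# get_intermediate_artifact()/exists()/read()/write() methods shown above
cached <- function(analysis, id, compute) {
  artifact <- analysis$get_intermediate_artifact(id)
  if (artifact$exists()) {
    artifact$read() # reuse the cached intermediate
  } else {
    result <- compute() # compute once
    artifact$write(result) # cache for next time
    result
  }
}
# Equivalent to the explicit if/else block above
filtered_data <- cached(prep_analysis, "filtered_data", function() {
  subset(cleaned_data_loaded, total_measurement > 15)
})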
Working with Multiple Artifacts
Let’s create another output in the preprocessing analysis and
demonstrate how get_artifact handles multiple
artifacts:
# Create a correlation matrix as another output
correlation_matrix <- cor(cleaned_data_loaded[,
c("measurement_1", "measurement_2", "measurement_3")
])
# Save as RDS (object format)
prep_analysis$get_output_path("correlation_matrix", type = "object")$write(correlation_matrix)
cat("Saved correlation matrix\n")
#> Saved correlation matrix
# Now we have multiple outputs in the preprocessing analysis
all_outputs <- prep_analysis$list_outputs(intermediate = FALSE)
cat("\nAll outputs in preprocessing analysis:\n")
#>
#> All outputs in preprocessing analysis:
for (out in all_outputs) {
cat(" -", out$id, ":", basename(out$path), "\n")
}
#> - cleaned_data : cleaned_data.parquet
#> - correlation_matrix : correlation_matrix.rdata
Reading Different File Formats
The read() method works seamlessly with different file
formats:
# Read the Parquet file (cleaned_data)
cleaned <- stats_analysis$get_artifact("cleaned_data", analysis_name = NULL)$read()
cat("Cleaned data type:", class(cleaned), "\n")
#> Cleaned data type: tbl_df tbl data.frame
# Read the RData file (correlation_matrix)
corr_matrix <- stats_analysis$get_artifact("correlation_matrix", analysis_name = NULL)$read()
cat("Correlation matrix type:", class(corr_matrix), "\n")
#> Correlation matrix type: environment
print(corr_matrix)
#> <environment: 0x11b8da6e8>
Summary
This vignette demonstrated:
- Project Creation: Using pm_create_project() to set up a structured project
- Input Configuration: Setting up inputs using the two-layer system (portable + local)
- Analysis Creation: Creating multiple analyses with create_analysis()
- Intermediate Results: Saving intermediate results using intermediate = TRUE
- Artifact Discovery: Using get_artifact() to find outputs from other analyses
  - With explicit analysis name
  - With automatic search when unique
- Automatic Reading: Using $read() to automatically load files without knowing their format
- Workflow Chaining: Creating analyses that depend on outputs from previous analyses
The key benefits of this approach:
- No hardcoded paths: Artifacts are discovered by ID, not file paths
- Format-agnostic: read() handles different file formats automatically
- Organized structure: Clear separation between intermediate and final outputs
- Reproducible: All inputs and outputs are tracked and discoverable
- Flexible: Easy to add new analyses that build on previous work
Next Steps
- Explore the PMProject and PMAnalysis classes for more methods
- Learn about input validation and schema checking
- Set up your own project structure for your research workflow
- Use the intermediate folder for caching expensive computations
Related Vignettes
For more detailed information on specific topics:
- Working with File Formats: Learn about the different file formats supported by pm, how to use get_output_path() with type specifications, and the power of format-agnostic reading with get_artifact() and read().
- Defining Project Inputs: Explore the different ways you can define project.yaml, from simple arrays to detailed object definitions with metadata.