Running Jupyter Notebooks with Amazon SageMaker


The primary services for periodic, automated execution of an ML workflow within the SageMaker ecosystem are SageMaker Pipelines or, for simpler scheduling, a combination of Amazon EventBridge (the scheduler) and a SageMaker Processing Job or Training Job (the executor).

Here is a breakdown of the three best ways to achieve this, from simplest to most robust:


1. ⚙️ Simplest Approach: EventBridge + Training/Processing Jobs

This is the easiest way to schedule a single script execution.

  • SageMaker Training Job: Used if your Python script’s primary function is to train an ML model (i.e., it defines a model, loads data, and runs a fitting process).
  • SageMaker Processing Job: Used if your Python script’s primary function is data preparation, feature engineering, or model evaluation (tasks that don’t involve training/fitting a model).
  • Amazon EventBridge (formerly CloudWatch Events): This acts as the scheduler.

How it works:

  1. Package Your Script: Ensure your notebook’s Python code is saved as a clean Python script (a .py file), e.g., with jupyter nbconvert --to script notebook.ipynb.
  2. Upload to S3: Upload your script and any required data to an Amazon S3 bucket.
  3. Create an EventBridge Rule:
    • Set the Schedule: Use a cron expression (e.g., cron(0 12 * * ? *) for noon UTC daily) or a fixed rate (e.g., rate(1 day)).
    • Set the Target: Point the rule to invoke the SageMaker service, specifically starting a Training Job or Processing Job.
    • Pass Parameters: In the EventBridge rule’s input transformer, you specify the necessary parameters for the SageMaker job, such as the path to your script in S3, the instance type, and the output location.
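The steps above can be sketched with Boto3. This is a minimal sketch, not a complete setup: the S3 URIs, role ARN, and image URI below are placeholders you would replace with your own values, and the actual EventBridge Scheduler call is shown commented out because it requires AWS credentials and an existing IAM role.

```python
import json

# Hypothetical placeholders -- substitute your own values.
SCRIPT_S3_URI = "s3://my-bucket/code/sourcedir.tar.gz"
ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerExecRole"
OUTPUT_S3_URI = "s3://my-bucket/output/"

def training_job_request(job_name: str) -> dict:
    """Build the CreateTrainingJob payload that the schedule passes to SageMaker."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # A framework container image; the exact URI depends on region/version.
            "TrainingImage": "<framework-image-uri>",
            "TrainingInputMode": "File",
        },
        "RoleArn": ROLE_ARN,
        "OutputDataConfig": {"S3OutputPath": OUTPUT_S3_URI},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

payload = training_job_request("nightly-training-job")

# With EventBridge Scheduler, this payload becomes the schedule's Target input:
# boto3.client("scheduler").create_schedule(
#     Name="nightly-train",
#     ScheduleExpression="cron(0 12 * * ? *)",   # noon UTC daily, as above
#     FlexibleTimeWindow={"Mode": "OFF"},
#     Target={
#         "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createTrainingJob",
#         "RoleArn": ROLE_ARN,
#         "Input": json.dumps(payload),
#     },
# )
```

The schedule role (RoleArn on the Target) must be assumable by EventBridge Scheduler and allowed to call sagemaker:CreateTrainingJob; the job’s own RoleArn is the SageMaker execution role.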

2. 🧱 Best for MLOps: SageMaker Pipelines

For full MLOps automation, repeatability, and tracking, SageMaker Pipelines is the recommended service. It allows you to define a multi-step workflow.

How it works:

  1. Define the Pipeline: You define your workflow (e.g., Data Prep → Training → Model Evaluation → Conditional Deployment) programmatically using the SageMaker Python SDK. Each step is a SageMaker construct (e.g., a ProcessingStep for data prep, a TrainingStep for model training).
  2. Upload the Definition: The compiled pipeline definition is uploaded to SageMaker.
  3. Schedule the Pipeline: You use the exact same scheduling method as above: Amazon EventBridge.
    • Set the Schedule (cron/rate).
    • Set the Target: Point the rule to the SageMaker service and specify the action to StartPipelineExecution for your defined pipeline.
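The EventBridge target for step 3 can be sketched as follows. This is an illustrative sketch only: the pipeline name, role ARN, account ID, and the InputDataS3Uri pipeline parameter are assumed placeholders, and the put_rule/put_targets calls are commented out because they require AWS credentials.

```python
import json

PIPELINE_NAME = "my-ml-pipeline"  # hypothetical pipeline name
ROLE_ARN = "arn:aws:iam::123456789012:role/EventBridgeInvokeRole"  # assumed

def pipeline_target(pipeline_arn: str) -> dict:
    """EventBridge rule target that starts a SageMaker pipeline execution.
    PipelineParameterList overrides the pipeline's parameters per run."""
    return {
        "Arn": pipeline_arn,
        "RoleArn": ROLE_ARN,
        "SageMakerPipelineParameters": {
            "PipelineParameterList": [
                {"Name": "InputDataS3Uri", "Value": "s3://my-bucket/daily/"},
            ]
        },
    }

target = pipeline_target(
    f"arn:aws:sagemaker:us-east-1:123456789012:pipeline/{PIPELINE_NAME}"
)

# Attaching the target to a scheduled rule (not executed here):
# events = boto3.client("events")
# events.put_rule(Name="daily-pipeline", ScheduleExpression="rate(1 day)")
# events.put_targets(Rule="daily-pipeline", Targets=[{"Id": "1", **target}])
```

The rule’s RoleArn must permit sagemaker:StartPipelineExecution on the pipeline; EventBridge then starts an execution on every schedule tick.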

Benefit: This provides a repeatable, traceable workflow where you can track artifacts (models, datasets) and easily revert to previous runs.


3. 🌐 Alternative: Lambda + SageMaker

You can use a simple AWS Lambda function as an intermediary to give you more control.

How it works:

  1. Schedule Lambda: Use Amazon EventBridge to schedule the AWS Lambda function periodically.
  2. Lambda’s Role: The Lambda function, written in Python, uses the AWS SDK for Python (Boto3) to call the SageMaker API programmatically — e.g., boto3.client("sagemaker").create_training_job() or create_processing_job().
  3. Start Job: The Lambda execution triggers the creation and start of the desired SageMaker job.

Benefit: This gives you maximum flexibility to perform pre-run checks, dynamic parameter selection, or complex logging before launching the SageMaker job.
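A minimal handler sketch under assumed values: the role ARN, image URI, and script path are placeholders, and the create_processing_job call itself is commented out since it only succeeds inside an AWS environment with the right permissions. The job name is timestamped so each scheduled run is unique, which is one of the dynamic-parameter tricks this pattern enables.

```python
import json
import time

ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerExecRole"  # assumed role

def build_processing_request(job_name: str) -> dict:
    """CreateProcessingJob payload; image URI and script path are placeholders."""
    return {
        "ProcessingJobName": job_name,
        "RoleArn": ROLE_ARN,
        "AppSpecification": {
            "ImageUri": "<processing-image-uri>",
            "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/prep.py"],
        },
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
    }

def lambda_handler(event, context):
    # Pre-run checks or dynamic parameter selection can happen here
    # before the SageMaker API call is made.
    job_name = f"daily-prep-{int(time.time())}"
    request = build_processing_request(job_name)
    # sm = boto3.client("sagemaker")      # uncomment when running in Lambda
    # sm.create_processing_job(**request)
    return {"statusCode": 200, "body": json.dumps({"job": job_name})}

result = lambda_handler({}, None)
```

EventBridge invokes lambda_handler on the schedule; the function returns immediately after starting the job, since SageMaker runs it asynchronously.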
