Before training a machine learning model with Amazon SageMaker, we need to verify and organize the data that Lambda pre-processed. This step is important to ensure the training process is accurate and efficient.
Data is the “fuel” of the ML model: the quality and structure of the input data directly affect the accuracy of the trained model and the efficiency of the training job.
Go to Amazon S3 in the AWS Management Console to verify the input data from Lambda:
Select the bucket you created in section 3 – Create S3 Bucket for Data Storage.
Open the processed/ folder – this is where Lambda saved the pre-processed data.
Check if the CSV or Parquet file exists (e.g. data_processed.csv).
📸 Example folder structure:
ml-pipeline-bucket/
├─ raw/
│ └─ data.csv
└─ processed/
└─ data_processed.csv
💡 The data in the processed/ folder is the input that SageMaker uses in the next model training step.
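If you prefer to verify this from code rather than the console, the following is a minimal sketch using boto3. It assumes the example bucket name ml-pipeline-bucket from above and default AWS credentials; substitute your own bucket name.

```python
import boto3

s3 = boto3.client("s3")

# List everything Lambda wrote under the processed/ prefix.
response = s3.list_objects_v2(Bucket="ml-pipeline-bucket", Prefix="processed/")

# Print each object key and its size in bytes to confirm the file exists.
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```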
SageMaker expects the input data to be in a specific folder in S3, for example:
s3://ml-pipeline-bucket/processed/train/
s3://ml-pipeline-bucket/processed/validation/
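These prefixes are what you will point the training job at. Below is a minimal sketch using the SageMaker Python SDK; the estimator itself is configured in the next step, and content_type="text/csv" assumes CSV input.

```python
from sagemaker.inputs import TrainingInput

# Each S3 prefix becomes a named input "channel" for the training job.
train_input = TrainingInput(
    "s3://ml-pipeline-bucket/processed/train/", content_type="text/csv"
)
validation_input = TrainingInput(
    "s3://ml-pipeline-bucket/processed/validation/", content_type="text/csv"
)

# Passed to the estimator in the next step, e.g.:
# estimator.fit({"train": train_input, "validation": validation_input})
```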
You can organize the data as follows:
📌 For example:
processed/
├─ train/
│ └─ train.csv
└─ validation/
└─ val.csv
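One way to produce this layout is to split the processed file locally with pandas and upload the two parts. The sketch below assumes the example bucket and file names from above; the 80/20 split ratio is illustrative, not required by SageMaker.

```python
import boto3
import pandas as pd

bucket = "ml-pipeline-bucket"  # the bucket from section 3
s3 = boto3.client("s3")

# Download the file Lambda produced and load it.
s3.download_file(bucket, "processed/data_processed.csv", "data_processed.csv")
df = pd.read_csv("data_processed.csv")

# Shuffle and split: 80% training, 20% validation (illustrative ratio).
train_df = df.sample(frac=0.8, random_state=42)
val_df = df.drop(train_df.index)

# Write the two parts back under the prefixes SageMaker will read from.
train_df.to_csv("train.csv", index=False)
val_df.to_csv("val.csv", index=False)
s3.upload_file("train.csv", bucket, "processed/train/train.csv")
s3.upload_file("val.csv", bucket, "processed/validation/val.csv")
```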

Ensure the SageMaker IAM Role has access to the bucket containing the data:
s3:GetObject
s3:ListBucket
s3:PutObject (if you need to write results back)
⚠️ If SageMaker does not have permission to read data from the S3 bucket, the training job will fail immediately.
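One way to grant these permissions is an inline policy on the execution role. The sketch below assumes a role named SageMakerExecutionRole (substitute the execution role you attach to the training job) and the example bucket name.

```python
import json
import boto3

iam = boto3.client("iam")

# Grant object-level read/write on the data, plus bucket listing.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::ml-pipeline-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::ml-pipeline-bucket",
        },
    ],
}

# Attach the policy inline to the SageMaker execution role (assumed name).
iam.put_role_policy(
    RoleName="SageMakerExecutionRole",
    PolicyName="S3DataAccess",
    PolicyDocument=json.dumps(policy),
)
```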
SageMaker supports CSV, Parquet, or RecordIO formats.
If using CSV, ensure the following (a quick check script follows the example file below):
There is a header row describing the columns. (Note: some SageMaker built-in algorithms, such as XGBoost, instead expect CSV input with no header row and the label in the first column; check the documentation for the algorithm you use.)
There are no null values or formatting errors.
Numeric features have been normalized (if needed).
Example of a standard train.csv file:
feature1,feature2,feature3,label
0.21,0.75,0.11,1
0.56,0.22,0.65,0
0.34,0.12,0.88,1
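You can verify these conditions quickly with pandas. This sketch assumes a local train.csv in the format above and uses min-max scaling as one example of normalization.

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Fail fast if any null values slipped through pre-processing.
assert not df.isnull().values.any(), "train.csv contains null values"

# Min-max scale the feature columns to [0, 1]; the label column stays as-is.
# (A constant column would divide by zero here; drop or handle such columns first.)
features = df.columns.drop("label")
df[features] = (df[features] - df[features].min()) / (
    df[features].max() - df[features].min()
)

print(df.describe())
```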

You have completed preparing the input data for SageMaker and are ready to move on to training the model.