Data Engineering Over Simplified Part 2 (S3)
Before getting into AWS Analytics services, in this section we will take care of prerequisites such as users, roles, and policies for S3.
Goal: We will not cover all the policies required for a typical project at this time. We will start by creating an S3 bucket, a group, a user, and a role using the AWS Web Console. Once the users and roles are created, we will set up the AWS CLI, configure it with a profile, and then validate whether we are able to read from the relevant S3 bucket. Before going further, ensure that you have a valid AWS account. I recommend using a personal account.
Let us get an overview of S3.
S3 stands for Simple Storage Service.
It is low-cost, cloud-based, durable storage that is accessible from anywhere, subject to permissions.
Create an S3 Bucket using AWS Web Console
It will be used to store GitHub Activity Data.
- Go to the AWS Web Console, navigate to S3, and create a new bucket named ittv-github. We can also create folders for the landing and raw zones of our data.
- The landing zone will be used to ingest data from external sources. We will store data in the landing zone as JSON. Data in the landing zone is typically transient; it acts as a scratch pad, and we can delete data older than 30 days, or as per the SLAs.
- The raw zone will be used to store data from sources following our data lake standards. In our case we will use Parquet as the target file format and partition all the data on a daily basis. In most data lakes, data is kept in the raw zone for up to 7 to 10 years.
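As a sketch, the same bucket and zone layout could also be created with the AWS CLI (the CLI setup comes later in this section; default credentials and the bucket name ittv-github are assumed):

```shell
# Create the bucket (bucket names are globally unique, so yours may need to differ)
aws s3 mb s3://ittv-github

# S3 has no real directories; zero-byte objects with a trailing slash
# act as "folder" placeholders for the landing and raw zones
aws s3api put-object --bucket ittv-github --key landing/
aws s3api put-object --bucket ittv-github --key raw/
```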
Overview of Roles
Let us also get an overview of AWS IAM Roles. They are typically associated with services such as EC2, EMR, and Glue to grant those services permissions on other services.
For a Data Engineering or Data Lake project on AWS, we might have to use multiple services such as Glue, Kinesis, EMR, and Athena. These services typically interact with other services, and the standard way to grant the required permissions is by creating an IAM Role. We can create roles using the IAM Web Console. Similar to users or groups, roles also inherit permissions via policies.
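For illustration, this is roughly how a role can be created from the CLI; the trust policy names which service (here Glue, purely as an example, with a hypothetical role name) is allowed to assume the role:

```shell
# Hypothetical role name; the trust policy lets AWS Glue assume this role
aws iam create-role \
  --role-name ITVGlueRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'
```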
[Instructions and Code] Create and Attach a Custom Policy
Let us create a custom policy and provide required permissions on the s3 bucket created earlier.
We grant permissions to roles, or to users via groups, by attaching policies to them. We can either use AWS predefined (managed) policies or create custom ones. In this case we will create a custom policy named ITVGitHubS3FullPolicy by adding a custom definition. For this, go to IAM → Policies → Create Policy.
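For reference, here is a minimal sketch of what the ITVGitHubS3FullPolicy definition might contain, created via the CLI instead of the console (this grants full S3 access to just the ittv-github bucket; tighten the allowed actions as needed):

```shell
aws iam create-policy \
  --policy-name ITVGitHubS3FullPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::ittv-github",
        "arn:aws:s3:::ittv-github/*"
      ]
    }]
  }'
```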
Let us attach a policy via group.
- Go to the ITVGitHubGroup and attach the policy ITVGitHubS3FullPolicy.
Once attached, users in the group, such as ITVGithubUser, inherit these policies.
Let us validate the ability to manage objects using AWS CLI
Let us validate the IAM policy by using the user we created and issuing an AWS S3 command via the CLI.
First, download and install the AWS CLI (see "Installing or updating the latest version of the AWS CLI" in the AWS Command Line Interface documentation on amazon.com).
Next, I opened my command prompt and typed `aws help` just to make sure the CLI is working. Next, we will configure a profile with the access key ID and secret access key of the user we created, so that AWS calls can reference this profile and use the credentials stored in it. So I type `aws configure --profile itvgithub`, enter the Access Key ID and Secret Access Key, and press Enter through the remaining prompts.
Now I type `aws s3 ls s3://ittv-github --profile itvgithub` and then, woohoo, we can see our objects. This means we have the permissions to manage this bucket using the CLI.
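Putting the profile setup and the validation together:

```shell
# One-time: store the user's credentials under a named profile
aws configure --profile itvgithub   # prompts for key ID, secret key, region, output format

# Validate: list the bucket using that profile's credentials
aws s3 ls s3://ittv-github --profile itvgithub
```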
Now let us create a bucket and put some folders and files in it. To do this you can visit my GitHub profile abhinayasridharrajaram/Data-Eng: Data Enf Concepts (github.com).
S3 Bucket: Points to Note
An S3 bucket name has to be globally unique, and the bucket is owned by the account that created it.
Upload all 6 folders from the retail_db dataset, which you might have set up earlier. Uploading files one at a time through the AWS Web Console is tedious; since S3 uses object storage (not file storage), you can just drag and drop all these folders into your S3 bucket, as shown in the screenshots below.
Next, let us enable versioning and set lifecycle rules so that older versions are managed (otherwise you will be paying for storage unnecessarily). Note that versioning can only be enabled at the bucket level.
Why do we need this? To recover recent versions if you accidentally delete your objects.
Go to your bucket → Properties → Bucket Versioning → Edit → Enable (see screenshot on the left below).
Now let us go to Management → set lifecycle rules to delete previous versions older than 3 days (see screenshot on the right). We want to limit the rule to objects with the prefix retaildb, so let's set all these rules.
To summarize versioning, here is what you can do:
Once versioning is enabled, we can go to Management and add lifecycle rules. We will add a basic lifecycle rule to delete previous versions older than 3 days.
1. Click on Management, then click on Create Lifecycle Rule. → Name: Archive Old Retail Files
2. We can filter data by prefix, tag, or both. In our case, we will filter by Prefix: enter retaildb. The prefix is simply the beginning of the object key.
3. Under Lifecycle rule actions, choose Permanently delete previous versions of objects. See above screenshot.
4. Enter 3 for the number of days after objects become previous versions. See above screenshot.
5. Click on Create rule to create the rule.
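The rule built in the console steps above can also be expressed as a lifecycle configuration document and applied with the CLI. This is a sketch (the bucket name rettail is assumed from the CLI section later on), and note that this call replaces any existing lifecycle configuration on the bucket:

```shell
aws s3api put-bucket-lifecycle-configuration \
  --bucket rettail \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "Archive Old Retail Files",
      "Filter": { "Prefix": "retaildb" },
      "Status": "Enabled",
      "NoncurrentVersionExpiration": { "NoncurrentDays": 3 }
    }]
  }'
```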
AWS S3 Cross-Region Replication for fault tolerance
Let us go through the details of Cross-Region Replication for an S3 bucket or objects within a bucket.
- In some extreme cases, S3 might not be accessible within a specific region due to unforeseen circumstances impacting data centers in an AWS Region or an Availability Zone within it.
- By enabling Cross-Region Replication we can keep a copy of the S3 bucket, or of objects within the bucket, in some other region.
Let’s enable Cross-Region Replication on our bucket rettail.
1) Login to AWS Web Console and go to S3 Management Console.
2) Create another bucket in another region, named rettail-copy. While creating it, make sure it is in any region other than N. Virginia, so that the replication is truly cross-region.
3) Create a role named AWSS3FullAccessRole with the AmazonS3FullAccess policy. For this, go to IAM → Roles → choose S3 → find AmazonS3FullAccess → create the role and name it AWSS3FullAccessRole.
4) Now we will go to the original bucket, rettail, and configure the replication rule. For this, go to Management → Replication rules → Create replication rule → Replication rule name: Retail replication → Status: Enabled → Choose a rule scope: Limit the scope of this rule using one or more filters → Prefix: rettail (all objects under the rettail prefix will be replicated)
5) Enter the destination bucket name: rettail-copy
6) Make sure versioning is enabled on both the source and destination buckets (replication requires it).
7) Make sure to choose AWSS3FullAccessRole and click on Save.
Note: only new objects whose keys match the prefix defined in the filter will be replicated. Existing objects will not be replicated.
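As a sketch, the same replication rule could also be applied from the CLI (the account ID in the role ARN is a placeholder):

```shell
aws s3api put-bucket-replication \
  --bucket rettail \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/AWSS3FullAccessRole",
    "Rules": [{
      "ID": "Retail replication",
      "Priority": 1,
      "Status": "Enabled",
      "Filter": { "Prefix": "rettail" },
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::rettail-copy" }
    }]
  }'
```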
Storage Class (Change if needed)
Let us get the details about the low-cost storage class called Glacier within S3.
- Glacier is a low-cost storage class within S3.
- We can use Glacier either to manage older versions or to keep replicas for backup.
- Here are the most common ways in which we can set the storage class to Glacier:
- Edit the storage class of an object or folder to use Glacier.
- Configure Glacier as part of lifecycle management to move older versions to Glacier.
- Configure Glacier as part of defining the Cross-Region Replication rule.
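For example, the lifecycle route could look like this sketch, transitioning noncurrent versions to Glacier after 30 days (the bucket name, rule ID, and day count are illustrative choices):

```shell
aws s3api put-bucket-lifecycle-configuration \
  --bucket rettail \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "NoncurrentVersionsToGlacier",
      "Filter": { "Prefix": "retaildb" },
      "Status": "Enabled",
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 30, "StorageClass": "GLACIER" }
      ]
    }]
  }'
```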
Using AWS CLI for S3
Just run `pip install awscli` if you haven't installed the CLI already.
To see what folders exist in your bucket, run `aws configure`, enter your credentials, and then type `aws s3 ls s3://rettail`; you will see the folder called retaildb. If output opens in a pager (as help pages do), press `q` to quit.
Similarly, if you type `aws s3 ls s3://rettail --recursive`, you can see all objects. To see object sizes and totals, add `--summarize` (typically together with `--recursive`). Note that the command `mb` is used to create a bucket and `rm` to remove objects. To learn more about a command, type, for example, `aws s3 mb help`.
Summary of Important commands under s3:
Listing objects and folders — ls
Copying files — cp
We can use cp to copy the files from the local file system to s3, s3 to the local file system as well as s3 to s3.
Moving objects or folders — mv
Deleting objects or folders — rm
Creating bucket — mb
Removing bucket — rb
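A quick sketch of each command in action (the local file and folder names are hypothetical):

```shell
aws s3 mb s3://rettail                                         # create a bucket
aws s3 ls s3://rettail/                                        # list top-level "folders"
aws s3 cp orders.csv s3://rettail/retaildb/orders/             # local -> S3
aws s3 cp s3://rettail/retaildb/orders/ . --recursive          # S3 -> local
aws s3 mv s3://rettail/tmp.csv s3://rettail/retaildb/tmp.csv   # move within S3
aws s3 rm s3://rettail/retaildb/ --recursive                   # delete a "folder"
aws s3 rb s3://rettail                                         # remove the (empty) bucket
```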
Conclusion: Perform the tasks below to gain comfort using the CLI to manage objects in S3.
- List the folders in the rettail bucket created earlier. It is recommended to list the objects recursively to review all the objects. Answer: `aws s3 ls s3://rettail --recursive`
- Delete the folders in the retaildb main folder from the bucket created earlier. Answer: `aws s3 rm s3://rettail/retaildb/ --recursive`
- Go to AWS Web Console and confirm that folders and objects in the retaildb folder within the rettail bucket are deleted.
- To copy a file from your local computer into S3, use the command below.
To copy a folder with all its files, type this into your PowerShell.
Here is how you can delete an entire folder (left screenshot) and a file (right screenshot).
Here is how you can copy the contents of your source folder (excluding some files) to the destination in S3.
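Since the screenshots are not reproduced here, a sketch of the equivalent commands (the file and folder names, and the exclude pattern, are hypothetical):

```shell
# Copy a single local file into S3
aws s3 cp orders.csv s3://rettail/retaildb/orders/orders.csv

# Copy a whole local folder, excluding some files from the source
aws s3 cp retail_db s3://rettail/retaildb/ --recursive --exclude "*.log"

# Delete a single file, then an entire folder
aws s3 rm s3://rettail/retaildb/orders/orders.csv
aws s3 rm s3://rettail/retaildb/ --recursive
```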