Data Engineering in AWS Over-Simplified Part-1
For all Data-Eng enthusiasts, welcome!!! By the end of all my articles, you will be able to build Data Engineering Pipelines using AWS Analytics Services such as Glue, EMR, Athena, Kinesis, Lambda.
We will do this by taking small steps each day, so in this article let us go over setting up a development env to interact with AWS first. If you are worried about the set-up, worry no more my friend, I will walk you through all the steps one by one from scratch.
In this article we will broadly cover two topics:
- Set Up Local Development Env for Windows for AWS
- Cloud9 IDE
Setting up Development Env
Step 1: AWS Account: Start by creating an account in AWS using this link. Please note: you may end up spending $50-$100, so exercise caution and do not leave any services running.
Step 2: Setup Development Environment for Windows: We will set up an Ubuntu-based virtual machine to interact with AWS. Just go to your search bar and type “PowerShell”. (FYI: this is a command shell for Windows. It’s awesome because it can SSH into any machine without installing additional software, so there is no need for PuTTY; you just supply the username, hostname, etc. Its commands differ from Linux commands, so remember not to mix them up.)
Now let us set up Ubuntu on Windows using something called WSL, the Windows Subsystem for Linux. (Note: if you are using your personal laptop, you should be able to complete the steps below successfully.)
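Type wsl in PowerShell. You can see that this is already available.
Now type wsl -l --online. You should be able to see the list of Linux distributions available to install.
Let us install Ubuntu 20.04 (the usual command for this is wsl --install -d Ubuntu-20.04).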
- What happened behind the scenes is that a virtual machine is now installed and WSL (Windows Subsystem for Linux) is now in place.
- What we need to do now is reboot so that the installation of the VM using the Ubuntu 20.04 distribution can complete. Just restart now.
- Once you log back in, you will see this screen, which is a WSL interface, a flavor of PowerShell. It will prompt you for a username & password. Set them up, and you will be created as a user on the Ubuntu 20.04 VM.
- Now if you type uname -a, you will see that this is nothing but a Linux-based operating system.
Side Note: If you accidentally log off, just open PowerShell and type “wsl” to connect back in. If your default is not Ubuntu 20.04, type “wsl -d Ubuntu-20.04”. To exit voluntarily, just type “exit”. To know the current working directory, just type “pwd”.
- Your home folder should look like this (Users/Yourname). To list all the files, type in “ls -ltr” (a Linux command to see files/folders) as shown below.
Step 3 → Setting up Python based components for building Python applications on top of Ubuntu
When you type cd, you navigate from the Windows home directory into the home directory of your actual user on the VM. See below. Once you type python or python3, you will launch the Python CLI (the interactive interpreter).
Woo hoo! Next, let us see if we have the Python virtual env module and the pip module. I tried to install a couple of things, but it looks like the package list needs to be updated before I install, so let’s do that.
I type in “sudo apt update” & then “sudo apt-get install python3-venv”. When prompted, press “Y” to continue. Wham!! Do the same for pip by typing in “sudo apt-get install python3-pip” and press “Y”. Double Wham!
We are all set to create a virtual environment and also install the necessary software using pip.
Let’s validate by creating a virtual env called “demo-venv” and checking our directory for it. Type in “python3 -m venv demo-venv” and “ls -ltr” to check for the same.
Checking in to see if pip is installed
Step 4: Setting up AWS CLI on Windows & Ubuntu using pip
Just type “pip install awscli”. We will then exit, log in again and check if AWS commands work. (The checking part is optional.)
To do this, see screenshot below:
Step 5: Creating IAM user & downloading credentials
Log in to the console → IAM → Users → Add User → Give it a name → Give it Programmatic access → Attach policy → Give Admin access (or any other) → Download CSV
Now, as shown above, enter “aws configure”, enter your access key ID & secret access key, and keep pressing Enter to accept the defaults for region and output format. This creates a hidden folder called .aws containing the config & credentials files. You can double-check this (optional) by entering “pwd” & “cat .aws/credentials”.
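If you would rather not print your secret key on screen with cat, here is a minimal sketch that only reports which profiles exist in the credentials file. It assumes the standard INI layout that “aws configure” writes (a [default] section with aws_access_key_id and aws_secret_access_key); the function names list_aws_profiles and profile_has_keys are just illustrative names I made up.

```python
import configparser
from pathlib import Path


def _resolve(credentials_path=None):
    # Default location used by `aws configure` on Linux/WSL: ~/.aws/credentials
    if credentials_path:
        return Path(credentials_path)
    return Path.home() / ".aws" / "credentials"


def list_aws_profiles(credentials_path=None):
    """Return the profile names found in the credentials file."""
    parser = configparser.ConfigParser()
    # read() quietly skips a missing file, so an empty list means
    # "nothing has been configured yet".
    parser.read(_resolve(credentials_path))
    return parser.sections()


def profile_has_keys(profile, credentials_path=None):
    """Check that a profile has both key entries, without printing them."""
    parser = configparser.ConfigParser()
    parser.read(_resolve(credentials_path))
    required = ("aws_access_key_id", "aws_secret_access_key")
    return parser.has_section(profile) and all(
        parser.has_option(profile, key) for key in required
    )
```

Calling list_aws_profiles() right after “aws configure” should give you a list containing “default”, and profile_has_keys("default") tells you whether both keys made it into the file.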
Step 6 → Setting Python Virtual Environment
Let us create a directory, cd into it and create a virtual environment by the name de-venv which will house everything needed, all the python related files that will help us have an isolated environment for this project.
- Just type “mkdir -p XYZ/whatever/whatever location”
- Then cd in by typing “cd” followed by the location you created before. You are in now.
- To create your own environment, type “python3 -m venv de-venv”.
Sidenote: To see everything this folder contains, type “find de-venv”. To see its top-level subfolders, type “ls -ltr de-venv”.
To activate the virtual environment, please type “source de-venv/bin/activate”.
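A quick way to confirm that the activation worked is to ask Python itself. This is a small sketch using only the standard library; the helper names in_virtualenv and describe_interpreter are mine, not part of any tool.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv.

    Inside a venv, sys.prefix points at the venv folder while
    sys.base_prefix still points at the system Python.
    """
    return sys.prefix != sys.base_prefix


def describe_interpreter():
    """Collect the details worth checking after `source .../bin/activate`."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": in_virtualenv(),
    }
```

If you run describe_interpreter() from python3 after activating de-venv, the executable path should sit inside the de-venv folder.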
Step 7: Install Python based SDK to interact with AWS
This is nothing but boto3. Just type “pip install boto3”.
Now we can actually use Python to interact with AWS. See below.
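As a first sanity check, one option is to ask AWS who you are, which confirms that boto3 can find the credentials from “aws configure”. This is only a sketch: whoami and make_sts_client are hypothetical helper names, and boto3 is imported lazily so that whoami can be tried with any object that has a get_caller_identity method.

```python
def make_sts_client():
    """Build a real STS client from the default credentials."""
    import boto3  # lazy import: only needed when talking to AWS for real

    return boto3.client("sts")


def whoami(sts_client):
    """Return the account ID and ARN of the caller."""
    identity = sts_client.get_caller_identity()
    return {"account": identity["Account"], "arn": identity["Arn"]}
```

With working credentials, whoami(make_sts_client()) should return the ARN of the IAM user you created in Step 5.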
Step 8: Setting up Jupyter Lab
We have fully set up boto3 as a part of this virtual environment. It is time to install Jupyter. Just type “pip install jupyterlab” and then type “ jupyter lab”, copy the URL that appears and view Jupyter through your browser.
See the above highlighted red link, you should be able to view this in the browser.
Now let us see if boto3 is accessible from the notebook. Create an object by the name s3_client for that. Here is some code that uses boto3 to get details on buckets.
Overall, if you want to interact with AWS services using boto3 with Python as the programming language, you need to create a client object for that service. Using that client object, you can manage the service or get the information you want. Behind the scenes, the credentials under the default profile are being used. That is why we did the AWS CLI steps above.
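The client pattern described above can be sketched roughly like this. It is not necessarily the exact notebook code from the screenshot; make_s3_client and list_bucket_names are names chosen for illustration, and passing profile_name=None is my way of making the "default profile" behaviour explicit.

```python
def make_s3_client(profile_name=None):
    """Create an S3 client; profile_name=None falls back to the default profile."""
    import boto3  # lazy import so list_bucket_names can be tried without AWS access

    session = boto3.Session(profile_name=profile_name)
    return session.client("s3")


def list_bucket_names(s3_client):
    """Return the names of all buckets visible to these credentials."""
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response.get("Buckets", [])]
```

In a notebook cell you would write s3_client = make_s3_client() and then list_bucket_names(s3_client). A brand-new account simply gives back an empty list.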
Setting Up Cloud9
This is the official AWS IDE. You will be able to set up AWS-related applications effectively.
Step1: Setting up a User → Set up a User Group called ICloud9 & give it the Admin access policy. (Go to User Groups as shown below.) Similarly, create a user & give it both Console & programmatic access.
Now go to Dashboard, select URL, open a new window and login using ICloud9 credentials. Creating Cloud9 instances using IAM account is recommended as opposed to Root Accounts. See below
Step2 Create a new env →Once you login using your Cloud9, you will automatically see “Create environment” option on your right. Click on that and lets begin setting it up. I am going to keep the default settings as-is, please refer below.
You will see the IDE after a few moments. To check if you have git, just type “git”. For Docker, type “sudo systemctl status docker” to see if Docker is running. There aren’t any containers running yet, but if you want to check you can type “docker ps”; similarly, “docker images” shows the images. To check if you have Python, just type “python” & then exit().
For Java you can type “java -version”, and “javac -version” for the JDK. You can validate that this user can access S3 by typing in “aws s3 ls”.
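If typing each check by hand gets tedious, the same "is it installed?" question can be asked from Python in one go. A minimal sketch using the standard library; find_tools is a made-up helper name and the default tool list just mirrors the commands mentioned above.

```python
import shutil


def find_tools(names=("git", "docker", "python3", "java", "javac", "aws")):
    """Map each tool name to its path on PATH, or None when it is missing."""
    return {name: shutil.which(name) for name in names}
```

Any tool that maps to None is not on the PATH of the Cloud9 terminal. Note that this only tells you a program is installed, not that a service such as Docker is actually running.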
Note: How to check your EC2? To see the EC2 instance running behind the “aws demo” env you just created, click on your env, aws demo.
Accessing Web applications using Public DNS
1) Opening ports → Go to EC2 → Scroll down to Security Group →Add Inbound rules →HTTP →My IP. Think of Security Group as a firewall within AWS ecosystem.
Also Note: The auto-assigned public IP address associated with Amazon Elastic Compute Cloud (Amazon EC2) instance changes every time you stop and start the instance. (You can see these changes if you go to Instance, instance id and scroll below)
An Elastic IP address is a static public IPv4 address associated with your AWS account in a specific Region. Unlike an auto-assigned public IP address, an Elastic IP address is preserved after you stop and start your instance in a virtual private cloud (VPC). We will be starting & stopping the Cloud9 env frequently through this course, so let’s not worry about the changing IP addresses and instead perform the steps below.
You can associate an Elastic IP address with your EC2 instance at any time by going to EC2 →
Step1) Network & Security → Elastic IPs → Keep clicking through all the options & finally give it a name in the name section below. (I have given it the already discussed name “aws demo”.)
Step2) Next go to Actions → Associate Elastic IP address → When you click on Instance in the screenshot below, you will automatically see a prompt for your instance; just choose that and click through Associate.
Step3) Go to Cloud9 terminal, check the status of httpd server by typing in “sudo systemctl status httpd”, you can see it is dead. To start it type “sudo systemctl start httpd”
Step 4) Copy the Public DNS, go to your browser, type http:// and paste your ec2… address. Voila!
You can see that the web application is accessible using the public IP or DNS.
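The Elastic IP part of the steps above (allocate, then associate) can also be scripted with boto3 instead of clicking through the console. This is a rough sketch, not what the console does internally; attach_elastic_ip is a hypothetical helper, and the instance ID in the usage note is a placeholder.

```python
def attach_elastic_ip(ec2_client, instance_id):
    """Allocate a new Elastic IP in the VPC scope and attach it to an instance.

    Returns the public IP address that was allocated.
    """
    allocation = ec2_client.allocate_address(Domain="vpc")
    ec2_client.associate_address(
        InstanceId=instance_id,
        AllocationId=allocation["AllocationId"],
    )
    return allocation["PublicIp"]
```

You would call it as attach_elastic_ip(boto3.client("ec2"), "i-..."), using the instance ID of the EC2 instance behind your Cloud9 env. Remember the cost warning from Step 1 and release addresses you no longer need.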
Here are some more pointers on how to use Python
We will begin by creating a directory using “mkdir delab”, followed by changing directory to delab(by typing cd delab). Let us now create a virtual environment within python by typing “python3 -m venv delab-venv”(virtual env is called delab-venv) .
We can now activate it using “source delab-venv/bin/activate” & do “pip install jupyterlab”. The next step is to start the JupyterLab server & connect to it.
For this, type “jupyter lab --ip 0.0.0.0” and you will see that it is running on port 8888, as that will come up in one of the messages.
To allow communication from your browser to your EC2 instance over port 8888, which is the port Jupyter is listening on in the VM, we now go to EC2 → Instance ID → scroll down to Security Group → add Inbound Rule, and then set Source to My IP & port range = 8888. (By default these VMs & ports are not accessible unless you add the rule in the Security Group.) And while you are editing the inbound rules, also add rules allowing your laptop’s IP to connect to the EC2 instance over ports 22 (SSH) & 80 (HTTP).
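For reference, here is how the same three inbound rules could be expressed with boto3. Treat it as a sketch under assumptions: build_ingress_rules and open_ports are names I invented, the security group ID is whatever your instance uses, and the /32 suffix mirrors what the console’s “My IP” option fills in for a single address.

```python
import ipaddress


def build_ingress_rules(my_ip, ports=(22, 80, 8888)):
    """Build one TCP rule per port, each limited to a single source IP."""
    # Raises ValueError if my_ip is not a valid IPv4 address.
    cidr = f"{ipaddress.IPv4Address(my_ip)}/32"
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in ports
    ]


def open_ports(ec2_client, security_group_id, my_ip):
    """Add the inbound rules to the given security group."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=build_ingress_rules(my_ip),
    )
```

Keeping the source limited to your own IP, rather than opening the ports to everyone, is the same choice the console steps above make with My IP.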
Next, go to EC2 → Instance ID → copy the Public IPv4 address, go to your browser & type http:// followed by the address you copied, then add “:8888” and press Enter.
Voila! We can now access Jupyter lab.
You can see how it is asking for a password, this is available in your terminal in the token section already, just paste it from there. See below.
So, to summarize Cloud9 is just an IDE on top of EC2 instance. We have also associated elastic IP with the EC2 instance and this will help us connect to applications that are running on this EC2 instance that is associated with this Cloud9 environment using the same IP address or DNS alias all the time.
Another way to connect to the terminal is to SSH.
There should be a directory called .ssh in your home directory. Type pwd to see the existing path, and use ls to view the contents of the directory. The file that is of interest to us is authorized_keys & it can be found under your home path/.ssh.
To SSH into this EC2 instance from your laptop, we need port 22 to be added in the inbound rules of the VM’s Security Group, which we have already done in the previous step. Next we have to update the authorized_keys file with a public key that is generated on your laptop. In order to SSH, we are going to use the public key/private key mechanism. Let us now create a key pair in PowerShell using the ssh-keygen command. Make sure to provide the fully qualified path for the key files, press Enter a couple of times to skip creating a passphrase, and woohoo!
Now you can view the key pair (private & public) by listing the contents under your laptop’s home path/.ssh.
Now open the public key file & copy its contents, which need to be pasted into the authorized_keys file on the EC2 instance.
Copy from here
Now go to your Cloud9 and type the below. We are going to edit the authorized_keys file and add the public key generated on the laptop.
Scroll down using the down arrow key until you reach the last line of the file and move the cursor to the end of that line. Press the character “a” to switch to insert mode and hit Enter to start a new line.
If the character “#” gets added automatically, delete it and then paste the contents of the public key (below the #Add any additional key section) by right-clicking and choosing the paste option (this is what you copied above).
To save the file, hit the Escape key and type “:wq” right there.
You can validate the updates by running the following command: tail -n 1 authorized_keys
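A common slip here is pasting the private key, or only half of the public key line. Below is a small sketch that checks whether a line looks like an OpenSSH public key before you trust it. It assumes a plain line of the form "key-type base64-data optional-comment" with no options in front, and looks_like_public_key is a name I made up.

```python
import base64
import struct


def looks_like_public_key(line):
    """Rough structural check of one authorized_keys line.

    The base64 part of an OpenSSH public key starts with a 4-byte
    big-endian length followed by the key type, which should match the
    first word on the line (for example ssh-rsa or ssh-ed25519).
    """
    parts = line.strip().split()
    if len(parts) < 2:
        return False
    key_type, blob_b64 = parts[0], parts[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except ValueError:  # covers invalid base64 and bad padding
        return False
    if len(blob) < 4:
        return False
    (name_len,) = struct.unpack(">I", blob[:4])
    return blob[4:4 + name_len] == key_type.encode()
```

This only checks the shape of the line. It does not prove that the key matches the private key on your laptop, so the real test is still the ssh login.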
Go back to PowerShell and type the command below to SSH into the EC2 instance. You will need the Public IPv4 DNS or IP in the command, which can be gathered from the AWS console (found under the networking tab, under Instances, by selecting your EC2 instance).
ssh -i [name of the private key] username@[EC2 public IPv4 DNS or IP]
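If you prefer not to hunt for the address in the console, it can be looked up with boto3 and dropped into the ssh command. A minimal sketch: get_public_host and build_ssh_command are hypothetical helpers, and the key path, username and instance ID you pass in are your own values.

```python
def get_public_host(ec2_client, instance_id):
    """Return the instance's public DNS name, or its public IP as a fallback."""
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            # A stopped instance may have neither value; then this is None.
            return instance.get("PublicDnsName") or instance.get("PublicIpAddress")
    return None


def build_ssh_command(private_key_path, username, host):
    """Assemble the ssh command as an argument list."""
    return ["ssh", "-i", private_key_path, f"{username}@{host}"]
```

Joining the list with spaces gives you exactly the command shown above, ready to paste into PowerShell.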