Extracting Data From the YouTube API

Published in

Python in Plain English

9 min read · Apr 4, 2023

In this article, I will show you the libraries and techniques you can use when working with the YouTube API.

Task: API on Public Information

Let us calculate the duration of each video in the playlist mentioned below (on student clubs) from the Google Developers channel, and then the total duration of all the videos. The playlist has about 61 videos.

Things you need to Know

You need an API key → The API key associates the request with a Google Cloud project for billing and quota purposes. Because API keys do not identify the caller, they are often used for accessing public data or resources. When calling APIs that do not access private user data, a simple API key is enough; it authenticates your application for accounting purposes. The Google Developers Console documentation also describes API keys.
You need the requests library → We can make the call through the requests library by passing the URL of the API to the get() method like this.

import requests
response = requests.get(url).json()

Notice how we added the json() method to parse the body of the response as JSON. A JSON structure typically consists of attribute-value pairs (objects) and arrays.
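To make the "attribute-value pair or array" idea concrete, here is a small sketch that parses a JSON string with the standard json module. The JSON text is made up for illustration and only mimics the general shape of an API response; the parsed result is the same kind of object that response.json() hands back.

```python
import json

# Made-up JSON text (not real API output): an object of attribute-value
# pairs, one of which holds an array of smaller objects.
raw = '{"kind": "example#listResponse", "items": [{"id": "abc"}, {"id": "def"}]}'

data = json.loads(raw)  # a JSON object becomes a Python dict
kind = data["kind"]     # attribute-value pair
ids = [item["id"] for item in data["items"]]  # JSON array becomes a list

print(type(data).__name__)
print(kind)
print(ids)
```

Objects map to Python dictionaries and arrays map to lists, which is why the later steps index into the response with keys such as items.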

Prerequisites

Generating an API Key → Go to this link, click on “Create Project” and give it a name. Now enable “YouTube Data API v3” by searching for it in the console. You will be prompted to associate it with a project; go ahead and choose your project. Next, go to Credentials, choose the options below, and then click “Next”. The console then shows the generated API key, a string beginning with “AIza”. Keep this key private.

Carefully read up → Browse the pages starting at https://googleapis.github.io/google-api-python-client/docs/dyn/youtube_v3.html to read about the instance methods available on the YouTube service.
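Once a key has been generated, one possible way to keep it out of the source code is to read it from an environment variable. This is only a sketch: the variable name YOUTUBE_API_KEY and the helper load_api_key are assumptions made for this example, not something the API or the client library requires.

```python
import os


def load_api_key(var_name="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice for this example.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

With the variable set in your shell, api_key = load_api_key() could stand in for a hard-coded key string in the snippets that follow.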

Step 1 → Connecting to the service & getting a response

To learn more about how we can build an API-specific service object, make calls to the service, and process the response, please scroll to the “Building & Calling service” section here.

We will connect to the service using the build function, which takes the API name, the version, and the API key and creates a service object.

The channels() method returns the channels resource for the API call. Let us take a look at its list method, which returns the channels that match the request. part is a required parameter of list, and the documentation shows which strings can be passed to it. I am going to pass statistics (along with contentDetails) to see how the channel has fared.

Reference: Channels: list | YouTube Data API | Google Developers (GET https://www.googleapis.com/youtube/v3/channels)

from googleapiclient.discovery import build
api_key = 'YOUR_API_KEY'  # replace with your own key

service = build('youtube','v3',developerKey = api_key)
request = service.channels().list(part = 'contentDetails, statistics', forUsername= 'GoogleDevelopers')
response = request.execute()
print(response)

Not surprised to see that the Google Developers channel has so many subscribers! Let's grab the channel ID from the response (see screenshot above).
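Instead of reading the channel ID off the printed output, it can also be pulled out in code. The sketch below assumes the response is nested as items → id and items → statistics → subscriberCount; summarize_channel is a hypothetical helper, and the numbers in the sample dictionary are made up (only the channel ID is the one used later in this article).

```python
def summarize_channel(response):
    """Pull the channel ID and subscriber count out of a channels().list response."""
    channel = response["items"][0]
    return {
        "channel_id": channel["id"],
        # Counts are assumed to arrive as strings, so convert them to int.
        "subscribers": int(channel["statistics"]["subscriberCount"]),
    }


# Trimmed, made-up response with the assumed nesting.
sample_response = {
    "items": [
        {
            "id": "UC_x5XG1OV2P6uZZ5FSM9Ttw",
            "statistics": {"subscriberCount": "1000", "videoCount": "10"},
        }
    ]
}

print(summarize_channel(sample_response))
```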

Step 2 → Getting the Channel ID → Inserting the Channel ID into the code to get the Playlist ID

Within the API, we can retrieve one or more playlists through the unique ID of the channel that owns them.
Now that we have the channel ID, let us get to the playlists. To get there, I referred to the documentation page here (see screenshot below) and clicked on the link. I looked up the associated Python code and adapted it to get the playlists.

# Code to get playlist id using Channel ID
from googleapiclient.discovery import build
api_key = 'YOUR_API_KEY'  # replace with your own key

service = build('youtube','v3',developerKey = api_key)
request = service.playlists().list(part="snippet,contentDetails",channelId="UC_x5XG1OV2P6uZZ5FSM9Ttw",maxResults=25)
response = request.execute()
print(response)

This returns the snippet and contentDetails of each playlist. Looking at the output, we can see that:

→ The output is a dictionary.

→ The playlists are a list of values under the items key.

→ In the third row of the screenshot, each playlist has an id that starts with “PL”, but there is no duration. I am interested in the first playlist ID, the one related to student clubs. So I will use this playlist ID to grab the video IDs of all the videos in that playlist. The video ID is our final destination, since knowing the video_id will reveal the duration of the video.
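Picking a playlist by eye works for a handful of results; a small helper can do the same by matching on the title. This is a sketch that assumes each entry in items carries its title under snippet → title. The function name find_playlist_ids and the sample titles and IDs are made up for illustration.

```python
def find_playlist_ids(response, keyword):
    """Return the IDs of playlists whose title contains keyword (case-insensitive)."""
    return [
        item["id"]
        for item in response.get("items", [])
        if keyword.lower() in item["snippet"]["title"].lower()
    ]


# Made-up playlists with the assumed nesting.
sample_playlists = {
    "items": [
        {"id": "PL_example_1", "snippet": {"title": "Student Clubs Highlights"}},
        {"id": "PL_example_2", "snippet": {"title": "Cloud Tutorials"}},
    ]
}

print(find_playlist_ids(sample_playlists, "student clubs"))
```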

Step 3 → Getting the Video ID using the Playlist ID

1. I use the playlist ID from above and rewrite the code to find the video_id. The main change is that the code now requests the “playlist items” instead of the “playlists”. You can confirm that this playlist has 61 videos from the last line of the screenshot below.

# Changed the code to request playlist items (instead of playlists) to get video_id
from googleapiclient.discovery import build
api_key = 'YOUR_API_KEY'  # replace with your own key

service = build('youtube','v3',developerKey = api_key)
request_pl = service.playlistItems().list(part="contentDetails",playlistId='PLOU2XLYxmsIL5MoZ5LrrxfVk3V04evsMm' )
response_pl = request_pl.execute()

print(response_pl)

2. Looping through “items” gives us all kinds of fields such as kind, etag, etc. Since we are only interested in the videoId, let's pull only that.

Woohoo, we got the video IDs of the first five videos of the playlist. We will get the remaining videos in a later section.

3. To restrict the number of calls to the API, let me store all the video IDs and make a single call.
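Steps 2 and 3 can be sketched as follows: collect the videoId of every item, then join the IDs with commas so they can be passed together in one request. The helper name extract_video_ids and the sample IDs are made up; the nesting items → contentDetails → videoId is the one used throughout this article.

```python
def extract_video_ids(playlist_items_response):
    """Collect the videoId of every item in a playlistItems().list response."""
    return [
        item["contentDetails"]["videoId"]
        for item in playlist_items_response.get("items", [])
    ]


# Made-up response; fields such as kind and etag are simply ignored.
sample_items = {
    "items": [
        {"kind": "youtube#playlistItem", "etag": "x1", "contentDetails": {"videoId": "vid_1"}},
        {"kind": "youtube#playlistItem", "etag": "x2", "contentDetails": {"videoId": "vid_2"}},
    ]
}

vid_ids = extract_video_ids(sample_items)
id_param = ",".join(vid_ids)  # one comma-separated string for a single call

print(vid_ids)
print(id_param)
```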

Step 4 → Using the Video ID to parse the duration of the video

To figure out how to use the video ID to get the duration, I clicked on the blue (list by video ID) link under the Videos section of the documentation and executed the code. I saw the duration of the video pop up under “contentDetails”, and that got me to the duration I was looking for. You can click on “Show Code” to see the parameters and try this yourself.

That was me exploring the structure of the response and the kind of code needed to parse the duration for the first video_id. If you look at the screenshot above more closely, you can see that the “duration” is nested within items → contentDetails → duration.

Using all this information, we can easily get the duration in ISO 8601 format, for example PT30M13S, which means 30 minutes and 13 seconds. To parse the numbers out of this text, I use regex and finally calculate the total number of minutes.

#Exploring how to parse duration using the video_id 
vid_request =  service.videos().list(part="contentDetails",id= ",".join(vid_ids))
vid_response = vid_request.execute()
#print(vid_response)
#print()
#print(vid_response['items'])
dur = [ sub['contentDetails']['duration'] for sub in vid_response['items'] ]
#print()
print('durations are', dur)

Using Regex to parse time out of the string

I am going to use regex pattern matching to parse out the total minutes. To learn more about regex, please click on this link. You can also read up on groups and pattern matching.

# Parsing hours, minutes and seconds out of each duration with regex

import re
from datetime import timedelta

hours_pattern = re.compile(r'(\d+)H')
minutes_pattern = re.compile(r'(\d+)M')
seconds_pattern = re.compile(r'(\d+)S')
counter = 0  # running video number
for item in vid_response['items']:
    duration = item['contentDetails']['duration']
    hours_parse = hours_pattern.search(duration)
    minutes_parse = minutes_pattern.search(duration)
    seconds_parse = seconds_pattern.search(duration)
    #print(duration)
    minutes = int(minutes_parse.group(1)) if minutes_parse else 0
    hours  = int(hours_parse.group(1)) if hours_parse else 0
    seconds  = int(seconds_parse.group(1)) if seconds_parse else 0
    # total_seconds is a method, so it has to be called
    video_seconds = timedelta(hours = hours, minutes = minutes, seconds = seconds).total_seconds()
    total_mins = round(video_seconds / 60, 1)  # convert seconds to minutes
    counter = counter +1
    print(f'Video Number {counter} is {total_mins} mins')
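The same parsing can be wrapped in a small reusable function. The sketch below swaps the three separate patterns for a single pattern with optional groups; parse_duration is a name chosen for this example. It assumes durations shorter than a day, in the form PT#H#M#S, and raises an error for anything else.

```python
import re
from datetime import timedelta

# One pattern with three optional groups: hours, minutes, seconds.
DURATION_PATTERN = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(duration):
    """Convert a duration string such as 'PT30M13S' into a timedelta."""
    match = DURATION_PATTERN.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    # Groups that did not match are None and count as zero.
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


print(parse_duration("PT30M13S").total_seconds())  # 1813.0
```

Returning a timedelta keeps the units unambiguous: total_seconds() gives seconds, and dividing by 60 or 3600 gives minutes or hours.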

A single request returns at most one page of results (up to 50 items), so the remaining videos of the 61-video playlist need further requests.

The API uses the maxResults parameter to indicate how many items should be included in an API response. Almost all of the API's list methods (videos.list, playlists.list, etc.) support that parameter.
If additional results are available for a query, then the API response will contain either a nextPageToken property, a prevPageToken property, or both. Those properties' values can then be used to set the pageToken parameter to retrieve an additional page of results.
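The token-following logic described above can be sketched independently of the API. Here iter_items is a hypothetical helper: it takes any function that accepts a page token and returns a response dictionary, and keeps requesting pages until no nextPageToken comes back. The fake pages are a stand-in for real API calls.

```python
def iter_items(fetch_page):
    """Yield every item across all pages.

    fetch_page(token) must return a dict with an 'items' list and,
    when more results exist, a 'nextPageToken' string.
    """
    token = None  # no token for the first page
    while True:
        response = fetch_page(token)
        yield from response.get("items", [])
        token = response.get("nextPageToken")
        if not token:
            break  # last page reached


# Stand-in for an API call: two fake pages keyed by page token.
fake_pages = {
    None: {"items": [1, 2], "nextPageToken": "page2"},
    "page2": {"items": [3]},
}

print(list(iter_items(fake_pages.get)))  # [1, 2, 3]
```

With the real service, fetch_page could be a small function that calls service.playlistItems().list(...) with pageToken set to the token it receives and then calls execute() on the request.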

 
from googleapiclient.discovery import build
from datetime import timedelta

# Relies on service and the compiled regex patterns defined earlier
nextPageToken = ''
counter = 0  # running video number across all pages

while True:
    pll_request = service.playlistItems().list(part= 'contentDetails', playlistId=  'PLOU2XLYxmsIL5MoZ5LrrxfVk3V04evsMm',maxResults = 50, pageToken = nextPageToken )
    pll_response = pll_request.execute()
    vid_ids = []
    for item in pll_response['items']:
        vid_ids.append(item['contentDetails']['videoId'])
    vid_request =  service.videos().list(part="contentDetails",id= ",".join(vid_ids))
    vid_response = vid_request.execute()

    for item in vid_response['items']:
        duration = item['contentDetails']['duration']
        hours_parse = hours_pattern.search(duration)
        minutes_parse = minutes_pattern.search(duration)
        seconds_parse = seconds_pattern.search(duration)
        #print(duration)
        minutes = int(minutes_parse.group(1)) if minutes_parse else 0
        hours  = int(hours_parse.group(1)) if hours_parse else 0
        seconds  = int(seconds_parse.group(1)) if seconds_parse else 0
        video_seconds = timedelta(hours = hours, minutes = minutes, seconds = seconds).total_seconds()
        total_mins = round(video_seconds / 60, 1)  # convert seconds to minutes
        counter = counter +1
        print(f'It will take a total of {total_mins} mins to watch Video Number {counter}')

    # Keep going while the response points to another page
    if 'nextPageToken' in pll_response:
        nextPageToken = pll_response['nextPageToken']
    else:
        break

Step 5 → Calculating the total duration of all videos in the playlist by paging through the results

Some API methods may return very large lists of data. To reduce the response size, many of these API methods support pagination. With paginated results, your application can iteratively request and process large lists one page at a time.

pageToken (string): The pageToken parameter identifies a specific page in the result set that should be returned. In an API response, the nextPageToken and prevPageToken properties identify other pages that could be retrieved.
nextPageToken (string): The token that can be used as the value of the pageToken parameter to retrieve the next page in the result set.

# Paging through all results and summing the total duration
import re
from datetime import timedelta
from googleapiclient.discovery import build

sum_total_all_videos = 0  # total duration in seconds
hours_pattern = re.compile(r'(\d+)H')
minutes_pattern = re.compile(r'(\d+)M')
seconds_pattern = re.compile(r'(\d+)S')

nextPageToken = ''

while True:
    pll_request = service.playlistItems().list(part= 'contentDetails', playlistId=  'PLOU2XLYxmsIL5MoZ5LrrxfVk3V04evsMm',maxResults = 50, pageToken = nextPageToken )
    pll_response = pll_request.execute()
    vid_ids = []
    for item in pll_response['items']:
        vid_ids.append(item['contentDetails']['videoId'])
    vid_request =  service.videos().list(part="contentDetails",id= ",".join(vid_ids))
    vid_response = vid_request.execute()

    for item in vid_response['items']:
        duration = item['contentDetails']['duration']
        hours_parse = hours_pattern.search(duration)
        minutes_parse = minutes_pattern.search(duration)
        seconds_parse = seconds_pattern.search(duration)
        #print(duration)
        minutes = int(minutes_parse.group(1)) if minutes_parse else 0
        hours  = int(hours_parse.group(1)) if hours_parse else 0
        seconds  = int(seconds_parse.group(1)) if seconds_parse else 0
        video_seconds = timedelta(hours = hours, minutes = minutes, seconds = seconds).total_seconds()
        sum_total_all_videos += video_seconds

    # Keep going while the response points to another page
    if 'nextPageToken' in pll_response:
        nextPageToken = pll_response['nextPageToken']
    else:
        break
# 3600 seconds in an hour
print('It would take about', round(sum_total_all_videos/3600, 1), 'hours to watch all the videos in the playlist' )
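A total expressed in seconds is easy to misread, so it can help to split it into hours, minutes, and seconds before printing. The sketch below uses divmod for that; format_total is a hypothetical helper, and the inputs are worked examples rather than the real playlist total.

```python
def format_total(total_seconds):
    """Split a number of seconds into hours, minutes and seconds."""
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {seconds}s"


print(format_total(1813))  # 0h 30m 13s
print(format_total(3723))  # 1h 2m 3s
```

The first example matches the PT30M13S duration from Step 4: 30 × 60 + 13 = 1813 seconds.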

Conclusion

You can use the API to search for videos matching specific search terms, topics, locations, publication dates, and much more. The API's search.list method also supports searches for playlists and channels. You can retrieve entire playlists, users’ uploads, and even search results using the YouTube API. You can also add YouTube functionality to your website so that users can upload videos and manage channel subscriptions straight from your website or app.
