TIL: Simplifying URL Parsing with Python's urlparse Library

2023-06-25 • 1 min read


For quite a while now, I’ve been using Pocket as my go-to read-it-later app. A few months back, I found myself wanting a solution to listen to my saved articles. This led me to explore the text-to-speech feature in the Reader app. Reader gave me the convenience of linking with my Pocket account, synchronizing its inbox automatically with my saved articles.

This system has worked well. I’ve been listening to my saved articles on Reader while continuing to save new articles on Pocket. This lets me easily revisit them later if needed.

There was, however, a small issue. I noticed the count of saved items on these two platforms didn’t match. While this didn’t initially bother me much (as I didn’t want to go through hundreds of articles), curiosity eventually got me to investigate this using Pandas. The main source of the discrepancy wasn’t anything major, just a few instances of Reader changing URLs from http to https. But the point of interest here isn’t the mismatch; it’s the Python library that helped me spot it.

Python’s urlparse Library

While looking into this issue, I needed to compare the URLs of the saved items across both platforms. My initial plan was to use regex patterns to split each URL into its parts. However, I stumbled upon urlparse, a function in the urllib.parse module of Python’s standard library. It not only did what I needed but also offers many additional features.

Here’s a glimpse of how it works:

from urllib.parse import urlparse

url_to_parse = "https://saeedesmaili.com/exploring-openai-whisper-with-non-english-voices/?param=query"

# Split the URL into its six components
parsed_url = urlparse(url_to_parse)
print(parsed_url)

Output:

ParseResult(scheme='https', netloc='saeedesmaili.com', path='/exploring-openai-whisper-with-non-english-voices/', params='', query='param=query', fragment='')
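To tie this back to the http/https mismatch, here is a minimal sketch of how the parsed components could be used to compare URLs while ignoring the scheme. The helper name `scheme_agnostic_key` and the two sample lists are hypothetical, made up for illustration; they are not the actual Pocket or Reader exports, and this is not necessarily how the original comparison was done.

```python
from urllib.parse import urlparse


def scheme_agnostic_key(url):
    """Return (netloc, path, query) so http and https variants compare equal."""
    parts = urlparse(url)
    # Host names are case-insensitive, so normalize the netloc to lowercase.
    return (parts.netloc.lower(), parts.path, parts.query)


# Hypothetical sample data standing in for the two exports.
pocket_urls = ["http://example.com/post/", "https://example.org/a?x=1"]
reader_urls = ["https://example.com/post/", "https://example.org/b"]

pocket_keys = {scheme_agnostic_key(u) for u in pocket_urls}
reader_keys = {scheme_agnostic_key(u) for u in reader_urls}

# Items that differ by more than just the scheme.
only_in_pocket = pocket_keys - reader_keys
only_in_reader = reader_keys - pocket_keys
```

With the scheme left out of the key, the two example.com entries are treated as the same article, and only the genuinely different items remain in the set differences.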

Looking back, I wish I’d discovered this library sooner. Last year, I spent quite some time with a regex pattern to extract the domain name and extension (the netloc in urlparse’s output) for a side project.
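For comparison, here is a rough sketch of what that extraction looks like without any regex. The sample URLs are made up for illustration. Note that netloc can include user info and a port, while the hostname attribute gives just the lowercased host.

```python
from urllib.parse import urlparse

# Hypothetical URL with user info and a port, for illustration only.
parts = urlparse("https://user@sub.example.com:8443/page")

netloc = parts.netloc      # full network location, including user info and port
hostname = parts.hostname  # host only, lowercased
port = parts.port          # port as an int, or None if absent

# Caveat: without a scheme or leading "//", urlparse treats the
# whole string as a path, so netloc comes back empty.
bare = urlparse("example.com/page")
```

If the input URLs may be missing a scheme, one option is to check for an empty netloc and handle those cases separately.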

One additional advantage of urlparse is its simplicity in manipulating parts of a URL. Here’s an example: