Webscraping as a side project
A friend and I were looking for a side project to work on together. We realized we both faced a similar problem.
Let's use the AWS Lambda Python runtime as an example. AWS sends out emails when a version reaches its end of life, which makes it difficult to stay on the latest version if that's what you want. Plus, by the time you react to those emails you are usually many versions behind already.
Our journey started. We made a list of providers for the MVP: AWS, GCP and Azure, then a list of the services that have versions (S3, for example, doesn't). After that we realized that we could get some versions through APIs, while other services can only be covered by webscraping.
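To give an idea of the API side, here's a minimal sketch of pulling versions straight from a provider API. The boto3 call is a real one, but the function name is ours and this particular service isn't necessarily one we cover; it's just an illustration:

```python
from typing import List

import boto3


def rds_postgres_versions() -> List[str]:
    """Fetch available RDS Postgres engine versions from the AWS API (sketch)."""
    client = boto3.client("rds")
    # Pagination omitted for brevity; the real call can return multiple pages.
    response = client.describe_db_engine_versions(Engine="postgres")
    return [v["EngineVersion"] for v in response["DBEngineVersions"]]
```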
We support 46 services and counting. Take a look at Cloud Outdated and subscribe to get notified. If you are looking for a service that's not there, contact us.
Picking a language, framework and platform
We're both Python programmers, so the choice was obvious. "Let's use Python and the Django framework for the job," we said. We didn't want to spend our innovation tokens on a new language or framework, so we chose Boring Technology.
For the db we spent our first innovation token. We decided to go with CockroachDB, the flashy new serverless, Postgres-compatible database.
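Because CockroachDB speaks the Postgres wire protocol, wiring it into Django is mostly a settings change. Here's a rough sketch assuming the django-cockroachdb backend; the host, credentials and database name are placeholders, not our real config:

```python
# settings.py (sketch; all values below are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "django_cockroachdb",  # provided by the django-cockroachdb package
        "NAME": "cloud_outdated",
        "USER": "app_user",
        "PASSWORD": "change-me",
        "HOST": "your-cluster.cockroachlabs.cloud",
        "PORT": "26257",
        "OPTIONS": {"sslmode": "verify-full"},
    }
}
```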
On the hosting side we're using AWS Lambda, taking advantage of the free compute time to keep costs down.
Make webscraping reliable
A webpage that's being scraped can change at any time. The first thing we did was account for those edge cases: we created a custom exception that is raised when something has changed, so that we can react to it downstream.
```python
class ScrapingError(Exception):
    pass
```

We wanted to keep the implementation simple. Each service is scraped by a single function. The signature of the function is something like aws_lambda_python() -> List[Version]. All the implementations follow a similar pattern:
```python
def aws_lambda_python():
    # Read versions from the AWS docs website:
    # https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html
    if not found_versions:
        raise ScrapingError
    # Process and return versions
```

That's ^ what we call a poll function.
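To make that concrete, here's a minimal sketch of what such a poll function could look like in practice, using requests and BeautifulSoup. The Version fields, the table selector and the "python" prefix check are illustrative assumptions, not our production code:

```python
from dataclasses import dataclass
from typing import List

import requests
from bs4 import BeautifulSoup


class ScrapingError(Exception):
    """Redefined here so the sketch is self-contained."""


@dataclass(frozen=True)
class Version:
    # Assumed shape of a Version; the real model may have more fields.
    service: str
    version: str


def aws_lambda_python() -> List[Version]:
    url = "https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html"
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Assumption: runtimes appear in the first column of a table. If the page
    # layout changes, nothing matches and ScrapingError fires downstream.
    found_versions = [
        cell.get_text(strip=True)
        for cell in soup.select("table tr td:first-child")
        if cell.get_text(strip=True).startswith("python")
    ]

    if not found_versions:
        raise ScrapingError(f"No Python runtimes found at {url}")

    return [Version(service="aws_lambda_python", version=v) for v in found_versions]
```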
We pass poll functions through a polling class that handles all the errors and results. A detected scraping error is a special case: we send an email with the details of what happened, because the problem requires manual action. We receive that email in our personal inboxes and fix the problem ASAP.
The poll class that handles all the magic behind Cloud Outdated is actually very simple:
```python
class PollService:
    def __init__(self, service: Service, poll_fn: Callable):
        self.poll_fn = poll_fn
        # Some other attributes...

    def poll(self):
        try:
            results = self.poll_fn()
            self.process_results(results)
        except ScrapingError as e:
            notify_operator(
                f"{type(e).__name__} at line {e.__traceback__.tb_lineno} of {__file__}: {e.__str__()}"
            )

    def process_results(self, results):
        # if results contains new versions:
        #     save new versions to db
        # if results contains deprecated versions:
        #     set versions in db as deprecated
        ...
```
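notify_operator is just a thin wrapper that gets the error details into our inboxes. Here's a minimal sketch assuming Amazon SES and placeholder addresses; the real implementation may use a different channel entirely:

```python
import boto3


def notify_operator(message: str) -> None:
    """Email scraping error details to the operators (sketch; addresses are placeholders)."""
    ses = boto3.client("ses")
    ses.send_email(
        Source="alerts@example.com",
        Destination={"ToAddresses": ["ops@example.com"]},
        Message={
            "Subject": {"Data": "Cloud Outdated: scraping error"},
            "Body": {"Text": {"Data": message}},
        },
    )
```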
That polling class is the heart of Cloud Outdated. After that we have to send notifications to subscribed users, and that part is trivial: we send an email that contains the difference between what was last sent to a user and what we have stored in the db at the moment.
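In code, that diff is little more than a set difference. A rough sketch, with a hypothetical helper name and the two version lists passed in rather than loaded from the db:

```python
from typing import Iterable, List


def versions_to_notify(current: Iterable[str], last_notified: Iterable[str]) -> List[str]:
    """Versions stored in the db that weren't in the user's last email (sketch)."""
    return sorted(set(current) - set(last_notified))


# Example: the user was last told about 3.9 and 3.10, and 3.11 now shows up in the db.
assert versions_to_notify(["3.9", "3.10", "3.11"], ["3.9", "3.10"]) == ["3.11"]
```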
Last thoughts
Having a side project is usually a good idea. For us it has been a journey where we got to know some new stuff (CockroachDB). We also learned how to build a product and keep an MVP mentality. The most difficult challenge we face is bringing more users to the platform.
We'd love to see more people subscribed. If this blog post sparked your interest, go to Cloud Outdated and subscribe to start getting emails.
See you next time!