Categories
Tech

PyWebTest Project

A couple of weeks ago, I started a small project to help a friend buy air tickets. I naively imagined that all I needed to do was open the airline’s website, enter some query information, and keep refreshing until a ticket appeared. It didn’t start out well.

Tech stack

  • Python 3
  • Selenium
  • Chrome driver
  • Chrome browser

I’m new to the web crawling/scraping area, but I have done some testing with Selenium before. I thought I could just figure out where to click, enter flight information, search, and then keep refreshing.

The problems

Soon I realized that the airline website has a mechanism to block this kind of automation. After I finished coding the functionality above, I ran my code with the refresh interval set to 30 seconds. It went well for about 20 minutes, so I patted myself on the back, thought everything worked perfectly, and went off to do some housework.

An hour later, I came back and saw a Google reCAPTCHA screen. Okay, the airline website knew that I was using a bot to refresh for a new airline ticket. I added some random sleep time on top of my 30-second refresh interval. After another 10 minutes or so, my IP address was blocked. 😶
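The "random sleep time on top of the 30-second interval" idea can be sketched roughly as below. This is only an illustration, not the code from the project: the function names, the 15-second jitter ceiling, and the `refresh` callable are all assumptions. As the story above shows, jitter alone did not stop the site from blocking the bot.

```python
import random
import time


def jittered_delay(base_seconds=30.0, max_jitter_seconds=15.0, rng=random):
    """Return the base interval plus a random extra wait.

    The 15-second ceiling is an assumed value for illustration.
    """
    return base_seconds + rng.uniform(0.0, max_jitter_seconds)


def poll(refresh, attempts, base_seconds=30.0, max_jitter_seconds=15.0,
         sleep=time.sleep):
    """Call refresh() up to `attempts` times with a jittered pause in between.

    `refresh` is a hypothetical callable that reloads the search page and
    returns True when a ticket shows up. Returns True as soon as that
    happens, or False if every attempt comes back empty.
    """
    for attempt in range(attempts):
        if refresh():
            return True
        # No need to wait after the final attempt.
        if attempt < attempts - 1:
            sleep(jittered_delay(base_seconds, max_jitter_seconds))
    return False
```

With Selenium, `refresh` could be a small function that calls `driver.refresh()` and then checks the page for results. The `sleep` parameter is injectable so the loop can be exercised without actually waiting.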

Solution

The first thing I could think of was using a VPN, so I bought a VPN service to continue developing the bot. That temporarily solved the blocked IP address issue.

Next, I had to solve the Google reCAPTCHA problem. There were two ways to go. One is to use some sort of AI library to break it. That route is rather difficult because Google reCAPTCHA is very sophisticated and hard to get past (otherwise, they’d be out of business already 😂). That left me with the second option, faking the user agent header, which is my only alternative for now. A user agent header is a piece of information your browser sends with every HTTP/HTTPS request. It describes which browser you’re using and which operating system you’re on. The basic principle is to randomly generate a new identity for my browser and trick Google reCAPTCHA into thinking it’s a different device. I found a perfect library called fake-useragent that does exactly this for me, and it works!
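Here is a minimal sketch of the user agent idea, using only the standard library. The sample strings and both function names are made up for illustration. In the real project the string would come from fake-useragent (its `UserAgent().random` attribute returns a random user agent string) rather than from a hard-coded list.

```python
import random

# Hypothetical sample values for illustration only; fake-useragent
# supplies a much larger pool of real-world strings.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]


def pick_user_agent(pool=SAMPLE_USER_AGENTS, rng=random):
    """Pick one user agent string at random from the pool."""
    return rng.choice(pool)


def chrome_user_agent_argument(user_agent):
    """Build the Chrome command-line switch that overrides the user agent."""
    return f"--user-agent={user_agent}"
```

With Selenium, the resulting switch can be passed to `ChromeOptions.add_argument(...)` before creating the driver with `webdriver.Chrome(options=options)`, so each new browser session presents a different identity.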

After that, the rest is just code to enter flight information and search. Most of the code can be reused for UI testing and web crawling, so I decided to open-source part of it. That is how the PyWebTest project started.

https://github.com/lokarithm/PyWebTest

Cheers,
Lok


A Cheatsheet Of Linux File System And Structure

Although I’m not completely new to the Linux operating system, its file system is still confusing to me. I watched a video about the Linux file system structure, so it’s time to put down some notes and get familiar with it. Let’s go!

  • /bin – binaries:
    • For programs and applications.
  • /sbin – system binaries:
    • For root/admin users only.
  • /boot – contains everything a system needs to boot
    • e.g. boot loader
  • /dev – devices:
    • Everything is a file in Linux. Hardware such as disks, webcams, and keyboards is represented here as device files.
  • /etc – etcetera:
    • It stores all the system-wide configuration files.
  • /lib – libraries (includes lib, lib32 and lib64)
    • Libraries of applications
  • /media and /mnt – mount:
    • For other mounted drives. Traditionally you would see /mnt rather than the /media directory. However, most distros nowadays automatically mount devices for you in the /media directory.
    • When you mount a drive manually, use the /mnt directory
    • e.g. external hard drive, USB flash drive, network drive
  • /opt – optional:
    • Manually installed software lives here. Some software packages found in the repo can also be found here.
    • You can also put your own software in this directory.
  • /proc – processes:
    • Pseudo files that contain information about system processes and resources.
    • e.g. A directory that contains information on a running process; information of the CPU, etc.
  • /root – home folder of a root user
    • A directory where only a root user has access.
  • /run:
    • A relatively new directory. It is a tmpfs file system, which means it lives in RAM. Anything in this directory is deleted when the system reboots. It is used by processes that start early in the boot procedure to store runtime information.
  • /snap:
    • It stores snap packages (mainly used on Ubuntu). Snap packages are self-contained applications that run differently from other applications.
  • /srv – server:
    • It stores server data, i.e. files that will be accessed by external users. e.g. if you set up an FTP server, its files can go here, separated from the other files for security purposes.
  • /sys – system:
    • A way to interact with the kernel. This is also a temporary directory, created every time the system boots up, similar to the /run directory.
  • /tmp – temporary:
    • Files stored temporarily by applications. It’s usually empty after you reboot a system.
    • e.g. temporary files of a word processor application.
  • /usr – user application:
    • Applications installed by a user or for a user only. Applications installed in this directory are considered non-essential for basic system operation.
    • Under the /usr directory, there are folders such as lib, bin, and sbin.
    • /usr/local contains software installed from the source code.
    • /usr/share contains shared, architecture-independent data such as documentation and icons.
    • /usr/src contains installed source code such as kernel header files.
    • Different software or distros may treat these folders differently.
  • /var – variable:
    • Files and directories that are expected to grow in size.
    • e.g. /var/crash contains information about processes that have crashed; /var/log contains log files of many applications.
  • /home:
    • Each user has their own folder under /home.
    • Storage of your personal files and documents.
    • Each user can only access their own /home folder unless they use admin permissions.
    • It has some hidden directories that start with a dot. e.g. .cache, .config. These hidden folders are used by different applications for their settings. You can see them by using the ls -a command in the terminal.
    • You can back up the hidden directories and restore them in a new system. After reinstalling your applications, the settings will be restored.
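To see which hidden directories would be candidates for such a backup, here is a small Python sketch that does roughly what `ls -a` shows, filtered down to dot-directories. The function name is made up for illustration, and it only lists names; it does not copy anything.

```python
from pathlib import Path


def hidden_directories(home):
    """Return the sorted names of dot-directories directly inside `home`."""
    return sorted(
        entry.name
        for entry in Path(home).iterdir()
        # Keep directories only; hidden files such as .bashrc are skipped.
        if entry.is_dir() and entry.name.startswith(".")
    )
```

Calling `hidden_directories(Path.home())` lists the hidden folders of the current user. Which of them are worth backing up depends on the applications you use.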

Credit/Source of information: DorianDotSlash