How to scrape a website that requires login with the Python Scrapy framework (filling login forms automatically)
We often have to write spiders that need to log in to sites in order to scrape data from them. Our customers provide us with the site, username and password, and we do the rest.
The classic way to approach this problem is:
1. Launch a browser, go to the site and find the login page.
2. Inspect the source code of the page to find out:
   I. which form is the login form (a page can have many forms, but usually only one of them is the login form)
   II. which field names are used for the username and password (these vary a lot from site to site)
   III. whether there are other fields that must be submitted (like an authentication token)
3. Write the Scrapy spider to replicate the form submission using FormRequest.
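To make point 2 concrete, here is a rough sketch of the inspection done in code, using only the Python standard library. It is not how any particular library does it: the names `FormCollector` and `find_login_form` are made up for this illustration, and the heuristic is an assumption (the login form is the first form with a password input, the username is its first text or email input, and hidden inputs such as tokens are passed through unchanged).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormCollector(HTMLParser):
    """Hypothetical helper: collect every <form> and its named <input> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> found on the page
        self._current = None  # the form being parsed right now, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").upper(),
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current["inputs"].append({
                "name": attrs["name"],
                "type": (attrs.get("type") or "text").lower(),
                "value": attrs.get("value") or "",
            })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_login_form(page_url, html, username, password):
    """Return (action_url, method, form_data) for the login form, or None."""
    collector = FormCollector()
    collector.feed(html)
    for form in collector.forms:
        # I. Assume the login form is the one that has a password field.
        if not any(field["type"] == "password" for field in form["inputs"]):
            continue
        data = {}
        user_field = None
        for field in form["inputs"]:
            if field["type"] == "password":
                data[field["name"]] = password
            elif field["type"] in ("text", "email") and user_field is None:
                # II. Assume the first text-like field holds the username.
                user_field = field["name"]
                data[user_field] = username
            elif field["type"] == "hidden":
                # III. Keep hidden fields (e.g. an authentication token) as-is.
                data[field["name"]] = field["value"]
        return urljoin(page_url, form["action"]), form["method"], data
    return None
```

In a spider, the returned values could then be handed to Scrapy for point 3, for example `FormRequest(url, method=method, formdata=data)`. Real login pages are messier than this heuristic assumes, which is why doing it by hand takes time.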
Being fans of automation, we figured we could write some code to automate point 2 (which is actually the most time-consuming). The result is loginform, a library that automatically fills login forms given the login page, username and password.
A simple spider can use loginform to log in to sites automatically.