Robin is a site that aims to help people choose the parts to assemble their computer. Robin collects data from sites that sell computer parts and returns the most affordable prices.
We obtained the data for the computer parts through web scraping, a form of data mining that extracts data from websites and converts it into structured information for later analysis. The framework used to obtain the data was Selenium, in Python.
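To make "converting them into structured information" concrete, here is a minimal sketch that turns a small piece of HTML into a list of dictionaries. It uses the standard library's `html.parser` in place of Selenium so that it runs without a browser. The markup, class names and product names in `SAMPLE_HTML` are hypothetical and do not reflect the structure of any real store.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real store pages differ.
SAMPLE_HTML = """
<div class="product"><h2>Memory 8GB DDR4</h2><span class="price">R$ 678,74</span></div>
<div class="product"><h2>SSD 480GB</h2><span class="price">R$ 249,90</span></div>
"""


class ProductParser(HTMLParser):
    """Collects one {'name', 'price'} dict per product block."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "product":
            self.products.append({})  # start a new product record
        elif tag == "h2":
            self._field = "name"
        elif tag == "span" and attrs.get("class") == "price":
            self._field = "price"

    def handle_data(self, data):
        # Store the text only when it follows a tag we care about.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        self._field = None


parser = ProductParser()
parser.feed(SAMPLE_HTML)
products = parser.products
```

Each entry in `products` is a plain dictionary, which is the kind of structured record that can later be compared across sites.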
Each site has a specific structure for its data:
Pichau was the site that gave us the most difficulty: the structure of the site changes depending on the computer it is opened on. To keep the code working on every computer we ran it on, we used the socket library to identify the machine's IP address and save a specific structure for each IP.
import socket
hostIP = socket.gethostname()  # local host name (a name, not yet an IP address)
IP = socket.gethostbyname(hostIP)  # IP address that the host name resolves to
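One possible way to "save a specific structure for each IP" is a lookup table keyed by IP address. The sketch below is an assumption about how that could look, not the project's actual code: `STRUCTURES_BY_IP`, `DEFAULT_STRUCTURE`, `local_ip`, `structure_for`, the example IP and the CSS selectors are all hypothetical placeholders.

```python
import socket

# Hypothetical selectors: placeholders, not Pichau's real markup.
DEFAULT_STRUCTURE = {"name": "h2", "price": "span.price"}

# Hypothetical table mapping a machine's IP to the layout it sees.
STRUCTURES_BY_IP = {
    "192.168.0.10": {"name": "h2.title", "price": "div.price"},
}


def local_ip():
    """Return the IP address that this machine's host name resolves to."""
    return socket.gethostbyname(socket.gethostname())


def structure_for(ip, structures=STRUCTURES_BY_IP, default=DEFAULT_STRUCTURE):
    """Pick the selectors saved for this IP, falling back to the default."""
    return structures.get(ip, default)
```

Calling `structure_for(local_ip())` would then select the layout for the current machine. Note that on some systems the host name resolves to a loopback address, in which case the lookup falls back to the default.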
Another issue was the small fade-in effect that the site applies from the third row of products onward, so the item images only start to appear in the page's HTML source after you scroll down.
To solve this problem, we used a Selenium command to get the full height of the page and then scrolled down automatically, step by step, until that height was reached.
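from selenium import webdriver

driver = webdriver.Chrome()  # assumption: the source does not say which browser was used
driver.get("https://www.pichau.com.br")  # assumption: the page to scrape is opened first
height = driver.execute_script("return document.body.scrollHeight")
scroll = 0  # current vertical offset in pixels
while scroll < height:
    driver.execute_script(f"window.scrollTo(0, {scroll});")
    scroll += 200  # move down 200 px per step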
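The same scrolling loop can be wrapped in a reusable function. The sketch below is one possible variation, not the project's code: the name `scroll_to_bottom` and the `step` parameter are hypothetical. It also measures the page height again after every step, on the assumption that lazily loaded content can make the page taller while scrolling.

```python
def scroll_to_bottom(driver, step=200):
    """Scroll down in `step`-pixel increments until the bottom of the page.

    `driver` is any object with an `execute_script` method,
    such as a Selenium WebDriver. Returns the final scroll offset.
    """
    scroll = 0
    height = driver.execute_script("return document.body.scrollHeight")
    while scroll < height:
        driver.execute_script(f"window.scrollTo(0, {scroll});")
        scroll += step
        # Lazily loaded content can make the page taller, so measure again.
        height = driver.execute_script("return document.body.scrollHeight")
    return scroll
```

With a real browser, a short pause between steps may be needed so that the fade-in has time to finish before the next scroll. The right delay depends on the site and is not covered here.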
With those problems solved, it was time to get the specifications of each part, such as price and name.
With that in mind, we chose this list of specifications in Pichau:
Specification | Data |
---|---|
Installment price | R$ 771,29 |
Price | R$ 678,74 |
Name | MEMORIA TEAM GROUP T-FORCE DELTA RGB, 8GB(1X8GB), DDR4, 3200MHZ, C16, BRANCO, TF4D48G3200HC16C01 |
Link | https://www.pichau.com.br/memoria-team-group-t-force-delta-rgb-8gb-1x8gb-ddr4-3200mhz-c16-branco-tf4d48g3200hc16c01 |
Image link | https://media.pichau.com.br/media/catalog/product/cache/2f958555330323e505eba7ce930bdf27/t/f/tf4d48g3200hc16c011.jpg |
Scraping time | 09/07/2022 23:22:34 |
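The fields in the table above can be gathered into one record per product, with the prices converted to numbers so that offers can be compared. This is a minimal sketch, not the project's code: `parse_brl`, `build_record` and `cheapest` are hypothetical helper names, and the timestamp format assumes the day/month/year order used in Brazil.

```python
from datetime import datetime
from decimal import Decimal


def parse_brl(text):
    """Convert a price such as 'R$ 771,29' or 'R$678,74' to a Decimal."""
    number = text.replace("R$", "").strip()
    # Brazilian format: '.' separates thousands, ',' separates decimals.
    number = number.replace(".", "").replace(",", ".")
    return Decimal(number)


def build_record(name, price, installment_price, link, image_link, scraped_at):
    """Assemble one row with the fields listed in the table."""
    return {
        "installment_price": parse_brl(installment_price),
        "price": parse_brl(price),
        "name": name,
        "link": link,
        "image_link": image_link,
        # Assumed day/month/year format, e.g. 09/07/2022 23:22:34.
        "scraped_at": scraped_at.strftime("%d/%m/%Y %H:%M:%S"),
    }


def cheapest(records):
    """Return the record with the lowest price."""
    return min(records, key=lambda record: record["price"])
```

Using `Decimal` instead of `float` avoids rounding surprises when comparing monetary values.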
📜 Project developed in the Entra21 morning Python class