📝 Explain how to use Syphoon proxies and new commands
poneoneo committed Sep 20, 2024 · commit 6dec24e · README.md
Alibaba-CLI-Scraper is a Python CLI tool designed to scrape, save, and interact with data from Alibaba.com.
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Using the CLI Interface](#using-the-cli-interface)
- [Important Information](#important-information)
- [Available Sub-Commands](#available-sub-commands)
- [Scraper and syphoon-scraper subcommands](#scraper-and-syphoon-scraper-subcommands)
- [How to set my API key?](#how-to-set-my-api-key)
- [db-init Sub-Command](#db-init-sub-command)
- [db-update Sub-Command](#db-update-sub-command)
- [export-as-csv Sub-Command](#export-as-csv-sub-command)
- [ai-agent Sub-Command](#ai-agent-sub-command)
- [Contributions Welcome!](#contributions-welcome)
- [License](#license)

### Features:

- **Asynchronous API:** Uses Playwright's asynchronous API for efficient handling of many result pages.
- **Two Proxy Providers:** Supports both Syphoon and Bright Data.

- **Text Mode:** Provides an easy-to-use text mode for those who are not comfortable with commands in the terminal.


- Python 3.11 or Higher


- Windows or Linux as OS
- An API key from [Syphoon](https://account.syphoon.com) or [BrightData](https://get.brightdata.com/fdrqnme1smdc). To learn how to set your API key, see [here](#how-to-set-my-api-key).

### Installation

Expand Down Expand Up @@ -195,7 +198,7 @@ If you'd like to use `pip` instead, just replace `pipx` with `pip` but obviously

<div align="center">
<p>
<a href="#"><img src="images\help-cli-2.png" width="900" height="340" alt="aba-run help image" /></a>
<a href="#"><img src="images\aba-run-help-image.png" width="900" height="340" alt="aba-run help image" /></a>
</p>

</div>

---

#### Important Information

- **`initialize`:** Creates a new MySQL or SQLite database with `products` and `suppliers` tables in it, which will be used to store your scraped data. For the MySQL engine especially, you will need to create an empty database on your MySQL server first.

- **`update`:** Adds your scraped data to a newly initialized MySQL or SQLite database. This action cannot be performed twice on the same database.

- **`aba-run`:** The base command. All commands introduced below are sub-commands and should always be preceded by `aba-run`.

Practice makes perfect, doesn't it? So let's get started with a use-case example. Let's assume you want to scrape data about `electric bikes` from Alibaba.com.
#### Available Sub-Commands

##### Scraper and syphoon-scraper subcommands
<div align="center">
<p>
<a href="#"><img src="images\syphoon-scraper-help-options.png" width="900" height="340" alt="aba-run help image" /></a>
</p>

</div>

Depending on which proxy provider you prefer, choose between two sub-commands:

- **`syphoon-scraper`:** Use this if you want the Syphoon proxy provider.

- **`scraper`:** Set to use the BrightData proxy provider by default.

###### How to set my API key?

```shell
aba-run set-api-key your_proxy_provider_name
```
Replace `your_proxy_provider_name` with `syphoon` or `brightdata`, depending on your choice.
After that, a prompt will appear waiting for you to enter your API key.
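For example, registering a Syphoon key looks roughly like this (the exact prompt wording may differ from this sketch):

```shell
aba-run set-api-key syphoon
# Enter your API key: <paste your Syphoon key here>
```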

Both of the above sub-commands have the same options, but I will use the `syphoon-scraper` sub-command as the example.


- **`scraper` sub-command:** Initiates scraping of Alibaba.com based on the provided keywords.
This command takes two required arguments and two optional arguments:
  - **`key_words` (required):** The search term(s) for finding products on Alibaba. Enclose multiple keywords in quotes.
  - **`--page-results` or `-pr` (required):** Keywords usually match many result pages, so indicate how many of them you want to pull. If no value is provided, `10` is used by default.
  - **`--html-folder` or `-hf` (optional):** Specifies the directory to store the raw HTML files. If omitted, a folder named after the sanitized keywords is created automatically; in this case `electric_bikes` would be used as the results folder name.
  - **`--sync-api` or `-sa` (optional):** Flag indicating that you want to use sync mode. By default, `async` mode is used.

**Example**:
```shell
aba-run syphoon-scraper "electric bikes" -hf "bike_results" -pr 15
```

Sync mode is also available, but you will quickly get blocked:

```shell
aba-run syphoon-scraper "electric bikes" -hf "bike_results" -pr 15 --sync-api
```
and voila!

Now the `bike_results` directory (the name you provided) has been created and should contain all the HTML files from alibaba.com matching your keywords.
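For orientation, the results folder will look something like this (file names are illustrative; the scraper decides the actual naming scheme):

```
bike_results/
├── page_1.html
├── page_2.html
└── ...
```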

##### db-init Sub-Command
<div align="center">
<p>
<a href="#"><img src="images\db-init-help-options.png" width="900" height="340" alt="aba-run help image" /></a>
</p>
</div>

- **`db-init` sub-command:** Creates a new MySQL/SQLite database with `products` and `suppliers` tables in it.
This command takes one required argument and several engine-dependent options:
  - **`engine` (required):** Choose either `sqlite` or `mysql`. Set to `sqlite` by default.
  - **`--sqlite-file` or `-f` (optional, SQLite only):** The name for your SQLite database file (without any extension).
  - **`--host` or `-h`, `--port` or `-p`, `--user` or `-u`, `--password` or `-pw`, `--db-name` or `-db` (required for MySQL):** Your MySQL database connection details.
  - **`--only-with` or `-ow` (optional, MySQL only):** Use this flag if you just want to update some of the credentials stored in the `db_credentials.json` file, but not all of them.

**NB:** `--host` and `--port` default to `localhost` and `3306`, respectively. Also, when you initialize your database with the MySQL engine for the first time, you must set the `--user`, `--password`, and `--db-name` arguments. This creates a `db_credentials.json` file in your current directory with your credentials, so you won't have to set them again next time; you will then only need to set the fields that matter when the time comes to [update](#important-information) your database.

**MySQL Use case:**

```shell
aba-run db-init mysql -u "mysql_username" -pw "mysql_password" -db "alibaba_products"
```
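After this first MySQL run, the generated `db_credentials.json` should contain roughly the following fields (the key names here are assumptions based on the CLI options, not the file's guaranteed layout):

```json
{
  "host": "localhost",
  "port": 3306,
  "user": "mysql_username",
  "password": "mysql_password",
  "db_name": "alibaba_products"
}
```

On later runs, `--only-with` then lets you change just one of these values without retyping the rest.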

**SQLite Use case:**

```shell
aba-run db-init sqlite --sqlite-file alibaba_data
```

The `db-init` sub-command uses the SQLite engine by default, so if you plan to use SQLite you can simply run:

**SQLite Use case V2:**

```shell
aba-run db-init -f alibaba_data
```

As soon as your database has been initialized, you can update it with the scraped data.

---

##### db-update Sub-Command
<div align="center">
<p>
<a href="#"><img src="images\db-update-help.png" width="900" height="340" alt="aba-run help image" /></a>
</p>

</div>

- **`db-update` sub-command:** Adds scraped data from HTML files to your database (you can't run this command twice against the same database, to avoid a UNIQUE constraint error).
This command takes two required arguments and two optional arguments:
  - **`--db-engine` (required):** Select your database engine: `sqlite` or `mysql`. Set to `sqlite` by default.
  - **`--kw-results`/`-kr` (required):** The path to the folder containing the HTML files generated by the `scraper` sub-command.
  - **`--filename`/`-f` (required for SQLite):** If you're using SQLite, provide the desired filename for your database, without any extension.
  - **`--db-name`/`-db` (optional for MySQL):** If you're using the MySQL engine and want to push the data to a different database, provide the desired database name.
**MySQL Use case:**

The command below assumes that you already have your database credentials in the `db_credentials.json` file to autocomplete the required parameters; if not, this will raise an error.

```shell
aba-run db-update mysql --kw-results bike_results\
```
**NB:** What if you want to change something while updating the database? Suppose you have run another scraping command and want to save this data under another database name, without updating the credentials file or rewriting all those parameters just to change the database name. Simply run:

```shell
aba-run db-update mysql --kw-results another_keyword_folder_result\ --db-name "another_database_name"
```
**SQLite Use case:**
```shell
aba-run db-update sqlite --kw-results bike_results\ --filename alibaba_data
```
---
##### export-as-csv Sub-Command
<div align="center">
<p>
<a href="#"><img src="images\export-as-csv-demo.gif" width="900" height="340" alt="command result 1" /></a>
</p>
<p align="center">
<a href="#"><img src="images\export-as-csv-help.png" width="900" height="340" alt="aba-run help image" /></a>
</p>
</div>
- **`export-as-csv` sub-command:** Exports scraped data from your SQLite database to a CSV file. This CSV file will contain a `FULL OUTER JOIN` of the `products` and `suppliers` tables.
This command takes two required arguments:
  - **`--sqlite_file` (required):** The name of your SQLite database file, with its extension.
  - **`--to` or `-t` (required):** The name of your CSV file, with its extension.
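A minimal invocation, reusing the database created above (the `.db` extension here is an assumption; use whatever extension your file actually has):

```shell
aba-run export-as-csv --sqlite_file alibaba_data.db --to alibaba_data.csv
```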
##### ai-agent Sub-Command
<div align="center">
<p>
<a href="#"><img src="images\ai-agent-help.png" width="900" height="340" alt="aba-run help image" /></a>
</p>
<div align="center">
</div>
</div>
</details>
The purpose of this command is to provide a way to interact with your scraped data in plain English.
- You will be able to build a query, e.g. "list all suppliers in china". In this case the answer will be a pretty table with the names of the suppliers.
- Or, e.g., "plot the price of all the products in china". In this case the answer will be a line chart with the prices of all the products in China.
- **`ai-agent` sub-command:** Calls an AI agent to interact with your scraped data in plain English.
This command takes two arguments:
  - **`query` (required):** The content of the query that you want to ask the AI agent.
  - **`--csv-file` or `-f` (required):** The name of your CSV file, with its extension.
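A quick illustrative run, reusing the CSV exported above (the query wording is just an example):

```shell
aba-run ai-agent "list all suppliers in china" --csv-file alibaba_data.csv
```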
## Contributions Welcome!