
Commit 0d1c2a3

Merge pull request #59 from VinciGit00/ollama_integration
add examples for Local models
2 parents 8ff27be + e0978c5 commit 0d1c2a3

28 files changed, 429 additions and 39 deletions

.github/workflows/pylint.yml

Lines changed: 1 addition & 1 deletion
@@ -20,4 +20,4 @@ jobs:
         pip install pylint
         pip install -r requirements.txt
     - name: Analysing the code with pylint
-      run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py examples/**/*.py tests/**/*.py
+      run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py

README.md

Lines changed: 35 additions & 4 deletions
@@ -43,14 +43,45 @@ Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.c
 You can use the `SmartScraper` class to extract information from a website using a prompt.
 
 The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting informations using a local LLM
+### Case 1: Extracting information using Ollama
+Remember to download the model on Ollama separately!
+```python
+from scrapegraphai.graphs import SmartScraperGraph
+
+graph_config = {
+    "llm": {
+        "model": "ollama/mistral",
+        "temperature": 0,
+        "format": "json",  # Ollama needs the format to be specified explicitly
+        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
+    },
+    "embeddings": {
+        "model": "ollama/nomic-embed-text",
+        "temperature": 0,
+        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
+    }
+}
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="List me all the news with their description.",
+    # also accepts a string with the already downloaded HTML code
+    source="https://perinim.github.io/projects",
+    config=graph_config
+)
+
+result = smart_scraper_graph.run()
+print(result)
+
+```
+
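Note on "Remember to download the model on Ollama separately!": the check below is a minimal sketch, not part of this commit, of how one might verify that the models named in `graph_config` are already present on the Ollama server. It assumes the server's `GET /api/tags` endpoint, which lists local models under a `models` key with names such as `mistral:latest`. The helper names (`installed_models`, `required_models`, `missing_models`, `fetch_tags`) are hypothetical, and the comparison deliberately ignores the `:tag` suffix.

```python
import json
from urllib.request import urlopen


def installed_models(tags_payload):
    """Return base model names from an Ollama /api/tags payload (tags stripped)."""
    names = set()
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        if name:
            names.add(name.split(":", 1)[0])
    return names


def required_models(graph_config):
    """Collect the 'ollama/<name>' models referenced in a graph_config dict."""
    names = set()
    for section in graph_config.values():
        model = section.get("model", "") if isinstance(section, dict) else ""
        if model.startswith("ollama/"):
            # Drop the 'ollama/' prefix and any ':tag' suffix.
            names.add(model.split("/", 1)[1].split(":", 1)[0])
    return names


def missing_models(tags_payload, graph_config):
    """Return the sorted list of required models that the server does not list."""
    return sorted(required_models(graph_config) - installed_models(tags_payload))


def fetch_tags(base_url="http://localhost:11434"):
    """Query a running Ollama server for its local models (needs network access)."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


# Offline demonstration with a hand-written payload (an assumption, not real server output).
sample_payload = {"models": [{"name": "mistral:latest"}]}
sample_config = {
    "llm": {"model": "ollama/mistral"},
    "embeddings": {"model": "ollama/nomic-embed-text"},
}
print(missing_models(sample_payload, sample_config))
```

Against a live server, `missing_models(fetch_tags(), graph_config)` would give the names still to be downloaded, for example with `ollama pull <name>`.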
+### Case 2: Extracting information using Docker
 
 Note: before using the local model remember to create the Docker container!
 ```text
 docker-compose up -d
 docker exec -it ollama ollama run stablelm-zephyr
 ```
-You can use which model you want instead of stablelm-zephyr
+You can use any model available on Ollama, or your own model, instead of stablelm-zephyr
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
 
@@ -75,7 +106,7 @@ print(result)
 ```
 
 
-### Case 2: Extracting informations using Openai model
+### Case 3: Extracting information using an OpenAI model
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
 OPENAI_API_KEY = "YOUR_API_KEY"
@@ -98,7 +129,7 @@ result = smart_scraper_graph.run()
 print(result)
 ```
 
-### Case 3: Extracting informations using Gemini
+### Case 4: Extracting information using Gemini
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
 GOOGLE_APIKEY = "YOUR_API_KEY"

examples/gemini/readme.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+This folder contains an example of how to use ScrapeGraph-AI with Gemini, a large language model (LLM) from Google AI. The example shows how to extract information from a website using a natural language prompt.

examples/gemini/results/result.csv

Lines changed: 0 additions & 2 deletions
This file was deleted.

examples/gemini/results/result.json

Lines changed: 0 additions & 1 deletion
This file was deleted.

examples/local_models/Docker/readme.md

Whitespace-only changes.

examples/local_models/smart_scraper_local.py renamed to examples/local_models/Docker/smart_scraper_docker.py

Lines changed: 0 additions & 13 deletions
@@ -6,27 +6,14 @@
 # ************************************************
 # Define the configuration for the graph
 # ************************************************
-"""
-Avaiable models:
-- ollama/llama2
-- ollama/mistral
-- ollama/codellama
-- ollama/dolphin-mixtral
-- ollama/mistral-openorca
-"""
 
 graph_config = {
     "llm": {
         "model": "ollama/mistral",
         "temperature": 0,
         "format": "json",  # Ollama needs the format to be specified explicitly
         # "model_tokens": 2000, # set context length arbitrarily,
-        # "base_url": "http://ollama:11434", # set ollama URL arbitrarily
     },
-    "embeddings": {
-        "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-    }
 }
 
 # ************************************************
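After this hunk the Docker example keeps only the `llm` section, while the Ollama examples added elsewhere in the commit also set `base_url` and an `embeddings` section. The sketch below is one possible way to build both variants from a single helper. `make_ollama_config` is a hypothetical name, not a ScrapeGraph-AI function, and the diff does not show which defaults the library applies when `base_url` or `embeddings` is omitted.

```python
from typing import Any, Dict, Optional


def make_ollama_config(
    model: str = "ollama/mistral",
    base_url: Optional[str] = None,
    embeddings_model: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a graph_config dict with the same keys used in this commit."""
    llm: Dict[str, Any] = {
        "model": model,
        "temperature": 0,
        "format": "json",  # the examples state Ollama needs this set explicitly
    }
    if base_url is not None:
        llm["base_url"] = base_url

    config: Dict[str, Any] = {"llm": llm}

    # Only add an embeddings section when a model is requested.
    if embeddings_model is not None:
        embeddings: Dict[str, Any] = {"model": embeddings_model, "temperature": 0}
        if base_url is not None:
            embeddings["base_url"] = base_url
        config["embeddings"] = embeddings
    return config


# Shape of the Docker example after this commit: llm section only.
docker_config = make_ollama_config()

# Shape of the Ollama examples: explicit URL plus an embeddings model.
ollama_config = make_ollama_config(
    base_url="http://localhost:11434",
    embeddings_model="ollama/nomic-embed-text",
)
print(docker_config)
print(ollama_config)
```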
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+<?xml version="1.0"?>
+<catalog>
+   <book id="bk101">
+      <author>Gambardella, Matthew</author>
+      <title>XML Developer's Guide</title>
+      <genre>Computer</genre>
+      <price>44.95</price>
+      <publish_date>2000-10-01</publish_date>
+      <description>An in-depth look at creating applications
+      with XML.</description>
+   </book>
+   <book id="bk102">
+      <author>Ralls, Kim</author>
+      <title>Midnight Rain</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2000-12-16</publish_date>
+      <description>A former architect battles corporate zombies,
+      an evil sorceress, and her own childhood to become queen
+      of the world.</description>
+   </book>
+   <book id="bk103">
+      <author>Corets, Eva</author>
+      <title>Maeve Ascendant</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2000-11-17</publish_date>
+      <description>After the collapse of a nanotechnology
+      society in England, the young survivors lay the
+      foundation for a new society.</description>
+   </book>
+   <book id="bk104">
+      <author>Corets, Eva</author>
+      <title>Oberon's Legacy</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2001-03-10</publish_date>
+      <description>In post-apocalypse England, the mysterious
+      agent known only as Oberon helps to create a new life
+      for the inhabitants of London. Sequel to Maeve
+      Ascendant.</description>
+   </book>
+   <book id="bk105">
+      <author>Corets, Eva</author>
+      <title>The Sundered Grail</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2001-09-10</publish_date>
+      <description>The two daughters of Maeve, half-sisters,
+      battle one another for control of England. Sequel to
+      Oberon's Legacy.</description>
+   </book>
+   <book id="bk106">
+      <author>Randall, Cynthia</author>
+      <title>Lover Birds</title>
+      <genre>Romance</genre>
+      <price>4.95</price>
+      <publish_date>2000-09-02</publish_date>
+      <description>When Carla meets Paul at an ornithology
+      conference, tempers fly as feathers get ruffled.</description>
+   </book>
+   <book id="bk107">
+      <author>Thurman, Paula</author>
+      <title>Splish Splash</title>
+      <genre>Romance</genre>
+      <price>4.95</price>
+      <publish_date>2000-11-02</publish_date>
+      <description>A deep sea diver finds true love twenty
+      thousand leagues beneath the sea.</description>
+   </book>
+   <book id="bk108">
+      <author>Knorr, Stefan</author>
+      <title>Creepy Crawlies</title>
+      <genre>Horror</genre>
+      <price>4.95</price>
+      <publish_date>2000-12-06</publish_date>
+      <description>An anthology of horror stories about roaches,
+      centipedes, scorpions and other insects.</description>
+   </book>
+   <book id="bk109">
+      <author>Kress, Peter</author>
+      <title>Paradox Lost</title>
+      <genre>Science Fiction</genre>
+      <price>6.95</price>
+      <publish_date>2000-11-02</publish_date>
+      <description>After an inadvertant trip through a Heisenberg
+      Uncertainty Device, James Salway discovers the problems
+      of being quantum.</description>
+   </book>
+   <book id="bk110">
+      <author>O'Brien, Tim</author>
+      <title>Microsoft .NET: The Programming Bible</title>
+      <genre>Computer</genre>
+      <price>36.95</price>
+      <publish_date>2000-12-09</publish_date>
+      <description>Microsoft's .NET initiative is explored in
+      detail in this deep programmer's reference.</description>
+   </book>
+   <book id="bk111">
+      <author>O'Brien, Tim</author>
+      <title>MSXML3: A Comprehensive Guide</title>
+      <genre>Computer</genre>
+      <price>36.95</price>
+      <publish_date>2000-12-01</publish_date>
+      <description>The Microsoft MSXML3 parser is covered in
+      detail, with attention to XML DOM interfaces, XSLT processing,
+      SAX and more.</description>
+   </book>
+   <book id="bk112">
+      <author>Galos, Mike</author>
+      <title>Visual Studio 7: A Comprehensive Guide</title>
+      <genre>Computer</genre>
+      <price>49.95</price>
+      <publish_date>2001-04-16</publish_date>
+      <description>Microsoft Visual Studio 7 is explored in depth,
+      looking at how Visual Basic, Visual C++, C#, and ASP+ are
+      integrated into a comprehensive development
+      environment.</description>
+   </book>
+</catalog>
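Because this catalog has a fixed, regular shape, a deterministic parse can serve as a baseline when checking what an LLM-based extraction returns for the same input. The sketch below uses only the standard library and a two-book excerpt of the file above. `titles_by_genre` is a hypothetical helper and is not part of this commit or of ScrapeGraph-AI.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two-book excerpt of the catalog, with descriptions omitted for brevity.
SAMPLE_CATALOG = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
   </book>
</catalog>"""


def titles_by_genre(xml_text):
    """Group book titles by genre, preserving document order within each genre."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for book in root.findall("book"):
        genre = book.findtext("genre", default="Unknown")
        title = book.findtext("title", default="")
        grouped[genre].append(title)
    return dict(grouped)


print(titles_by_genre(SAMPLE_CATALOG))
```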
Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
+<body class="fixed-top-nav " style="padding-top: 57px;">
+<header>
+<nav id="navbar" class="navbar navbar-light navbar-expand-sm fixed-top">
+<div class="container">
+<a class="navbar-brand title font-weight-lighter" href="/"><span class="font-weight-bold">Marco&nbsp;</span>Perini</a> <button class="navbar-toggler collapsed ml-auto" type="button" data-toggle="collapse" data-target="#navbarNav" aria-controls="navbarNav" aria-expanded="false" aria-label="Toggle navigation"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar top-bar"></span> <span class="icon-bar middle-bar"></span> <span class="icon-bar bottom-bar"></span> </button>
+<div class="collapse navbar-collapse text-right" id="navbarNav">
+<ul class="navbar-nav ml-auto flex-nowrap">
+<li class="nav-item "> <a class="nav-link" href="/">About</a> </li>
+<li class="nav-item dropdown active">
+<a class="nav-link dropdown-toggle" href="#" id="navbarDropdown" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">Projects<span class="sr-only">(current)</span></a>
+<div class="dropdown-menu dropdown-menu-right" aria-labelledby="navbarDropdown">
+<a class="dropdown-item" href="/projects/">Projects</a>
+<div class="dropdown-divider"></div>
+<a class="dropdown-item" href="/competitions/">Competitions</a>
+</div>
+</li>
+<li class="nav-item "> <a class="nav-link" href="/cv/">CV</a> </li>
+<li class="toggle-container"> <button id="light-toggle" title="Change theme"> <i class="fa-solid fa-moon"></i> <i class="fa-solid fa-sun"></i> </button> </li>
+</ul>
+</div>
+</div>
+</nav>
+<progress id="progress" value="0" max="284" style="top: 57px;">
+<div class="progress-container"> <span class="progress-bar"></span> </div>
+</progress>
+</header>
+<div class="container mt-5">
+<div class="post">
+<header class="post-header">
+<h1 class="post-title">Projects</h1>
+<p class="post-description"></p>
+</header>
+<article>
+<div class="projects">
+<div class="grid" style="position: relative; height: 861.992px;">
+<div class="grid-sizer"></div>
+<div class="grid-item" style="position: absolute; left: 0px; top: 0px;">
+<a href="/projects/rotary-pendulum-rl/">
+<div class="card hoverable">
+<figure>
+<picture> <img src="/assets/img/rotary_pybullet.jpg" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
+</figure>
+<div class="card-body">
+<h4 class="card-title">Rotary Pendulum RL</h4>
+<p class="card-text">Open Source project aimed at controlling a real life rotary pendulum using RL algorithms</p>
+<div class="row ml-1 mr-1 p-0"> </div>
+</div>
+</div>
+</a>
+</div>
+<div class="grid-sizer"></div>
+<div class="grid-item" style="position: absolute; left: 260px; top: 0px;">
+<a href="https://github.com/PeriniM/DQN-SwingUp" rel="external nofollow noopener" target="_blank">
+<div class="card hoverable">
+<figure>
+<picture> <img src="/assets/img/value-policy-heatmaps.jpg" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
+</figure>
+<div class="card-body">
+<h4 class="card-title">DQN Implementation from scratch</h4>
+<p class="card-text">Developed a Deep Q-Network algorithm to train a simple and double pendulum</p>
+<div class="row ml-1 mr-1 p-0"> </div>
+</div>
+</div>
+</a>
+</div>
+<div class="grid-sizer"></div>
+<div class="grid-item" style="position: absolute; left: 0px; top: 447.414px;">
+<a href="https://github.com/PeriniM/Multi-Agents-HAED" rel="external nofollow noopener" target="_blank">
+<div class="card hoverable">
+<figure>
+<picture> <img src="/assets/img/multi_agents_haed.gif" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
+</figure>
+<div class="card-body">
+<h4 class="card-title">Multi Agents HAED</h4>
+<p class="card-text">University project which focuses on simulating a multi-agent system to perform environment mapping. Agents, equipped with sensors, explore and record their surroundings, considering uncertainties in their readings.</p>
+<div class="row ml-1 mr-1 p-0"> </div>
+</div>
+</div>
+</a>
+</div>
+<div class="grid-sizer"></div>
+<div class="grid-item" style="position: absolute; left: 260px; top: 370.172px;">
+<a href="/projects/wireless-esc-drone/">
+<div class="card hoverable">
+<figure>
+<picture> <img src="/assets/img/wireless_esc.gif" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
+</figure>
+<div class="card-body">
+<h4 class="card-title">Wireless ESC for Modular Drones</h4>
+<p class="card-text">Modular drone architecture proposal and proof of concept. The project received maximum grade.</p>
+<div class="row ml-1 mr-1 p-0"> </div>
+</div>
+</div>
+</a>
+</div>
+</div>
+</div>
+</article>
+</div>
+</div>
+<footer class="fixed-bottom">
+<div class="container mt-0"> © Copyright 2023 Marco Perini. Powered by <a href="https://jekyllrb.com/" target="_blank" rel="external nofollow noopener">Jekyll</a> with <a href="https://github.com/alshedivat/al-folio" rel="external nofollow noopener" target="_blank">al-folio</a> theme. Hosted by <a href="https://pages.github.com/" target="_blank" rel="external nofollow noopener">GitHub Pages</a>. </div>
+</footer>
+<div class="hiddendiv common"></div>
+</body>

examples/local_models/Ollama/readme.md

Whitespace-only changes.
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+"""
+Basic example of scraping pipeline using SmartScraper from text
+"""
+
+import os
+from scrapegraphai.graphs import SmartScraperGraph
+from scrapegraphai.utils import convert_to_csv, convert_to_json
+
+# ************************************************
+# Read the text file
+# ************************************************
+
+FILE_NAME = "inputs/plain_html_example.txt"
+curr_dir = os.path.dirname(os.path.realpath(__file__))
+file_path = os.path.join(curr_dir, FILE_NAME)
+
+# The text could also come from an HTTP request, e.g. via the requests module
+with open(file_path, 'r', encoding="utf-8") as file:
+    text = file.read()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "model": "ollama/mistral",
+        "temperature": 0,
+        "format": "json",  # Ollama needs the format to be specified explicitly
+        # "model_tokens": 2000, # set context length arbitrarily
+        "base_url": "http://localhost:11434",
+    },
+    "embeddings": {
+        "model": "ollama/nomic-embed-text",
+        "temperature": 0,
+        "base_url": "http://localhost:11434",
+    }
+}
+
+# ************************************************
+# Create the SmartScraperGraph instance and run it
+# ************************************************
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="List me all the news with their description.",
+    source=text,
+    config=graph_config
+)
+
+result = smart_scraper_graph.run()
+print(result)
+
+# Save to json or csv
+convert_to_csv(result, "result")
+convert_to_json(result, "result")
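The script ends by handing `result` to `convert_to_csv` and `convert_to_json`, whose internals are not part of this diff. For readers who want to see what such a save step can look like, here is a standard-library sketch. It is not the library's implementation: `save_rows` is a hypothetical helper, and it assumes the result has already been reduced to a list of flat dictionaries, which may not match the shape that `run()` returns.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_rows(rows, stem, out_dir="."):
    """Write a list of flat dicts to <stem>.json and <stem>.csv in out_dir."""
    out = Path(out_dir)
    json_path = out / f"{stem}.json"
    csv_path = out / f"{stem}.csv"

    json_path.write_text(
        json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # Use the union of keys, in first-seen order, as the CSV header.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # missing keys are written as empty cells
    return json_path, csv_path


# Demonstration in a temporary directory, using made-up rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_rows = [
        {"title": "Rotary Pendulum RL", "description": "RL control of a pendulum"},
        {"title": "Multi Agents HAED"},
    ]
    written_json, written_csv = save_rows(example_rows, "result", tmp_dir)
    print(written_json.name, written_csv.name)
```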
