Commit 3e4d816

Add documentation about MS Sentinel data source
1 parent ffd9567 commit 3e4d816

1 file changed: README.md (+67 -2 lines)

@@ -20,7 +20,7 @@ Based on [PySpark DataSource API](https://spark.apache.org/docs/preview/api/pyth

## Splunk data source

-Right now only implements writing to Splunk - both batch & streaming. Registered data source name is `splunk`.
+Right now only implements writing to [Splunk](https://www.splunk.com/) - both batch & streaming. Registered data source name is `splunk`.

By default, this data source will put all columns into the `event` object and send it to Splunk together with metadata (`index`, `source`, ...). This behavior could be changed by providing the `single_event_column` option to specify which string column should be used as the single value of `event`.
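
For illustration (not part of this commit), a minimal sketch of how the `single_event_column` option might be used. It assumes the `splunk` data source is already registered, a DataFrame `df` with a string column named `raw` that already holds the full event payload, and an HEC token in `hec_token`:

```python
# Hypothetical example: send only the pre-formatted `raw` string column as the
# Splunk `event` value instead of packing all columns into an object.
df.write.format("splunk") \
    .option("url", "http://localhost:8088/services/collector/event") \
    .option("token", hec_token) \
    .option("single_event_column", "raw") \
    .save()
```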

@@ -63,7 +63,7 @@ stream = sdf.writeStream.format("splunk") \

Supported options:

- `url` (string, required) - URL of the Splunk HTTP Event Collector (HEC) endpoint to send data to. For example, `http://localhost:8088/services/collector/event`.
- `token` (string, required) - HEC token to [authenticate to the HEC endpoint](https://docs.splunk.com/Documentation/Splunk/9.3.1/Data/FormateventsforHTTPEventCollector#HTTP_authentication).
- `index` (string, optional) - name of the Splunk index to send data to. If omitted, the default index configured for the HEC endpoint is used.
- `source` (string, optional) - the source value to assign to the event data.

@@ -75,6 +75,71 @@ Supported options:
- `remove_indexed_fields` (boolean, optional, default: `false`) - if indexed fields should be removed from the `event` object.
- `batch_size` (int, optional, default: 50) - the size of the buffer to collect payload before sending to Splunk.

## Microsoft Sentinel / Azure Monitor

Right now only implements writing to [Microsoft Sentinel](https://learn.microsoft.com/en-us/azure/sentinel/overview/) - both batch & streaming. Registered data source name is `ms-sentinel`. The integration uses the [Logs Ingestion API of Azure Monitor](https://learn.microsoft.com/en-us/azure/sentinel/create-custom-connector#connect-with-the-log-ingestion-api), so it's also exposed as `azure-monitor`.

To push data you need to create a Data Collection Endpoint (DCE), a Data Collection Rule (DCR), and a custom table in a Log Analytics workspace. See the [documentation](https://learn.microsoft.com/en-us/azure/azure-monitor/logs/logs-ingestion-api-overview) for a description of this process. The structure of the data in the DataFrame should match the structure of the defined custom table.

This connector uses an Azure Service Principal Client ID/Secret for authentication - you need to grant the correct permissions (`Monitoring Metrics Publisher`) to the service principal on the DCE and DCR.
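
For reference, a minimal sketch of what a direct call to the Logs Ingestion API with a service principal looks like. It is illustrative only - it uses the `azure-identity` and `azure-monitor-ingestion` packages directly rather than this connector - and the variable names simply mirror the options used in the examples below:

```python
from azure.identity import ClientSecretCredential
from azure.monitor.ingestion import LogsIngestionClient

# Authenticate as the service principal that has the
# `Monitoring Metrics Publisher` role on the DCE and DCR.
credential = ClientSecretCredential(
    tenant_id=tenant_id,
    client_id=client_id,
    client_secret=client_secret,
)

client = LogsIngestionClient(endpoint=dc_endpoint, credential=credential)

# Rows must match the structure of the custom table behind the DCR stream;
# the field names below are placeholders.
client.upload(
    rule_id=dc_rule_id,
    stream_name=dc_stream_name,
    logs=[{"TimeGenerated": "2024-01-01T00:00:00Z", "Message": "test event"}],
)
```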

Batch usage:

```python
from cyber_connectors import *
spark.dataSource.register(MicrosoftSentinelDataSource)

sentinel_options = {
    "dce": dc_endpoint,
    "dcr_id": dc_rule_id,
    "dcs": dc_stream_name,
    "tenant_id": tenant_id,
    "client_id": client_id,
    "client_secret": client_secret,
}

df = spark.range(10)
df.write.format("ms-sentinel") \
    .mode("overwrite") \
    .options(**sentinel_options) \
    .save()
```

Streaming usage:

```python
from cyber_connectors import *
spark.dataSource.register(MicrosoftSentinelDataSource)

dir_name = "tests/samples/json/"
bdf = spark.read.format("json").load(dir_name)  # to infer the schema - do not use in production!

sdf = spark.readStream.format("json").schema(bdf.schema).load(dir_name)

sentinel_stream_options = {
    "dce": dc_endpoint,
    "dcr_id": dc_rule_id,
    "dcs": dc_stream_name,
    "tenant_id": tenant_id,
    "client_id": client_id,
    "client_secret": client_secret,
    "checkpointLocation": "/tmp/sentinel-checkpoint/"
}

stream = sdf.writeStream.format("ms-sentinel") \
    .trigger(availableNow=True) \
    .options(**sentinel_stream_options).start()
```
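
As the comment above notes, inferring the schema from a batch read is only a convenience for testing; in production the schema would normally be declared explicitly. A minimal sketch, with placeholder field names that must be replaced by the columns of the actual custom table:

```python
from pyspark.sql.types import StructField, StructType, StringType, TimestampType

# Placeholder schema - the real one must match the custom table defined for the DCR.
event_schema = StructType([
    StructField("TimeGenerated", TimestampType(), True),
    StructField("Message", StringType(), True),
])

sdf = spark.readStream.format("json").schema(event_schema).load(dir_name)
```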

Supported options:

- `dce` (string, required) - URL of the Data Collection Endpoint.
- `dcr_id` (string, required) - ID of the Data Collection Rule.
- `dcs` (string, required) - name of the custom table created in the Log Analytics workspace.
- `tenant_id` (string, required) - Azure Tenant ID.
- `client_id` (string, required) - Application ID (client ID) of the Azure Service Principal.
- `client_secret` (string, required) - Client Secret of the Azure Service Principal.
- `batch_size` (int, optional, default: 50) - the size of the buffer to collect payload before sending to MS Sentinel.

## Simple REST API

Right now only implements writing to an arbitrary REST API - both batch & streaming. Registered data source name is `rest`.