
Commit a12f9f6

Reviewed and edited doc updates (#3592)
1 parent b52da1e commit a12f9f6

7 files changed: +105 -103 lines changed


docs/tutorials/features/FSDP.md

Lines changed: 16 additions & 18 deletions
@@ -3,14 +3,14 @@ Fully Sharded Data Parallel (FSDP)
## Introduction

- `Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provide industry-grade solution for large model training. FSDP is a type of data parallel training, unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states and gradients across DDP ranks to reduce the GPU memory footprint used in training, this makes the training of some large-scale models feasible. Please refer to [FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) for an introduction to FSDP.
+ `Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provides an industry-grade solution for large model training. FSDP is a type of data parallel training. Unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states, and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible. Please refer to the [FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) for an introduction to FSDP.

- To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing collectives as `AllGather`, `ReduceScatter` etc. needed by FSDP. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
- The Installation steps of Intel® oneCCL Bindings for Pytorch\* follows the same steps as DDP.
+ To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training that implements collectives such as `AllGather` and `ReduceScatter` needed by FSDP. Refer to the [oneCCL GitHub page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
+ To install Intel® oneCCL Bindings for Pytorch\*, follow the same installation steps as for DDP.

## FSDP Usage (GPU only)

- FSDP follows its usage in PyTorch. To use FSDP with Intel® Extension for PyTorch\*, make the following modifications to your model script:
+ FSDP is designed to align with PyTorch conventions. To use FSDP with Intel® Extension for PyTorch\*, make the following modifications to your model script:

1. Import the necessary packages.
```python
@@ -25,7 +25,7 @@ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
dist.init_process_group(backend='ccl')
```

- 3. For FSDP with each process exclusively works on a single GPU, set the device ID as `local rank`.
+ 3. For FSDP where each process works exclusively on a single GPU, set the device ID to the `local rank`.
```python
torch.xpu.set_device("xpu:{}".format(rank))
# or
@@ -39,13 +39,13 @@ model = model.to(device)
model = FSDP(model, device_id=device)
```

- Note: for FSDP with XPU, you need to specify `device_ids` with XPU device, otherwise it will trigger the CUDA path and throw error.
+ **Note**: for FSDP with XPU, you need to specify `device_ids` with an XPU device; otherwise, it will trigger the CUDA path and throw an error.

- ## Example Usage:
+ ## Example

- Here's an example based on [PyTorch FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) to illustrate the usage of FSDP on XPU and the necessary changes from a CUDA case to XPU case.
+ Here's an example based on the [PyTorch FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) to illustrate the usage of FSDP on XPU and the changes needed to switch from a CUDA case to an XPU case.

- 1. Import necessary packages
+ 1. Import necessary packages:

```python
"""
@@ -82,7 +82,7 @@ from torch.distributed.fsdp.wrap import (
)
```

- 2. Distributed training setup
+ 2. Set up distributed training:

```python
"""
@@ -99,7 +99,7 @@ def cleanup():
dist.destroy_process_group()
```

- 3. Define the toy model for handwritten digit classification.
+ 3. Define the toy model for handwritten digit classification:

```python
class Net(nn.Module):
@@ -129,7 +129,7 @@ class Net(nn.Module):
return output
```

- 4. Define a train function
+ 4. Define a training function:

```python
"""
@@ -156,7 +156,7 @@ def train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler
print('Train Epoch: {} \tLoss: {:.6f}'.format(epoch, ddp_loss[0] / ddp_loss[1]))
```

- 5. Define a validation function
+ 5. Define a validation function:

```python
"""
@@ -185,7 +185,7 @@ def test(model, rank, world_size, test_loader):
100. * ddp_loss[1] / ddp_loss[2]))
```

- 6. Define a distributed train function that wraps the model in FSDP
+ 6. Define a distributed training function that wraps the model in FSDP:

```python
"""
@@ -256,7 +256,7 @@ def fsdp_main(rank, world_size, args):
cleanup()
```

- 7. Finally parse the arguments and set the main function
+ 7. Finally, parse the arguments and set the main function:

```python
"""
@@ -292,9 +292,7 @@ if __name__ == '__main__':
join=True)
```

- 8. Running command
-
- Put the above code snippets to a python script “FSDP_mnist_xpu.py”, and run:
+ 8. Put the above code snippets into a Python script `FSDP_mnist_xpu.py`, and run:

```bash
python FSDP_mnist_xpu.py
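
A minimal sketch of the XPU-specific setup used throughout the steps above, assuming `rank` and `world_size` come from the usual `mp.spawn` launcher, `MASTER_ADDR`/`MASTER_PORT` are set as in the tutorial's `setup()`, and the module import names follow the packages referenced above:

```python
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  # registers the 'xpu' device
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the 'ccl' backend

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def setup_fsdp_model(model, rank, world_size):
    # Use the oneCCL backend instead of 'nccl' (the CUDA path).
    dist.init_process_group(backend='ccl', rank=rank, world_size=world_size)

    # Bind this process to its own XPU device.
    torch.xpu.set_device("xpu:{}".format(rank))
    device = torch.device("xpu", rank)

    # Passing an XPU device as device_id keeps FSDP off the CUDA path.
    model = model.to(device)
    return FSDP(model, device_id=device)
```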

docs/tutorials/features/float8.md

Lines changed: 5 additions & 5 deletions
@@ -1,21 +1,21 @@
- Float8 datatype support [GPU] (Experimental)
+ Float8 Data Type Support [GPU] (Experimental)
============================================

- ## Float8 DataType
+ ## Float8 Data Type

- Float8 (FP8) is 8-bit floating point which is used to reduce memory footprint, improve the computation efficiency and save power in Deep Learning domain.
+ Float8 (FP8) is an 8-bit floating point data type, used to reduce memory footprint, improve computation efficiency, and save power in the Deep Learning domain.

Two formats are used in FP8 training and inference, in order to meet the required value range and precision of activation, weight and gradient in Deep Neural Network (DNN). One is E4M3 (sign-exponent-mantissa) for activation and weight, the other is E5M2 for gradients. These two formats are defined in [FP8 FORMATS FOR DEEP LEARNING](https://arxiv.org/pdf/2209.05433.pdf).

- FP8 data type is used for memory storage only in current stage. It will be converted to BFloat16 data type for computation.
+ The FP8 data type is used for memory storage only at the current stage. It will be converted to the BFloat16 data type for computation.

## FP8 Quantization

On GPU, online Dynamic Quantization is used for FP8 data compression and decompression. The Delayed Scaling algorithm is used to accelerate the quantization process.

## Supported running mode

- Both DNN Training and Inference are supported with FP8 data type.
+ Both DNN Training and Inference are supported with the FP8 data type.

## Supported operators
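
To make the two formats concrete, here is a small illustration (plain Python arithmetic, not an extension API) of the value ranges implied by the E4M3 and E5M2 bit layouts defined in the paper referenced above:

```python
# E4M3: 4 exponent bits (bias 7), 3 mantissa bits. The all-ones exponent with
#       all-ones mantissa is reserved for NaN, so the largest finite value is
#       (1 + 6/8) * 2**8.
e4m3_max = (1 + 6 / 8) * 2 ** 8     # 448.0   -> activations and weights
# E5M2: 5 exponent bits (bias 15), 2 mantissa bits, IEEE-like Inf/NaN encodings,
#       so the largest finite value is (1 + 3/4) * 2**15.
e5m2_max = (1 + 3 / 4) * 2 ** 15    # 57344.0 -> gradients (wider dynamic range)

print(f"E4M3 max finite value: {e4m3_max}")
print(f"E5M2 max finite value: {e5m2_max}")
```

The wider exponent of E5M2 covers the large dynamic range of gradients, while the extra mantissa bit of E4M3 gives activations and weights more precision.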

docs/tutorials/features/int4.md

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
- INT4 inference [GPU] (Experimentatal)
+ INT4 Inference [GPU] (Experimental)
=====================================

- ## INT4 DataType
+ ## INT4 Data Type

- INT4 is 4-bit fixed point which is used to reduce memory footprint, improve the computation efficiency and save power in Deep Learning domain.
+ INT4 is a 4-bit fixed point data type, used to reduce memory footprint, improve computation efficiency, and save power in the Deep Learning domain.

The INT4 data type is currently used in weight-only quantization. It will be converted to the Float16 data type for computation.
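
As a conceptual illustration (not an extension API) of why weight-only INT4 halves the storage of INT8, the sketch below packs two signed 4-bit values into each byte and dequantizes them back to Float16 with a per-tensor scale:

```python
import torch


def pack_int4(w: torch.Tensor) -> torch.Tensor:
    """Pack pairs of signed 4-bit values (range [-8, 7]) into one uint8 each."""
    nibbles = w.to(torch.int16) & 0x0F            # two's-complement nibbles
    return (nibbles[..., 0::2] | (nibbles[..., 1::2] << 4)).to(torch.uint8)


def unpack_to_fp16(packed: torch.Tensor, scale: float) -> torch.Tensor:
    """Unpack INT4 weights and dequantize to Float16 for computation."""
    p = packed.to(torch.int16)
    low, high = p & 0x0F, (p >> 4) & 0x0F
    # Restore the sign of each 4-bit value (two's complement).
    low = torch.where(low > 7, low - 16, low)
    high = torch.where(high > 7, high - 16, high)
    ints = torch.stack([low, high], dim=-1).flatten(-2)
    return ints.to(torch.float16) * scale


w = torch.tensor([3, -2, 7, -8], dtype=torch.int8)
packed = pack_int4(w)                          # 2 bytes instead of 4
restored = unpack_to_fp16(packed, scale=0.05)  # Float16 values used for compute
```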

docs/tutorials/features/int8_overview_xpu.md

Lines changed: 5 additions & 5 deletions
@@ -1,9 +1,9 @@
- Intel® Extension for PyTorch\* optimizations for quantization [GPU]
+ Intel® Extension for PyTorch\* Optimizations for Quantization [GPU]
===================================================================

- Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This tutorial illustrates the work flow of quantization on Intel GPUs.
+ Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This section illustrates the quantization workflow on Intel GPUs.

- The overall view is that our usage follows the API defined in official PyTorch. Therefore, only small modification like moving model and data to GPU with to('xpu') is required. We highly recommend using the TorchScript for quantizing models. With graph model created via TorchScript, optimization like operator fusion (e.g. `conv_relu`) would be enabled automatically. This would deliver the best performance for int8 workloads.
+ The overall view is that our usage follows the API defined in official PyTorch. Therefore, only small modifications, such as moving the model and data to the GPU with `to('xpu')`, are required. We highly recommend using TorchScript for quantizing models. With a graph model created via TorchScript, optimizations like operator fusion (e.g. `conv_relu`) are enabled automatically. This delivers the best performance for int8 workloads.

## Imperative Mode
```python
@@ -91,11 +91,11 @@ modelJit(inference_data)
print(modelJit.graph_for(inference_data))
```

- We need define QConfig for TorchScript module, use `prepare_jit` for inserting observer and use `convert_jit` for replacing FP32 modules.
+ We need to define a `QConfig` for the TorchScript module, use `prepare_jit` to insert observers, and use `convert_jit` to replace FP32 modules.

Before `prepare_jit`, create a ScriptModule using `torch.jit.script` or `torch.jit.trace`. `jit.trace` is recommended, as it is capable of capturing the whole graph in most scenarios.

- Fusion ops like conv_unary, conv_binary, linear_unary (e.g. `conv_relu`, `conv_sum_relu`) are automatically enabled after model conversion (`convert_jit`). A warmup stage is required for bringing the fusion into effect. With the benefit from fusion, ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule as for performance consideration.
+ Fusion operations like `conv_unary`, `conv_binary`, and `linear_unary` (e.g. `conv_relu`, `conv_sum_relu`) are automatically enabled after model conversion (`convert_jit`). A warmup stage is required to bring the fusion into effect. With the benefit of fusion, a ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule for performance reasons.

`modelJit.graph_for(input)` is useful to dump the inference graph and other graph-related information for performance analysis.
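
For reference, a minimal sketch of this TorchScript flow, assuming `model`, `calib_dataset`, and `inference_data` already live on the `xpu` device, that `prepare_jit`/`convert_jit` come from `torch.quantization.quantize_jit`, and that the observer choices below are just one possible `QConfig`:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the 'xpu' device
from torch.quantization.quantize_jit import prepare_jit, convert_jit

qconfig = torch.quantization.QConfig(
    activation=torch.quantization.observer.MinMaxObserver.with_args(
        qscheme=torch.per_tensor_symmetric, reduce_range=False, dtype=torch.quint8),
    weight=torch.quantization.default_weight_observer)

# Create a ScriptModule first; jit.trace captures the whole graph in most cases.
modelJit = torch.jit.trace(model, inference_data)

# Insert observers, run calibration, then replace FP32 modules with int8 ones.
modelJit = prepare_jit(modelJit, {'': qconfig}, True)
for data in calib_dataset:
    modelJit(data)
modelJit = convert_jit(modelJit, True)

# Warmup iterations bring fusions such as conv_relu into effect.
for _ in range(3):
    modelJit(inference_data)
print(modelJit.graph_for(inference_data))
```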

docs/tutorials/features/profiler_kineto.md

Lines changed: 2 additions & 2 deletions
@@ -3,15 +3,15 @@ Kineto Supported Profiler Tool (Experimental)
## Introduction

- The Kineto supported profiler tool is an extension of PyTorch\* profiler for profiling operators' executing time cost on GPU devices. With this tool, users can get information in many fields of the run models or code scripts. User should build Intel® Extension for PyTorch\* with Kineto support as default and enable this tool by a `with` statement before the code segment.
+ The Kineto supported profiler tool is an extension of the PyTorch\* profiler for profiling the execution time of operators on GPU devices. With this tool, you can get profiling information about many aspects of the models or code scripts you run. Intel® Extension for PyTorch\* is built with Kineto support by default; enable this tool by wrapping the code segment of interest in a `with` statement.

## Use Case

To use the Kineto supported profiler tool, you need to build Intel® Extension for PyTorch\* from source or install it via a prebuilt wheel. You also have various methods to disable this tool.

### Build Tool

- The build option `USE_KINETO` is switched on as default but you can switch it off via setting `USE_KINETO=OFF` while building Intel® Extension for PyTorch\* from source. Besides, an affiliated build option `USE_ONETRACE` will be automatically switched on following the build option `USE_KINETO`. With `USE_KINETO=OFF`, no Kineto related profiler code will be compiled and all python scripts using Kineto supported profiler with XPU backend will not work. In this case, you can still keep using profiler on CPU backend.
+ The build option `USE_KINETO` is switched on by default, but you can switch it off by setting `USE_KINETO=OFF` while building Intel® Extension for PyTorch\* from source. In addition, an affiliated build option `USE_ONETRACE` is automatically switched on following the build option `USE_KINETO`. With `USE_KINETO=OFF`, no Kineto-related profiler code is compiled, and Python scripts using the Kineto supported profiler with the XPU backend will not work. In this case, you can still keep using the profiler on the CPU backend.

Some affiliated build options are defined for choosing different tracing tools. Currently, only the onetrace tool is supported. Configuring `USE_KINETO=ON` and `USE_ONETRACE=OFF` will not enable Kineto support in Intel® Extension for PyTorch\* on GPU.
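
For illustration, enabling the tool around a code segment might look like the sketch below. This assumes a build with Kineto support and that an XPU activity type and the `self_xpu_time_total` sort key are exposed by the profiler; exact names may differ across versions:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the 'xpu' device
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(64, 64).to("xpu")
data = torch.randn(32, 64, device="xpu")

# Wrap only the code segment to be profiled in the `with` statement.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    model(data)
    torch.xpu.synchronize()

print(prof.key_averages().table(sort_by="self_xpu_time_total"))
```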

docs/tutorials/features/torch_compile_gpu.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ torch.compile for GPU
Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile) API through the default "inductor" backend ([TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747/1)). The Triton compiler has been the core of the Inductor codegen, supporting various accelerator devices. Intel has extended TorchInductor by adding Intel GPU support to Triton. Additionally, post-op fusions for convolution and matrix multiplication, facilitated by oneDNN fusion kernels, contribute to enhanced efficiency for computationally intensive operations. Leveraging these features is as simple as using the default "inductor" backend, making it easier than ever to unlock the full potential of your PyTorch models on Intel GPU platforms.

- `torch.compile` for GPU is an experimental feature and available from 2.1.10. So far, the feature is functional on Intel® GPU Max Series.
+ **Note**: `torch.compile` for GPU is an experimental feature, available from 2.1.10. So far, the feature is functional on the Intel® GPU Max Series.

### Inference with torch.compile
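
A minimal sketch of the usage described above (the model is a stand-in; any `torch.nn.Module` moved to `"xpu"` is handled the same way):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the 'xpu' device

# A stand-in model; replace it with your own module.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).to("xpu")

# The default "inductor" backend is used; no extra arguments are needed.
compiled_model = torch.compile(model)

x = torch.randn(64, 128, device="xpu")
with torch.no_grad():
    y = compiled_model(x)  # the first call triggers compilation; later calls reuse it
```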
