`docs/tutorials/features/FSDP.md` (16 additions, 18 deletions)
@@ -3,14 +3,14 @@ Fully Sharded Data Parallel (FSDP)
## Introduction
- `Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provide industry-grade solution for large model training. FSDP is a type of data parallel training, unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states and gradients across DDP ranks to reduce the GPU memory footprint used in training, this makes the training of some large-scale models feasible. Please refer to [FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) for an introduction to FSDP.
+ `Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provides an industry-grade solution for large model training. FSDP is a type of data parallel training. Unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states, and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible. Please refer to the [FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) for an introduction to FSDP.
- To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing collectives as `AllGather`, `ReduceScatter` etc. needed by FSDP. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
- The Installation steps of Intel® oneCCL Bindings for Pytorch\* follows the same steps as DDP.
+ To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing collectives such as `AllGather`, `ReduceScatter`, and others needed by FSDP. Refer to the [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
+ To install Intel® oneCCL Bindings for Pytorch\*, follow the same installation steps as for DDP.
## FSDP Usage (GPU only)
- FSDP follows its usage in PyTorch. To use FSDP with Intel® Extension for PyTorch\*, make the following modifications to your model script:
+ FSDP is designed to align with PyTorch conventions. To use FSDP with Intel® Extension for PyTorch\*, make the following modifications to your model script:
1. Import the necessary packages.
```python
@@ -25,7 +25,7 @@ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
dist.init_process_group(backend='ccl')
```
- 3. For FSDP with each process exclusively works on a single GPU, set the device ID as `local rank`.
+ 3. For FSDP where each process works exclusively on a single GPU, set the device ID to the `local rank`.
```python
torch.xpu.set_device("xpu:{}".format(rank))
# or
@@ -39,13 +39,13 @@ model = model.to(device)
model = FSDP(model, device_id=device)
```
- Note: for FSDP with XPU, you need to specify `device_ids` with XPU device, otherwise it will trigger the CUDA path and throw error.
+ **Note**: for FSDP with XPU, you need to specify `device_ids` with an XPU device; otherwise, it will trigger the CUDA path and throw an error.
- ## Example Usage:
+ ## Example
- Here's an example based on [PyTorch FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) to illustrate the usage of FSDP on XPU and the necessary changes from a CUDA case to XPU case.
+ Here's an example based on [PyTorch FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) to illustrate the usage of FSDP on XPU and the necessary changes to switch from CUDA to an XPU case.
- 1. Import necessary packages
+ 1. Import necessary packages:
```python
"""
@@ -82,7 +82,7 @@ from torch.distributed.fsdp.wrap import (
)
```
- 2.Distributed training setup
+ 2. Set up distributed training:
```python
"""
@@ -99,7 +99,7 @@ def cleanup():
dist.destroy_process_group()
```
- 3. Define the toy model for handwritten digit classification.
+ 3. Define the toy model for handwritten digit classification:
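The toy model definition itself falls outside the diff hunks shown above. For reference, a minimal handwritten-digit classifier of the kind used in the PyTorch FSDP tutorial looks roughly like the sketch below; the class name and layer sizes are illustrative and not taken from the changed file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Small CNN for 28x28 grayscale digits (illustrative toy model)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)  # 64 channels * 12 * 12 after conv + pooling
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        return F.log_softmax(self.fc2(x), dim=1)
```

In the XPU case, the wrapping shown earlier applies unchanged: move the model to `xpu:{rank}` and pass that device to `FSDP(model, device_id=device)`.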
`docs/tutorials/features/float8.md` (5 additions, 5 deletions)
@@ -1,21 +1,21 @@
- Float8 datatype support[GPU] (Experimental)
+ Float8 Data Type Support [GPU] (Experimental)
============================================
- ## Float8 DataType
+ ## Float8 Data Type
- Float8 (FP8) is 8-bit floating point which is used to reduce memory footprint, improve the computation efficiency and save power in Deep Learning domain.
+ Float8 (FP8) is an 8-bit floating point data type, which is used to reduce memory footprint, improve computation efficiency, and save power in the Deep Learning domain.
Two formats are used in FP8 training and inference, in order to meet the required value range and precision of activation, weight and gradient in Deep Neural Network (DNN). One is E4M3 (sign-exponent-mantissa) for activation and weight, the other is E5M2 for gradients. These two formats are defined in [FP8 FORMATS FOR DEEP LEARNING](https://arxiv.org/pdf/2209.05433.pdf).
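To make the range/precision trade-off concrete, the short sketch below computes the largest finite values the two layouts can represent, following the E4M3/E5M2 definitions in the paper cited above; the variable names are ours, and nothing here uses the Intel® Extension for PyTorch\* API.

```python
# E4M3: 4 exponent bits (bias 7), 3 mantissa bits. Only the all-ones
# exponent + all-ones mantissa pattern is reserved (NaN), so the largest
# finite value uses exponent 0b1111 and mantissa 0b110.
e4m3_max = (1 + 1/2 + 1/4) * 2 ** (15 - 7)    # 448.0

# E5M2: 5 exponent bits (bias 15), 2 mantissa bits. The all-ones exponent is
# reserved for inf/NaN (IEEE-style), so the largest finite value uses
# exponent 0b11110 and mantissa 0b11.
e5m2_max = (1 + 1/2 + 1/4) * 2 ** (30 - 15)   # 57344.0

print(f"E4M3 max: {e4m3_max}")   # more mantissa bits -> finer precision, narrower range
print(f"E5M2 max: {e5m2_max}")   # more exponent bits -> wider range for gradients
```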
- FP8 data type is used for memory storage only in current stage. It will be converted to BFloat16 data type for computation.
+ The FP8 data type is used for memory storage only at the current stage. It will be converted to the BFloat16 data type for computation.
## FP8 Quantization
On GPU, online Dynamic Quantization is used for FP8 data compression and decompression. The Delayed Scaling algorithm is used to accelerate the quantization process.
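The Delayed Scaling idea is that the scaling factor for the current step is derived from an amax (absolute-maximum) history recorded in previous steps, so quantization does not have to wait for the current tensor's statistics. A minimal sketch of that idea in plain PyTorch is shown below; it illustrates the algorithm and is not the extension's internal implementation.

```python
import torch

E4M3_MAX = 448.0  # largest finite E4M3 value (see above)

class DelayedScaler:
    """Toy delayed scaling: the scale for step t uses the amax history from steps < t."""

    def __init__(self, history_len: int = 16):
        self.amax_history = torch.ones(history_len)  # start with a neutral history
        self.step = 0

    def scale(self) -> torch.Tensor:
        # Derived from previously recorded amax values, not the current tensor.
        amax = self.amax_history.max().clamp(min=1e-12)
        return E4M3_MAX / amax

    def update(self, tensor: torch.Tensor) -> None:
        # Record the current tensor's amax for use in later steps.
        self.amax_history[self.step % len(self.amax_history)] = tensor.abs().max()
        self.step += 1

scaler = DelayedScaler()
for _ in range(4):
    activations = torch.randn(1024) * 3.0
    s = scaler.scale()                                          # uses history only
    fp8_payload = (activations * s).clamp(-E4M3_MAX, E4M3_MAX)  # would be cast to FP8 storage
    restored = fp8_payload / s                                  # dequantized (e.g. to BFloat16) for compute
    scaler.update(activations)
```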
## Supported running mode
- Both DNN Training and Inference are supported with FP8 data type.
+ Both DNN Training and Inference are supported with the FP8 data type.
`docs/tutorials/features/int4.md` (3 additions, 3 deletions)
@@ -1,9 +1,9 @@
- INT4 inference [GPU] (Experimentatal)
+ INT4 inference [GPU] (Experimental)
=====================================
- ## INT4 DataType
+ ## INT4 Data Type
- INT4 is 4-bit fixed point which is used to reduce memory footprint, improve the computation efficiency and save power in Deep Learning domain.
+ INT4 is a 4-bit fixed-point data type, which is used to reduce memory footprint, improve computation efficiency, and save power in the Deep Learning domain.
The INT4 data type is currently used for weight-only quantization. It will be converted to the Float16 data type for computation.
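To make "weight-only quantization" more concrete, the sketch below shows the usual recipe in plain PyTorch: weights are mapped to signed 4-bit integers with a per-output-channel scale, packed two values per byte for storage, then unpacked and dequantized to Float16 before the matmul. This is a generic illustration of the technique, not the API exposed by Intel® Extension for PyTorch\*.

```python
import torch

def quantize_int4_per_channel(w: torch.Tensor):
    """Symmetric 4-bit quantization of a [out_features, in_features] weight."""
    qmax = 7  # use the symmetric range [-7, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack two signed int4 values into one uint8 (assumes an even column count)."""
    u = (q + 8).to(torch.uint8)            # shift to [0, 15]
    return u[:, 0::2] | (u[:, 1::2] << 4)  # low nibble = even column, high = odd

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    low = (packed & 0x0F).to(torch.int8) - 8
    high = (packed >> 4).to(torch.int8) - 8
    return torch.stack([low, high], dim=-1).flatten(start_dim=1)

w = torch.randn(64, 128)                             # original FP32 weight
q, scale = quantize_int4_per_channel(w)
packed = pack_int4(q)                                # 4-bit storage: half the bytes of int8
w_fp16 = unpack_int4(packed).to(torch.float16) * scale.to(torch.float16)
x = torch.randn(8, 128, dtype=torch.float16)
y = x @ w_fp16.t()                                   # computation happens in Float16
```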
- Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This tutorial illustrates the work flow of quantization on Intel GPUs.
+ Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This section illustrates the quantization workflow on Intel GPUs.
- The overall view is that our usage follows the API defined in official PyTorch. Therefore, only small modification like moving model and data to GPU with to('xpu') is required. We highly recommend using the TorchScript for quantizing models. With graph model created via TorchScript, optimization like operator fusion (e.g. `conv_relu`) would be enabled automatically. This would deliver the best performance for int8 workloads.
+ Our usage follows the API defined in official PyTorch, so only small modifications, such as moving the model and data to the GPU with `to('xpu')`, are required. We highly recommend using TorchScript for quantizing models. With a graph model created via TorchScript, optimizations like operator fusion (e.g. `conv_relu`) are enabled automatically. This delivers the best performance for int8 workloads.
## Imperative Mode
```python
@@ -91,11 +91,11 @@ modelJit(inference_data)
print(modelJit.graph_for(inference_data))
```
- We need define QConfig for TorchScript module, use `prepare_jit` for inserting observer and use `convert_jit` for replacing FP32 modules.
+ We need to define a `QConfig` for the TorchScript module, use `prepare_jit` to insert observers, and use `convert_jit` to replace FP32 modules.
Before `prepare_jit`, create a ScriptModule using `torch.jit.script` or `torch.jit.trace`. `jit.trace` is recommended because it can capture the whole graph in most scenarios.
- Fusion ops like conv_unary, conv_binary, linear_unary (e.g. `conv_relu`, `conv_sum_relu`) are automatically enabled after model conversion (`convert_jit`). A warmup stage is required for bringing the fusion into effect. With the benefit from fusion, ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule as for performance consideration.
+ Fusion operations like `conv_unary`, `conv_binary`, `linear_unary` (e.g. `conv_relu`, `conv_sum_relu`) are automatically enabled after model conversion (`convert_jit`). A warmup stage is required to bring the fusion into effect. With the benefit of fusion, a ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule for performance reasons.
`modelJit.graph_for(input)` is useful to dump the inference graph and other graph related information for performance analysis.
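Putting the TorchScript path together, the end-to-end flow described above might look like the sketch below. It assumes `prepare_jit`/`convert_jit` come from `torch.quantization.quantize_jit`, and the tiny `Sequential` model and random calibration data are stand-ins for your own; the observer settings are an assumption, not the canonical recipe.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)
from torch.quantization.quantize_jit import prepare_jit, convert_jit

# A tiny stand-in model; substitute your own FP32 model here.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().to("xpu")
example_input = torch.randn(1, 3, 224, 224).to("xpu")

# Create a ScriptModule; jit.trace is recommended since it captures the whole graph.
modelJit = torch.jit.trace(model, example_input)

# Define a QConfig and insert observers.
qconfig = torch.quantization.QConfig(
    activation=torch.quantization.MinMaxObserver.with_args(
        qscheme=torch.per_tensor_symmetric, dtype=torch.quint8),
    weight=torch.quantization.default_weight_observer)
modelJit = prepare_jit(modelJit, {"": qconfig}, inplace=False)

# Calibrate with representative data, then replace FP32 modules with quantized ones.
with torch.no_grad():
    for _ in range(4):
        modelJit(torch.randn(1, 3, 224, 224).to("xpu"))
modelJit = convert_jit(modelJit, inplace=False)

# Warm up so fusions such as conv_relu take effect, then inspect the graph.
with torch.no_grad():
    modelJit(example_input)
print(modelJit.graph_for(example_input))
```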
- The Kineto supported profiler tool is an extension of PyTorch\* profiler for profiling operators' executing time cost on GPU devices. With this tool, users can get information in many fields of the run models or code scripts. User should build Intel® Extension for PyTorch\* with Kineto support as default and enable this tool by a`with` statement before the code segment.
+ The Kineto-supported profiler tool is an extension of the PyTorch\* profiler for profiling operators' execution time on GPU devices. With this tool, you can get profiling information on many aspects of the models or code scripts you run. Build Intel® Extension for PyTorch\* with Kineto support (the default) and enable this tool using the `with` statement before the code segment.
## Use Case
To use the Kineto supported profiler tool, you need to build Intel® Extension for PyTorch\* from source or install it via prebuilt wheel. You also have various methods to disable this tool.
### Build Tool
- The build option `USE_KINETO` is switched on as default but you can switch it off via setting `USE_KINETO=OFF` while building Intel® Extension for PyTorch\* from source. Besides, an affiliated build option `USE_ONETRACE` will be automatically switched on following the build option `USE_KINETO`. With `USE_KINETO=OFF`, no Kineto related profiler code will be compiled and all python scripts using Kineto supported profiler with XPU backend will not work. In this case, you can still keep using profiler on CPU backend.
+ The build option `USE_KINETO` is switched on by default, but you can switch it off by setting `USE_KINETO=OFF` while building Intel® Extension for PyTorch\* from source. Besides, an affiliated build option `USE_ONETRACE` will be automatically switched on following the build option `USE_KINETO`. With `USE_KINETO=OFF`, no Kineto-related profiler code will be compiled, and Python scripts that use the Kineto-supported profiler with the XPU backend will not work. In this case, you can still use the profiler on the CPU backend.
Some affiliated build options are defined for choosing different tracing tools. Currently, only the onetrace tool is supported. Configuring `USE_KINETO=ON` and `USE_ONETRACE=OFF` will not enable Kineto support in Intel® Extension for PyTorch\* on GPU.
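On the usage side, wrapping the code segment to be profiled in the `with` statement mentioned above could look like the sketch below. The XPU activity name (`ProfilerActivity.XPU`), the stand-in workload, and the chosen sort key are assumptions that may differ across PyTorch and extension versions, so check the documentation for your build.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)

# Tiny stand-in workload; substitute your own model and data.
model = torch.nn.Linear(1024, 1024).to("xpu")
data = torch.randn(64, 1024).to("xpu")

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.XPU,   # assumed name of the XPU activity
        ],
        record_shapes=True) as prof:
    with torch.no_grad():
        model(data)

# Per-operator summary; use a sort key supported by your version.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```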
`docs/tutorials/features/torch_compile_gpu.md` (1 addition, 1 deletion)
@@ -5,7 +5,7 @@ torch.compile for GPU
Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile) API through the default "inductor" backend ([TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747/1)). The Triton compiler has been the core of the Inductor codegen supporting various accelerator devices. Intel has extended TorchInductor by adding Intel GPU support to Triton. Additionally, post-op fusions for convolution and matrix multiplication, facilitated by oneDNN fusion kernels, contribute to enhanced efficiency for computationally intensive operations. Leveraging these features is as simple as using the default "inductor" backend, making it easier than ever to unlock the full potential of your PyTorch models on Intel GPU platforms.
- `torch.compile` for GPU is an experimental feature and available from 2.1.10. So far, the feature is functional on Intel® GPU Max Series.
+ **Note**: `torch.compile` for GPU is an experimental feature and is available from 2.1.10. So far, the feature is functional on the Intel® GPU Max Series.
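As a quick way to try the feature, a minimal sketch is shown below; the `Linear` model and random input are stand-ins for your own workload, and the compile call simply uses the default `inductor` backend described above.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)

# Stand-in workload; substitute your own model and input.
model = torch.nn.Linear(1024, 1024).to("xpu")
data = torch.randn(64, 1024).to("xpu")

# Default "inductor" backend; the first call triggers graph compilation.
compiled_model = torch.compile(model)

with torch.no_grad():
    output = compiled_model(data)
```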