`docs/tutorials/features/FSDP.md` (16 additions, 18 deletions)
@@ -3,14 +3,14 @@ Fully Sharded Data Parallel (FSDP)
## Introduction
- `Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provide industry-grade solution for large model training. FSDP is a type of data parallel training, unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states and gradients across DDP ranks to reduce the GPU memory footprint used in training, this makes the training of some large-scale models feasible. Please refer to [FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) for an introduction to FSDP.
+ `Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provides an industry-grade solution for large model training. FSDP is a type of data parallel training. Unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states, and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible. Please refer to the [FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) for an introduction to FSDP.
- To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing collectives as `AllGather`, `ReduceScatter` etc. needed by FSDP. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
- The Installation steps of Intel® oneCCL Bindings for Pytorch\* follows the same steps as DDP.
+ To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing collectives such as `AllGather`, `ReduceScatter`, and others needed by FSDP. Refer to the [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL.
+ To install Intel® oneCCL Bindings for Pytorch\*, follow the same installation steps as for DDP.
## FSDP Usage (GPU only)
- FSDP follows its usage in PyTorch. To use FSDP with Intel® Extension for PyTorch\*, make the following modifications to your model script:
+ FSDP is designed to align with PyTorch conventions. To use FSDP with Intel® Extension for PyTorch\*, make the following modifications to your model script:
1. Import the necessary packages.
```python
@@ -25,7 +25,7 @@ from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
dist.init_process_group(backend='ccl')
```
- 3. For FSDP with each process exclusively works on a single GPU, set the device ID as `local rank`.
+ 3. For FSDP where each process works exclusively on a single GPU, set the device ID to the `local rank`.
```python
torch.xpu.set_device("xpu:{}".format(rank))
# or
@@ -39,13 +39,13 @@ model = model.to(device)
model = FSDP(model, device_id=device)
```
- Note: for FSDP with XPU, you need to specify `device_ids` with XPU device, otherwise it will trigger the CUDA path and throw error.
+ **Note**: for FSDP with XPU, you need to specify `device_ids` with an XPU device; otherwise, it will trigger the CUDA path and throw an error.
- ## Example Usage:
+ ## Example
- Here's an example based on [PyTorch FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) to illustrate the usage of FSDP on XPU and the necessary changes from a CUDA case to XPU case.
+ Here's an example based on [PyTorch FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) to illustrate the usage of FSDP on XPU and the necessary changes to switch from CUDA to an XPU case.
- 1. Import necessary packages
+ 1. Import necessary packages:
```python
"""
@@ -82,7 +82,7 @@ from torch.distributed.fsdp.wrap import (
)
```
- 2.Distributed training setup
+ 2. Set up distributed training:
```python
"""
@@ -99,7 +99,7 @@ def cleanup():
dist.destroy_process_group()
```
- 3. Define the toy model for handwritten digit classification.
+ 3. Define the toy model for handwritten digit classification:
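The toy model definition itself falls outside the diff hunks shown above. For reference, a minimal handwritten-digit classifier of the kind used in the PyTorch FSDP tutorial looks roughly like the sketch below; the class name and layer sizes are illustrative and not taken from the changed file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Small CNN for 28x28 grayscale digits (illustrative toy model)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)  # 64 channels * 12 * 12 after conv + pooling
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        return F.log_softmax(self.fc2(x), dim=1)
```

In the XPU case, the wrapping shown earlier applies unchanged: move the model to `xpu:{rank}` and pass that device to `FSDP(model, device_id=device)`.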
`docs/tutorials/features/float8.md` (5 additions, 5 deletions)
@@ -1,21 +1,21 @@
- Float8 datatype support[GPU] (Experimental)
+ Float8 Data Type Support [GPU] (Experimental)
============================================
- ## Float8 DataType
+ ## Float8 Data Type
- Float8 (FP8) is 8-bit floating point which is used to reduce memory footprint, improve the computation efficiency and save power in Deep Learning domain.
+ Float8 (FP8) is an 8-bit floating point data type, which is used to reduce memory footprint, improve computation efficiency, and save power in the Deep Learning domain.
Two formats are used in FP8 training and inference, in order to meet the required value range and precision of activation, weight and gradient in Deep Neural Network (DNN). One is E4M3 (sign-exponent-mantissa) for activation and weight, the other is E5M2 for gradients. These two formats are defined in [FP8 FORMATS FOR DEEP LEARNING](https://arxiv.org/pdf/2209.05433.pdf).
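To make the range/precision trade-off concrete, the short sketch below computes the largest finite values the two layouts can represent, following the E4M3/E5M2 definitions in the paper cited above; the variable names are ours, and nothing here uses the Intel® Extension for PyTorch\* API.

```python
# E4M3: 4 exponent bits (bias 7), 3 mantissa bits. Only the all-ones
# exponent + all-ones mantissa pattern is reserved (NaN), so the largest
# finite value uses exponent 0b1111 and mantissa 0b110.
e4m3_max = (1 + 1/2 + 1/4) * 2 ** (15 - 7)    # 448.0

# E5M2: 5 exponent bits (bias 15), 2 mantissa bits. The all-ones exponent is
# reserved for inf/NaN (IEEE-style), so the largest finite value uses
# exponent 0b11110 and mantissa 0b11.
e5m2_max = (1 + 1/2 + 1/4) * 2 ** (30 - 15)   # 57344.0

print(f"E4M3 max: {e4m3_max}")   # more mantissa bits -> finer precision, narrower range
print(f"E5M2 max: {e5m2_max}")   # more exponent bits -> wider range for gradients
```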
- FP8 data type is used for memory storage only in current stage. It will be converted to BFloat16 data type for computation.
+ The FP8 data type is used for memory storage only at the current stage. It will be converted to the BFloat16 data type for computation.
## FP8 Quantization
On GPU, online Dynamic Quantization is used for FP8 data compression and decompression. The Delayed Scaling algorithm is used to accelerate the quantization process.
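The Delayed Scaling idea is that the scaling factor for the current step is derived from an amax (absolute-maximum) history recorded in previous steps, so quantization does not have to wait for the current tensor's statistics. A minimal sketch of that idea in plain PyTorch is shown below; it illustrates the algorithm and is not the extension's internal implementation.

```python
import torch

E4M3_MAX = 448.0  # largest finite E4M3 value (see above)

class DelayedScaler:
    """Toy delayed scaling: the scale for step t uses the amax history from steps < t."""

    def __init__(self, history_len: int = 16):
        self.amax_history = torch.ones(history_len)  # start with a neutral history
        self.step = 0

    def scale(self) -> torch.Tensor:
        # Derived from previously recorded amax values, not the current tensor.
        amax = self.amax_history.max().clamp(min=1e-12)
        return E4M3_MAX / amax

    def update(self, tensor: torch.Tensor) -> None:
        # Record the current tensor's amax for use in later steps.
        self.amax_history[self.step % len(self.amax_history)] = tensor.abs().max()
        self.step += 1

scaler = DelayedScaler()
for _ in range(4):
    activations = torch.randn(1024) * 3.0
    s = scaler.scale()                                          # uses history only
    fp8_payload = (activations * s).clamp(-E4M3_MAX, E4M3_MAX)  # would be cast to FP8 storage
    restored = fp8_payload / s                                  # dequantized (e.g. to BFloat16) for compute
    scaler.update(activations)
```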
## Supported running mode
- Both DNN Training and Inference are supported with FP8 data type.
+ Both DNN Training and Inference are supported with the FP8 data type.
`docs/tutorials/features/int4.md` (3 additions, 3 deletions)
@@ -1,9 +1,9 @@
- INT4 inference [GPU] (Experimentatal)
+ INT4 inference [GPU] (Experimental)
=====================================
- ## INT4 DataType
+ ## INT4 Data Type
- INT4 is 4-bit fixed point which is used to reduce memory footprint, improve the computation efficiency and save power in Deep Learning domain.
+ INT4 is a 4-bit fixed-point data type, which is used to reduce memory footprint, improve computation efficiency, and save power in the Deep Learning domain.
The INT4 data type is currently used for weight-only quantization. It will be converted to the Float16 data type for computation.
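To make "weight-only quantization" more concrete, the sketch below shows the usual recipe in plain PyTorch: weights are mapped to signed 4-bit integers with a per-output-channel scale, packed two values per byte for storage, then unpacked and dequantized to Float16 before the matmul. This is a generic illustration of the technique, not the API exposed by Intel® Extension for PyTorch\*.

```python
import torch

def quantize_int4_per_channel(w: torch.Tensor):
    """Symmetric 4-bit quantization of a [out_features, in_features] weight."""
    qmax = 7  # use the symmetric range [-7, 7]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack two signed int4 values into one uint8 (assumes an even column count)."""
    u = (q + 8).to(torch.uint8)            # shift to [0, 15]
    return u[:, 0::2] | (u[:, 1::2] << 4)  # low nibble = even column, high = odd

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    low = (packed & 0x0F).to(torch.int8) - 8
    high = (packed >> 4).to(torch.int8) - 8
    return torch.stack([low, high], dim=-1).flatten(start_dim=1)

w = torch.randn(64, 128)                             # original FP32 weight
q, scale = quantize_int4_per_channel(w)
packed = pack_int4(q)                                # 4-bit storage: half the bytes of int8
w_fp16 = unpack_int4(packed).to(torch.float16) * scale.to(torch.float16)
x = torch.randn(8, 128, dtype=torch.float16)
y = x @ w_fp16.t()                                   # computation happens in Float16
```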
- Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This tutorial illustrates the work flow of quantization on Intel GPUs.
+ Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This section illustrates the quantization workflow on Intel GPUs.
- The overall view is that our usage follows the API defined in official PyTorch. Therefore, only small modification like moving model and data to GPU with to('xpu') is required. We highly recommend using the TorchScript for quantizing models. With graph model created via TorchScript, optimization like operator fusion (e.g. `conv_relu`) would be enabled automatically. This would deliver the best performance for int8 workloads.
+ Our usage follows the API defined in official PyTorch, so only small modifications, such as moving the model and data to the GPU with `to('xpu')`, are required. We highly recommend using TorchScript for quantizing models. With a graph model created via TorchScript, optimizations like operator fusion (e.g. `conv_relu`) are enabled automatically. This delivers the best performance for int8 workloads.
## Imperative Mode
```python
@@ -91,11 +91,11 @@ modelJit(inference_data)
print(modelJit.graph_for(inference_data))
```
- We need define QConfig for TorchScript module, use `prepare_jit` for inserting observer and use `convert_jit` for replacing FP32 modules.
+ We need to define a `QConfig` for the TorchScript module, use `prepare_jit` to insert observers, and use `convert_jit` to replace FP32 modules.
Before `prepare_jit`, create a ScriptModule using `torch.jit.script` or `torch.jit.trace`. `jit.trace` is recommended because it can capture the whole graph in most scenarios.
- Fusion ops like conv_unary, conv_binary, linear_unary (e.g. `conv_relu`, `conv_sum_relu`) are automatically enabled after model conversion (`convert_jit`). A warmup stage is required for bringing the fusion into effect. With the benefit from fusion, ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule as for performance consideration.
+ Fusion operations like `conv_unary`, `conv_binary`, `linear_unary` (e.g. `conv_relu`, `conv_sum_relu`) are automatically enabled after model conversion (`convert_jit`). A warmup stage is required to bring the fusion into effect. With the benefit of fusion, a ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule for performance reasons.
`modelJit.graph_for(input)` is useful to dump the inference graph and other graph related information for performance analysis.
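Putting the TorchScript path together, the end-to-end flow described above might look like the sketch below. It assumes `prepare_jit`/`convert_jit` come from `torch.quantization.quantize_jit`, and the tiny `Sequential` model and random calibration data are stand-ins for your own; the observer settings are an assumption, not the canonical recipe.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)
from torch.quantization.quantize_jit import prepare_jit, convert_jit

# A tiny stand-in model; substitute your own FP32 model here.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval().to("xpu")
example_input = torch.randn(1, 3, 224, 224).to("xpu")

# Create a ScriptModule; jit.trace is recommended since it captures the whole graph.
modelJit = torch.jit.trace(model, example_input)

# Define a QConfig and insert observers.
qconfig = torch.quantization.QConfig(
    activation=torch.quantization.MinMaxObserver.with_args(
        qscheme=torch.per_tensor_symmetric, dtype=torch.quint8),
    weight=torch.quantization.default_weight_observer)
modelJit = prepare_jit(modelJit, {"": qconfig}, inplace=False)

# Calibrate with representative data, then replace FP32 modules with quantized ones.
with torch.no_grad():
    for _ in range(4):
        modelJit(torch.randn(1, 3, 224, 224).to("xpu"))
modelJit = convert_jit(modelJit, inplace=False)

# Warm up so fusions such as conv_relu take effect, then inspect the graph.
with torch.no_grad():
    modelJit(example_input)
print(modelJit.graph_for(example_input))
```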
- The Kineto supported profiler tool is an extension of PyTorch\* profiler for profiling operators' executing time cost on GPU devices. With this tool, users can get information in many fields of the run models or code scripts. User should build Intel® Extension for PyTorch\* with Kineto support as default and enable this tool by a`with` statement before the code segment.
+ The Kineto-supported profiler tool is an extension of the PyTorch\* profiler for profiling operators' execution time on GPU devices. With this tool, you can get profiling information on many aspects of the models or code scripts you run. Build Intel® Extension for PyTorch\* with Kineto support (the default) and enable this tool using the `with` statement before the code segment.
## Use Case
To use the Kineto supported profiler tool, you need to build Intel® Extension for PyTorch\* from source or install it via prebuilt wheel. You also have various methods to disable this tool.
### Build Tool
- The build option `USE_KINETO` is switched on as default but you can switch it off via setting `USE_KINETO=OFF` while building Intel® Extension for PyTorch\* from source. Besides, an affiliated build option `USE_ONETRACE` will be automatically switched on following the build option `USE_KINETO`. With `USE_KINETO=OFF`, no Kineto related profiler code will be compiled and all python scripts using Kineto supported profiler with XPU backend will not work. In this case, you can still keep using profiler on CPU backend.
+ The build option `USE_KINETO` is switched on by default, but you can switch it off by setting `USE_KINETO=OFF` while building Intel® Extension for PyTorch\* from source. Besides, an affiliated build option `USE_ONETRACE` will be automatically switched on following the build option `USE_KINETO`. With `USE_KINETO=OFF`, no Kineto-related profiler code will be compiled, and Python scripts that use the Kineto-supported profiler with the XPU backend will not work. In this case, you can still use the profiler on the CPU backend.
Some affiliated build options are defined for choosing different tracing tools. Currently, only the onetrace tool is supported. Configuring `USE_KINETO=ON` and `USE_ONETRACE=OFF` will not enable Kineto support in Intel® Extension for PyTorch\* on GPU.
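On the usage side, wrapping the code segment to be profiled in the `with` statement mentioned above could look like the sketch below. The XPU activity name (`ProfilerActivity.XPU`), the stand-in workload, and the chosen sort key are assumptions that may differ across PyTorch and extension versions, so check the documentation for your build.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)

# Tiny stand-in workload; substitute your own model and data.
model = torch.nn.Linear(1024, 1024).to("xpu")
data = torch.randn(64, 1024).to("xpu")

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.XPU,   # assumed name of the XPU activity
        ],
        record_shapes=True) as prof:
    with torch.no_grad():
        model(data)

# Per-operator summary; use a sort key supported by your version.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```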
`docs/tutorials/features/torch_compile_gpu.md` (1 addition, 1 deletion)
@@ -5,7 +5,7 @@ torch.compile for GPU
Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile) API through the default "inductor" backend ([TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747/1)). The Triton compiler has been the core of the Inductor codegen supporting various accelerator devices. Intel has extended TorchInductor by adding Intel GPU support to Triton. Additionally, post-op fusions for convolution and matrix multiplication, facilitated by oneDNN fusion kernels, contribute to enhanced efficiency for computationally intensive operations. Leveraging these features is as simple as using the default "inductor" backend, making it easier than ever to unlock the full potential of your PyTorch models on Intel GPU platforms.
- `torch.compile` for GPU is an experimental feature and available from 2.1.10. So far, the feature is functional on Intel® GPU Max Series.
+ **Note**: `torch.compile` for GPU is an experimental feature and is available from 2.1.10. So far, the feature is functional on the Intel® GPU Max Series.
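As a quick way to try the feature, a minimal sketch is shown below; the `Linear` model and random input are stand-ins for your own workload, and the compile call simply uses the default `inductor` backend described above.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the 'xpu' device)

# Stand-in workload; substitute your own model and input.
model = torch.nn.Linear(1024, 1024).to("xpu")
data = torch.randn(64, 1024).to("xpu")

# Default "inductor" backend; the first call triggers graph compilation.
compiled_model = torch.compile(model)

with torch.no_grad():
    output = compiled_model(data)
```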