Building PyTorch from Scratch (with GPU Support and Automatic Differentiation)

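Build your own deep learning framework based on C/C++, CUDA, and Python, with GPU support and automatic differentiation

Image created by the author with the assistance of AI (https://copilot.microsoft.com/images/create)

Introduction

I have been using PyTorch to build and train deep learning models for many years. Even though I have learned its syntax and rules, something has always aroused my curiosity: what is happening internally during these operations? How does it all work?

If you have gotten this far, you probably have the same questions. If I ask you how to create and train a model in PyTorch, you will probably come up with something similar to the code below: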

    import torch
    import torch.nn as nn
    import torch.optim as optim

    class MyModel(nn.Module):
        def __init__(self):
            super(MyModel, self).__init__()
            self.fc1 = nn.Linear(1, 10)  
            self.sigmoid = nn.Sigmoid()  
            self.fc2 = nn.Linear(10, 1)  

        def forward(self, x):  
            out = self.fc1(x)  
            out = self.sigmoid(out)  
            out = self.fc2(out)  

            return out

    ...  

    model = MyModel().to(device)  
    criterion = nn.MSELoss()  
    optimizer = optim.SGD(model.parameters(), lr=0.001)  

    for epoch in range(epochs):
        for x, y in ...:

            x = x.to(device)  
            y = y.to(device)  

            outputs = model(x)  
            loss = criterion(outputs, y)  

            optimizer.zero_grad()  
            loss.backward()  
            optimizer.step()

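But what if I ask you how this backward step works? Or, for instance, what happens when you reshape a tensor? Is the data rearranged internally? How does that happen? Why is PyTorch so fast? How does PyTorch handle GPU operations? These are the kinds of questions that have always intrigued me, and I imagine they intrigue you too. So, in order to better understand these concepts, what is better than building your own tensor library from scratch? That is what you will learn in this article!

1 — Tensor

To build a tensor library, the first concept you need to understand is, obviously: what is a tensor?

You may have an intuitive idea that a tensor is a mathematical concept of an n-dimensional data structure that contains some numbers. But here we need to understand how to model this data structure from a computational perspective. We can think of a tensor as consisting of the data itself plus some metadata describing aspects of the tensor, such as its shape or the device it lives on (i.e. CPU memory, GPU memory, etc.).

Image by author

There is also a less well-known piece of metadata that you may never have heard of, called the stride. This concept is very important for understanding the internals of tensor data rearrangement, so we need to discuss it in a bit more detail.

Imagine a 2-D tensor with shape [4, 8], illustrated below.

4x8 tensor (Image by author)

The data (i.e. float numbers) of the tensor is actually stored as a 1-dimensional array in memory:

1-D data array of the tensor (Image by author)

So, in order to represent this 1-dimensional array as an N-dimensional tensor, we use strides. The basic idea is the following:

We have a matrix with 4 rows and 8 columns. Considering that all of its elements are laid out row by row in the 1-dimensional array, if we want to access the value at position [2, 3], we need to traverse 2 full rows (of 8 elements each) plus 3 positions. In mathematical terms, we need to traverse 3 + 2 * 8 elements of the 1-dimensional array:

Image by author

So this "8" is the stride of the first dimension (the rows), which here equals the size of the second dimension. It tells us how many elements of the array we need to skip in order to "jump" to the next position along that dimension.

Thus, to access the element [i, j] of a 2-dimensional tensor with shape [shape_0, shape_1], we basically need to access the element at position j + i * shape_1.

Now, let's imagine a 3-dimensional tensor:

5x4x8 tensor (Image by author)

You can think of this 3-dimensional tensor as a sequence of matrices. For example, you can think of this [5, 4, 8] tensor as 5 matrices of shape [4, 8].

Now, in order to access the element at position [1, 2, 7], you need to traverse 1 complete matrix of shape [4, 8], 2 rows of 8 elements each, and 7 columns of 1 element each. So, you need to traverse (1 * 4 * 8) + (2 * 8) + (7 * 1) positions in the 1-dimensional array.

Image by author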

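Thus, to access the element [i][j][k] of a 3-D tensor with shape [shape_0, shape_1, shape_2] in the 1-D data array, you compute its position as:

$$\text{index} = i \cdot (\text{shape}_1 \cdot \text{shape}_2) + j \cdot \text{shape}_2 + k \cdot 1$$

Here shape_1 * shape_2 is the stride of the first dimension, shape_2 is the stride of the second dimension, and 1 is the stride of the third dimension.

Then, in order to generalize to an n-dimensional tensor:

$$\text{index} = \sum_{d=0}^{n-1} \text{indices}[d] \cdot \text{stride}[d]$$

The stride of each dimension can be calculated as the product of the shapes of the dimensions that follow it:

$$\text{stride}[d] = \prod_{j=d+1}^{n-1} \text{shape}[j]$$

Then we set stride[n-1] to 1.

In our tensor example of shape [5, 4, 8], the strides are [4*8, 8, 1] = [32, 8, 1].

You can test it on your own: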

    import torch  

    torch.rand([5, 4, 8]).stride()  
    #(32, 8, 1)
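
The same rule can be sketched in plain Python, without PyTorch. This is only an illustration of the formulas above; the helper names `compute_strides` and `flat_index` are my own and are not part of PyTorch or Norch.

```python
def compute_strides(shape):
    """Row-major strides: stride[i] is the product of shape[i+1:]."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def flat_index(indices, strides):
    """Position of an N-D index in the underlying 1-D array."""
    return sum(i * s for i, s in zip(indices, strides))


strides = compute_strides([5, 4, 8])
print(strides)                         # [32, 8, 1]
print(flat_index([1, 2, 7], strides))  # 55
```

The value 55 is exactly the (1 * 4 * 8) + (2 * 8) + (7 * 1) from the 3-D example.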

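OK, but why do we need shapes and strides? Beyond accessing elements of an N-dimensional tensor stored as a 1-dimensional array, this concept can be used to manipulate tensor arrangements very easily.

For example, to reshape a (contiguous) tensor, you just need to set the new shape and calculate the new strides based on it! (The new shape must keep the same number of elements.)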

    import torch

    t = torch.rand([5, 4, 8])

    print(t.shape)
    # [5, 4, 8]

    print(t.stride())
    # [32, 8, 1]

    new_t = t.reshape([4, 5, 2, 2, 2])

    print(new_t.shape)
    # [4, 5, 2, 2, 2]

    print(new_t.stride())
    # [40, 8, 4, 2, 1]

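Internally, the tensor is still stored as the same 1-dimensional array. The reshape method did not change the order of the elements within the array! That's amazing, isn't it? 😁

You can verify this on your own using the following function, which accesses the internal 1-dimensional array of a PyTorch tensor: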

    import ctypes  

    def print_internal(t: torch.Tensor):
        print(
            torch.frombuffer(
                ctypes.string_at(t.data_ptr(), t.storage().nbytes()), dtype=t.dtype
            )
        )

    print_internal(t)
    # [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

    print_internal(new_t)
    # [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...
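
To make the "same array, new metadata" idea concrete, here is a minimal pure-Python sketch. The `View` class is hypothetical (it is neither PyTorch's nor Norch's implementation), and it assumes the storage is laid out contiguously in row-major order.

```python
class View:
    """A toy tensor view: shared flat storage plus its own shape and strides."""

    def __init__(self, storage, shape):
        self.storage = storage  # shared, never copied
        self.shape = list(shape)
        self.strides = [1] * len(shape)
        for i in range(len(shape) - 2, -1, -1):
            self.strides[i] = self.strides[i + 1] * shape[i + 1]

    def reshape(self, new_shape):
        size = 1
        for dim in new_shape:
            size *= dim
        if size != len(self.storage):
            raise ValueError("new shape must keep the number of elements")
        # Only the metadata changes; the storage object is reused as-is
        return View(self.storage, new_shape)

    def __getitem__(self, indices):
        return self.storage[sum(i * s for i, s in zip(indices, self.strides))]


t = View(list(range(24)), [2, 3, 4])
r = t.reshape([4, 6])
print(r.strides)               # [6, 1]
print(r.storage is t.storage)  # True
print(t[1, 2, 3], r[3, 5])     # 23 23
```

Both views read element 23 from the same list; only the way an index is mapped to a position differs.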

Or, for instance, say you want to transpose two axes. Internally, you just need to swap the respective strides (and shape entries)!

    t = torch.arange(0, 24).reshape(2, 3, 4)  
    print(t)  
    # [[[ 0,  1,  2,  3],  
    #   [ 4,  5,  6,  7],  
    #   [ 8,  9, 10, 11]],  

    #  [[12, 13, 14, 15],  
    #   [16, 17, 18, 19],  
    #   [20, 21, 22, 23]]]  

    print(t.shape)  
    # [2, 3, 4]  

    print(t.stride())  
    # [12, 4, 1]  

    new_t = t.transpose(0, 1)  
    print(new_t)  
    # [[[ 0,  1,  2,  3],  
    #   [12, 13, 14, 15]],  

    #  [[ 4,  5,  6,  7],  
    #   [16, 17, 18, 19]],  

    #  [[ 8,  9, 10, 11],  
    #   [20, 21, 22, 23]]]  

    print(new_t.shape)  
    # [3, 2, 4]  

    print(new_t.stride())  
    # [4, 12, 1]

If you print the internal array, both have the same values:

    print_internal(t)  
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]  

    print_internal(new_t)  
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

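However, the strides of new_t no longer match the formula I showed above. This happens because the tensor is no longer contiguous. That is, although the internal array remains the same, the order of its values in memory does not match the actual order of the tensor.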

    t.is_contiguous()  
    # True  

    new_t.is_contiguous()  
    # False

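This means that accessing the elements of a non-contiguous tensor in sequence is less efficient (since the actual tensor elements are not laid out sequentially in memory). To fix that, we can do: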

    new_t_contiguous = new_t.contiguous()  

    print(new_t_contiguous.is_contiguous())  
    # True

If we analyze the internal array, its order now matches the actual tensor order, which can provide better memory access efficiency:

    print(new_t)  
    # [[[ 0,  1, 2, 3],  
    #   [12, 13, 14, 15]],  

    #  [[ 4, 5,  6,  7],  
    #   [16, 17, 18, 19]],  

    #  [[ 8, 9, 10, 11],  
    #   [20, 21, 22, 23]]]  

    print_internal(new_t)  
    # [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]  

    print_internal(new_t_contiguous)  
    # [ 0, 1, 2, 3, 12, 13, 14, 15, 4, 5, 6, 7, 16, 17, 18, 19, 8, 9, 10, 11, 20, 21, 22, 23]
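
The transpose and contiguity behavior above can be reproduced with a few lines of plain Python. This is a simplified sketch under my own assumptions: the function names are hypothetical, and the contiguity check is stricter than PyTorch's (which, for instance, ignores dimensions of size 1).

```python
from itertools import product


def transpose(shape, strides, axis0, axis1):
    """Swap two axes by swapping their shape and stride entries only."""
    shape, strides = list(shape), list(strides)
    shape[axis0], shape[axis1] = shape[axis1], shape[axis0]
    strides[axis0], strides[axis1] = strides[axis1], strides[axis0]
    return shape, strides


def is_contiguous(shape, strides):
    """True when the strides match the row-major strides implied by shape."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if stride != expected:
            return False
        expected *= dim
    return True


def make_contiguous(storage, shape, strides):
    """Copy elements in logical order so memory order matches tensor order."""
    return [
        storage[sum(i * s for i, s in zip(index, strides))]
        for index in product(*(range(dim) for dim in shape))
    ]


storage = list(range(24))
shape, strides = transpose([2, 3, 4], [12, 4, 1], 0, 1)
print(shape, strides)                                # [3, 2, 4] [4, 12, 1]
print(is_contiguous(shape, strides))                 # False
print(make_contiguous(storage, shape, strides)[:8])  # [0, 1, 2, 3, 12, 13, 14, 15]
```

The copy produced by `make_contiguous` has the same element order as the internal array of `new_t_contiguous` shown above.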

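Now that we understand how a tensor is modeled, let's start creating our library!

I will call it Norch, which stands for NOT PyTorch and also alludes to my last name, Nogueira 😁

The first thing to know is that although PyTorch is used through Python, internally it runs C/C++. So we will first create our internal C/C++ functions.

We can first define a tensor as a struct to store its data and metadata, and create a function to instantiate it: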

    //norch/csrc/tensor.cpp  

    #include <stdio.h>  
    #include <stdlib.h>  
    #include <string.h>  
    #include <math.h>  

    typedef struct {  
        float* data;  
        int* strides;  
        int* shape;  
        int ndim;  
        int size;  
        char* device;  
    } Tensor;  

    Tensor* create_tensor(float* data, int* shape, int ndim) {  

        Tensor* tensor = (Tensor*)malloc(sizeof(Tensor));  
        if (tensor == NULL) {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);  
        }  
        tensor->data = data;  
        tensor->shape = shape;  
        tensor->ndim = ndim;  

        tensor->size = 1;  
        for (int i = 0; i < ndim; i++) {  
            tensor->size *= shape[i];  
        }  

        tensor->strides = (int*)malloc(ndim * sizeof(int));  
        if (tensor->strides == NULL) {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);  
        }  
        int stride = 1;  
        for (int i = ndim - 1; i >= 0; i--) {  
            tensor->strides[i] = stride;  
            stride *= shape[i];  
        }  

        return tensor;  
    }

In order to access some element, we can take advantage of strides, as we learned before:

    //norch/csrc/tensor.cpp  

    float get_item(Tensor* tensor, int* indices) {  
        int index = 0;  
        for (int i = 0; i < tensor->ndim; i++) {  
            index += indices[i] * tensor->strides[i];  
        }  

        float result;  
        result = tensor->data[index];  

        return result;  
    }

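Now we can create the tensor operations. I will show some examples, and you can find the complete version in the repository linked at the end of this article.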

    //norch/csrc/cpu.cpp  

    void add_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {  

        for (int i = 0; i < tensor1->size; i++) {  
            result_data[i] = tensor1->data[i] + tensor2->data[i];  
        }  
    }  

    void sub_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {  

        for (int i = 0; i < tensor1->size; i++) {  
            result_data[i] = tensor1->data[i] - tensor2->data[i];  
        }  
    }  

    void elementwise_mul_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {  

        for (int i = 0; i < tensor1->size; i++) {  
            result_data[i] = tensor1->data[i] * tensor2->data[i];  
        }  
    }  

    void assign_tensor_cpu(Tensor* tensor, float* result_data) {  

        for (int i = 0; i < tensor->size; i++) {  
            result_data[i] = tensor->data[i];  
        }  
    }  

    ...

After that, we can create our other tensor functions that will call these operations:

    //norch/csrc/tensor.cpp  

    Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {  
        if (tensor1->ndim != tensor2->ndim) {  
            fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
            exit(1);  
        }  

        int ndim = tensor1->ndim;  
        int* shape = (int*)malloc(ndim * sizeof(int));  
        if (shape == NULL) {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);  
        }  

        for (int i = 0; i < ndim; i++) {  
            if (tensor1->shape[i] != tensor2->shape[i]) {  
                fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape[i], tensor2->shape[i], i);
                exit(1);  
            }  
            shape[i] = tensor1->shape[i];  
        }          
        float* result_data = (float*)malloc(tensor1->size * sizeof(float));  
        if (result_data == NULL) {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);
        }
        add_tensor_cpu(tensor1, tensor2, result_data);

        // create_tensor takes three arguments at this stage (the device is introduced later)
        return create_tensor(result_data, shape, ndim);
    }

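As mentioned before, reshaping a tensor does not change the order of its internal data array (this simple implementation copies the array into the new tensor, but the element order is untouched):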

    //norch/csrc/tensor.cpp  

    Tensor* reshape_tensor(Tensor* tensor, int* new_shape, int new_ndim) {  

        int ndim = new_ndim;  
        int* shape = (int*)malloc(ndim * sizeof(int));  
        if (shape == NULL) {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);  
        }  

        for (int i = 0; i < ndim; i++) {  
            shape[i] = new_shape[i];  
        }  

        // Calculate the total number of elements in the new shape
        int size = 1;  
        for (int i = 0; i < new_ndim; i++) {  
            size *= shape[i];  
        }  

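        // Check that the total number of elements matches the current tensor's size
        if (size != tensor->size) {
            fprintf(stderr, "Cannot reshape tensor. Total number of elements in new shape does not match the current size of the tensor.\n");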
            exit(1);  
        }  

        float* result_data = (float*)malloc(tensor->size * sizeof(float));  
        if (result_data == NULL) {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);
        }
        assign_tensor_cpu(tensor, result_data);
        // create_tensor takes three arguments at this stage (the device is introduced later)
        return create_tensor(result_data, shape, ndim);
    }

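Although we can now do some tensor operations, nobody deserves to run them using C/C++, right? Let's start building our Python wrapper!

There are many options for running C/C++ code from Python, such as Pybind11 and Cython. For our example, I will use ctypes.

The basic structure of ctypes is shown below:

    // C code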
    #include <stdio.h>  

    float add_floats(float a, float b) {  
        return a + b;  
    }

    # Compile
    gcc -shared -o add_floats.so -fPIC add_floats.c

    # Python code
    import ctypes

    # Load the shared library
    lib = ctypes.CDLL('./add_floats.so')

    # Define the argument and return types of the function
    lib.add_floats.argtypes = [ctypes.c_float, ctypes.c_float]
    lib.add_floats.restype = ctypes.c_float

    # Convert the Python floats to c_float
    a = ctypes.c_float(3.5)
    b = ctypes.c_float(2.2)

    # Call the C function
    result = lib.add_floats(a, b)
    print(result)
    # approximately 5.7 (the sum is computed in single precision)

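As you can see, it is very intuitive. After compiling the C/C++ code, you can use it from Python very easily with ctypes. You just need to define the parameter and return c_types of the function, convert the variables to their respective c_types, and call the function. For more complex types such as arrays (lists of floats), you can use pointers.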

    data = [1.0, 2.0, 3.0]  
    data_ctype = (ctypes.c_float * len(data))(*data)  

    lib.some_array_func.argtypes = [ctypes.POINTER(ctypes.c_float)]  

    ...  

    # Pass the ctypes array (not the Python list); it is converted to a float pointer
    lib.some_array_func(data_ctype)

And for struct types, we can create our own c_type:

    class CustomType(ctypes.Structure):  
        _fields_ = [  
            ('field1', ctypes.POINTER(ctypes.c_float)),  
            ('field2', ctypes.POINTER(ctypes.c_int)),  
            ('field3', ctypes.c_int),  
        ]  

    # Can be used as ctypes.POINTER(CustomType)
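
The struct mechanics can be tried out without compiling any C code. The sketch below uses a hypothetical `CTensorSketch` struct (a cut-down stand-in, not the real Norch struct) and only relies on standard ctypes calls: it fills the pointer fields from Python arrays and reads them back through a pointer, which is what a C function receiving the struct pointer would see.

```python
import ctypes


class CTensorSketch(ctypes.Structure):
    # Hypothetical struct for illustration: two pointer fields and one int field
    _fields_ = [
        ("data", ctypes.POINTER(ctypes.c_float)),
        ("shape", ctypes.POINTER(ctypes.c_int)),
        ("ndim", ctypes.c_int),
    ]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
shape = [2, 3]

# Keep references to the ctypes arrays: the struct only stores raw pointers to them
data_ctype = (ctypes.c_float * len(data))(*data)
shape_ctype = (ctypes.c_int * len(shape))(*shape)

tensor = CTensorSketch(
    ctypes.cast(data_ctype, ctypes.POINTER(ctypes.c_float)),
    ctypes.cast(shape_ctype, ctypes.POINTER(ctypes.c_int)),
    len(shape),
)
tensor_ptr = ctypes.pointer(tensor)  # the Python-side equivalent of a CTensorSketch*

print(tensor_ptr.contents.ndim)                                         # 2
print([tensor_ptr.contents.shape[i] for i in range(len(shape))])        # [2, 3]
print(tensor_ptr.contents.data[4])                                      # 5.0
```

Holding on to `data_ctype` and `shape_ctype` matters: if Python garbage-collects them, the pointers inside the struct dangle.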

After this brief explanation, let's build the Python wrapper for our tensor C/C++ library!

    # norch/tensor.py  

    import ctypes
    import os

    class CTensor(ctypes.Structure):  
        _fields_ = [  
            ('data', ctypes.POINTER(ctypes.c_float)),  
            ('strides', ctypes.POINTER(ctypes.c_int)),  
            ('shape', ctypes.POINTER(ctypes.c_int)),  
            ('ndim', ctypes.c_int),  
            ('size', ctypes.c_int),  
        ]  

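    class Tensor:
        os.path.abspath(os.curdir)
        _C = ctypes.CDLL("COMPILED_LIB.so")

        def __init__(self, data=None):
            if data is None:
                # Empty wrapper: methods such as reshape() and __add__() fill in the fields themselves
                return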

            data, shape = self.flatten(data)  
            self.data_ctype = (ctypes.c_float * len(data))(*data)  
            self.shape_ctype = (ctypes.c_int * len(shape))(*shape)  
            self.ndim_ctype = ctypes.c_int(len(shape))  

            self.shape = shape  
            self.ndim = len(shape)  

            Tensor._C.create_tensor.argtypes = [ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_int), ctypes.c_int]  
            Tensor._C.create_tensor.restype = ctypes.POINTER(CTensor)  

            self.tensor = Tensor._C.create_tensor(  
                self.data_ctype,  
                self.shape_ctype,  
                self.ndim_ctype,  
            )  

        def flatten(self, nested_list):  
            """  
            This method converts a nested-list tensor into a flattened tensor along with its shape

            Example:

            Arguments:
                nested_list: [[1, 2, 3], [-5, 2, 0]]
            Return:
                flat_data: [1, 2, 3, -5, 2, 0]  
                shape: [2, 3]  
            """  
            def flatten_recursively(nested_list):  
                flat_data = []  
                shape = []  
                if isinstance(nested_list, list):  
                    for sublist in nested_list:  
                        inner_data, inner_shape = flatten_recursively(sublist)  
                        flat_data.extend(inner_data)  
                    shape.append(len(nested_list))  
                    shape.extend(inner_shape)  
                else:  
                    flat_data.append(nested_list)  
                return flat_data, shape  

            flat_data, shape = flatten_recursively(nested_list)  
            return flat_data, shape

Now we include the Python tensor operations that call the C/C++ operations.

    # norch/tensor.py  

    def __getitem__(self, indices):  
        """  
        Access tensor by index: tensor[i, j, k...]
        """

        if len(indices) != self.ndim:
            raise ValueError("Number of indices must match the number of dimensions")

        Tensor._C.get_item.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int)]  
        Tensor._C.get_item.restype = ctypes.c_float  

        indices = (ctypes.c_int * len(indices))(*indices)  
        value = Tensor._C.get_item(self.tensor, indices)    

        return value  

    def reshape(self, new_shape):  
        """  
        Reshape tensor
        result = tensor.reshape([1,2])  
        """  
        new_shape_ctype = (ctypes.c_int * len(new_shape))(*new_shape)  
        new_ndim_ctype = ctypes.c_int(len(new_shape))  

        Tensor._C.reshape_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int), ctypes.c_int]  
        Tensor._C.reshape_tensor.restype = ctypes.POINTER(CTensor)  
        result_tensor_ptr = Tensor._C.reshape_tensor(self.tensor, new_shape_ctype, new_ndim_ctype)     

        result_data = Tensor()  
        result_data.tensor = result_tensor_ptr  
        result_data.shape = new_shape.copy()  
        result_data.ndim = len(new_shape)  
        result_data.device = self.device  

        return result_data  

    def __add__(self, other):  
        """  
        Add tensors
        result = tensor1 + tensor2
        """

        if self.shape != other.shape:
            raise ValueError("Tensors must have the same shape for addition")

        Tensor._C.add_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(CTensor)]  
        Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)  

        result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)  

        result_data = Tensor()  
        result_data.tensor = result_tensor_ptr  
        result_data.shape = self.shape.copy()  
        result_data.ndim = self.ndim  
        result_data.device = self.device  

        return result_data  

    # Include the other operations:
    # __str__  
    # __sub__ (-)  
    # __mul__ (*)  
    # __matmul__ (@)  
    # __pow__ (**)  
    # __truediv__ (/)  
    # log  
    # ...

If you got here, you are now able to run the code and start doing some tensor operations!

    import norch  

    tensor1 = norch.Tensor([[1, 2, 3], [3, 2, 1]])  
    tensor2 = norch.Tensor([[3, 2, 1], [1, 2, 3]])  

    result = tensor1 + tensor2  
    print(result[0, 0])  
    # 4.0

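2 — GPU Support

After creating the basic structure of our library, we will now take it to a new level. It is well known that you can call .to("cuda") to send data to the GPU and run math operations faster. I will assume that you have basic knowledge of how CUDA works, but if you don't, you can read my other article: CUDA tutorial. I'll wait here. 😊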

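For those in a hurry, here is a simple introduction:

Basically, all the code we have built so far runs on CPU memory. Although CPUs are faster for single operations, the advantage of GPUs lies in their parallelization capabilities. While CPU design aims to execute a sequence of operations (threads) quickly (but can only execute dozens of them at once), GPU design aims to execute millions of operations in parallel (by sacrificing individual thread performance).

So, we can take advantage of this capability to perform operations in parallel. For example, in a million-element tensor addition, instead of adding the elements at each index sequentially inside a loop, a GPU lets us add them all in parallel. To do that, we can use CUDA, a platform developed by NVIDIA that lets developers integrate GPU support into their software applications.

In order to do that, you can use CUDA C/C++, a simple C/C++-based interface designed to run specific GPU operations (such as copying data from CPU memory to GPU memory).

The code below basically uses some CUDA C/C++ functions to copy data from the CPU to the GPU and runs the AddTwoArrays function (also called a kernel) on N GPU threads in parallel, each one responsible for adding a different element of the array.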

    #include <stdio.h>

    #define N 1000 // Size of the arrays (a macro, so the CPU function can see it too)

    // CPU version for comparison
    void AddTwoArrays_CPU(float A[], float B[], float C[]) {
        for (int i = 0; i < N; i++) {
            C[i] = A[i] + B[i];
        }
    }

    // Kernel definition
    __global__ void AddTwoArrays_GPU(float A[], float B[], float C[]) {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main() {

        float A[N], B[N], C[N]; // Arrays A, B, and C

        ...

        float *d_A, *d_B, *d_C; // Device pointers for arrays A, B, and C

        // Allocate memory on the device for arrays A, B, and C
        cudaMalloc((void **)&d_A, N * sizeof(float));
        cudaMalloc((void **)&d_B, N * sizeof(float));
        cudaMalloc((void **)&d_C, N * sizeof(float));

        // Copy arrays A and B from host to device
        cudaMemcpy(d_A, A, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, N * sizeof(float), cudaMemcpyHostToDevice);

        // Kernel call with N threads
        AddTwoArrays_GPU<<<1, N>>>(d_A, d_B, d_C);

        // Copy vector C from device to host
        cudaMemcpy(C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    }

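As you can see, instead of adding each pair of elements one after another, we run all the addition operations in parallel, getting rid of the loop instruction.

After this brief introduction, we can go back to our tensor library.

The first step is to create a function to send tensor data from CPU to GPU and vice versa.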

    //norch/csrc/tensor.cpp  

    void to_device(Tensor* tensor, char* target_device) {  
        if ((strcmp(target_device, "cuda") == 0) && (strcmp(tensor->device, "cpu") == 0)) {  
            cpu_to_cuda(tensor);  
        }  

        else if ((strcmp(target_device, "cpu") == 0) && (strcmp(tensor->device, "cuda") == 0)) {  
            cuda_to_cpu(tensor);  
        }  
    }

    //norch/csrc/cuda.cu  

    __host__ void cpu_to_cuda(Tensor* tensor) {  

        float* data_tmp;  
        cudaMalloc((void **)&data_tmp, tensor->size * sizeof(float));  
        cudaMemcpy(data_tmp, tensor->data, tensor->size * sizeof(float), cudaMemcpyHostToDevice);  

        tensor->data = data_tmp;  

        const char* device_str = "cuda";  
        tensor->device = (char*)malloc(strlen(device_str) + 1);  
        strcpy(tensor->device, device_str);   

        printf("Successfully sent tensor to: %s\n", tensor->device);
    }  

    __host__ void cuda_to_cpu(Tensor* tensor) {  
        float* data_tmp = (float*)malloc(tensor->size * sizeof(float));  

        cudaMemcpy(data_tmp, tensor->data, tensor->size * sizeof(float), cudaMemcpyDeviceToHost);  
        cudaFree(tensor->data);  

        tensor->data = data_tmp;  

        const char* device_str = "cpu";  
        tensor->device = (char*)malloc(strlen(device_str) + 1);  
        strcpy(tensor->device, device_str);   

        printf("Successfully sent tensor to: %s\n", tensor->device);
    }

Python wrapper:

    # norch/tensor.py  

    def to(self, device):  
        self.device = device  
        self.device_ctype = self.device.encode('utf-8')  

        Tensor._C.to_device.argtypes = [ctypes.POINTER(CTensor), ctypes.c_char_p]  
        Tensor._C.to_device.restype = None  
        Tensor._C.to_device(self.tensor, self.device_ctype)  

        return self

Then, we create GPU versions of all our tensor operations. I will write examples for addition and subtraction:

    //norch/csrc/cuda.cu  

    #define THREADS_PER_BLOCK 128  

    __global__ void add_tensor_cuda_kernel(float* data1, float* data2, float* result_data, int size) {  

        int i = blockIdx.x * blockDim.x + threadIdx.x;  
        if (i < size) {  
            result_data[i] = data1[i] + data2[i];  
        }  
    }  

    __host__ void add_tensor_cuda(Tensor* tensor1, Tensor* tensor2, float* result_data) {  

        int number_of_blocks = (tensor1->size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;  
        add_tensor_cuda_kernel<<<number_of_blocks, THREADS_PER_BLOCK>>>(tensor1->data, tensor2->data, result_data, tensor1->size);  

        cudaError_t error = cudaGetLastError();  
        if (error != cudaSuccess) {  
            printf("CUDA error: %s\n", cudaGetErrorString(error));  
            exit(-1);  
        }  

        cudaDeviceSynchronize();  
    }  

    __global__ void sub_tensor_cuda_kernel(float* data1, float* data2, float* result_data, int size) {  

        int i = blockIdx.x * blockDim.x + threadIdx.x;  
        if (i < size) {  
            result_data[i] = data1[i] - data2[i];  
        }  
    }  

    __host__ void sub_tensor_cuda(Tensor* tensor1, Tensor* tensor2, float* result_data) {  

        int number_of_blocks = (tensor1->size + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;  
        sub_tensor_cuda_kernel<<<number_of_blocks, THREADS_PER_BLOCK>>>(tensor1->data, tensor2->data, result_data, tensor1->size);  

        cudaError_t error = cudaGetLastError();  
        if (error != cudaSuccess) {  
            printf("CUDA error: %s\n", cudaGetErrorString(error));  
            exit(-1);  
        }  

        cudaDeviceSynchronize();  
    }  

    ...
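
To see how the kernels map threads to elements, the index arithmetic can be emulated sequentially in Python. This is only a sketch of the indexing logic (the function `launch_add` is my own name, and nothing here runs on a GPU): each iteration of the inner loop plays the role of one CUDA thread.

```python
THREADS_PER_BLOCK = 128


def launch_add(data1, data2):
    """Sequentially emulate the grid of threads that the CUDA kernel runs in parallel."""
    size = len(data1)
    result = [0.0] * size
    # Ceiling division: enough blocks to cover every element
    number_of_blocks = (size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    for block_idx in range(number_of_blocks):
        for thread_idx in range(THREADS_PER_BLOCK):
            i = block_idx * THREADS_PER_BLOCK + thread_idx
            if i < size:  # threads past the end of the array do nothing
                result[i] = data1[i] + data2[i]
    return result, number_of_blocks


result, blocks = launch_add(list(range(300)), [1] * 300)
print(blocks)       # 3
print(result[299])  # 300
```

With 300 elements and 128 threads per block, 3 blocks (384 threads) are launched, so the `if i < size` guard is what keeps the last 84 threads from writing out of bounds.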

Subsequently, we include a new tensor attribute char* device in tensor.cpp, which we can use to select where operations will run (CPU or GPU):

    //norch/csrc/tensor.cpp  

    Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {  
        if (tensor1->ndim != tensor2->ndim) {  
            fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
            exit(1);  
        }  

        if (strcmp(tensor1->device, tensor2->device) != 0) {  
            fprintf(stderr, "Tensors must be on the same device: %s and %s\n", tensor1->device, tensor2->device);
            exit(1);  
        }  

        char* device = (char*)malloc(strlen(tensor1->device) + 1);  
        if (device != NULL) {  
            strcpy(device, tensor1->device);  
        } else {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(-1);  
        }  
        int ndim = tensor1->ndim;  
        int* shape = (int*)malloc(ndim * sizeof(int));  
        if (shape == NULL) {  
            fprintf(stderr, "Memory allocation failed\n");
            exit(1);  
        }  

        for (int i = 0; i < ndim; i++) {  
            if (tensor1->shape[i] != tensor2->shape[i]) {  
                fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape[i], tensor2->shape[i], i);
                exit(1);  
            }  
            shape[i] = tensor1->shape[i];  
        }          

        if (strcmp(tensor1->device, "cuda") == 0) {  

            float* result_data;  
            cudaMalloc((void **)&result_data, tensor1->size * sizeof(float));  
            add_tensor_cuda(tensor1, tensor2, result_data);  
            return create_tensor(result_data, shape, ndim, device);  
        }   
        else {  
            float* result_data = (float*)malloc(tensor1->size * sizeof(float));  
            if (result_data == NULL) {  
                fprintf(stderr, "Memory allocation failed\n");
                exit(1);  
            }  
            add_tensor_cpu(tensor1, tensor2, result_data);  
            return create_tensor(result_data, shape, ndim, device);  
        }       
    }

Now our library has GPU support!

    import norch  

    tensor1 = norch.Tensor([[1, 2, 3], [3, 2, 1]]).to("cuda")  
    tensor2 = norch.Tensor([[3, 2, 1], [1, 2, 3]]).to("cuda")  

    result = tensor1 + tensor2

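3 — Automatic Differentiation (Autograd)

One of the main reasons PyTorch became so popular is its Autograd module. It is a core component that allows automatic differentiation in order to compute gradients (crucial for training models with optimization algorithms such as gradient descent). By calling a single method, .backward(), it computes the gradients of all previous tensor operations: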

    x = torch.tensor([[1., 2, 3], [3., 2, 1]], requires_grad=True)  
    # [[1,  2,  3],  
    #  [3, 2., 1]]  

    y = torch.tensor([[3., 2, 1], [1., 2, 3]], requires_grad=True)  
    # [[3,  2, 1],  
    #  [1, 2, 3]]  

    L = ((x - y) ** 3).sum()  

    L.backward()  

    # We can access the gradients of x and y
    print(x.grad)  
    # [[12, 0, 12],  
    #  [12, 0, 12]]  

    print(y.grad)  
    # [[-12, 0, -12],  
    #  [-12, 0, -12]]  

    # In order to minimize L, we can use them for gradient descent:
    # x = x - learning_rate * x.grad  
    # y = y - learning_rate * y.grad

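In order to understand what is happening, let's try to replicate the same procedure manually:

Let's first calculate:

$$\frac{\partial L}{\partial x} = \frac{\partial}{\partial x} \sum_{i,j} (x_{ij} - y_{ij})^3$$

Note that x is a matrix, so we need to calculate the partial derivative of L with respect to each element individually. Additionally, L is a sum over all elements, but it is important to remember that, for each element, the other elements do not affect its partial derivative. Therefore, we obtain the following terms:

$$\frac{\partial L}{\partial x_{ij}} = \frac{\partial}{\partial x_{ij}} (x_{ij} - y_{ij})^3$$

By applying the chain rule to each term, we differentiate the outer function and multiply by the derivative of the inner function:

$$\frac{\partial L}{\partial x_{ij}} = 3 (x_{ij} - y_{ij})^2 \cdot \frac{\partial}{\partial x_{ij}} (x_{ij} - y_{ij})$$

Where:

$$\frac{\partial}{\partial x_{ij}} (x_{ij} - y_{ij}) = 1$$

Finally:

$$\frac{\partial L}{\partial x_{ij}} = 3 (x_{ij} - y_{ij})^2$$

Thus, we have the following final equation to calculate the derivative of L with respect to x (applied element-wise):

$$\frac{\partial L}{\partial x} = 3 (x - y)^2$$

Substituting the values into the equation:

$$\frac{\partial L}{\partial x} = 3 \left( \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} - \begin{bmatrix} 3 & 2 & 1 \\ 1 & 2 & 3 \end{bmatrix} \right)^2 = 3 \begin{bmatrix} -2 & 0 & 2 \\ 2 & 0 & -2 \end{bmatrix}^2$$

Calculating the result, we get the same values that PyTorch computed:

$$\frac{\partial L}{\partial x} = \begin{bmatrix} 12 & 0 & 12 \\ 12 & 0 & 12 \end{bmatrix}$$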
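
As a sanity check on the manual derivation, the gradient 3(x - y)^2 can be compared against central finite differences in plain Python. This is just a numerical check I am adding for illustration (the helper names are my own); it is not how Autograd computes gradients.

```python
def loss(x, y):
    """L = sum((x - y) ** 3) over all elements of two nested lists."""
    return sum((a - b) ** 3 for row_x, row_y in zip(x, y) for a, b in zip(row_x, row_y))


def numerical_grad_x(x, y, eps=1e-4):
    """Central finite differences of the loss with respect to each x[i][j]."""
    grad = [[0.0] * len(row) for row in x]
    for i in range(len(x)):
        for j in range(len(x[i])):
            original = x[i][j]
            x[i][j] = original + eps
            plus = loss(x, y)
            x[i][j] = original - eps
            minus = loss(x, y)
            x[i][j] = original  # restore the entry before moving on
            grad[i][j] = (plus - minus) / (2 * eps)
    return grad


x = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
y = [[3.0, 2.0, 1.0], [1.0, 2.0, 3.0]]

analytic = [[3 * (a - b) ** 2 for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
numeric = numerical_grad_x(x, y)

print(analytic)  # [[12.0, 0.0, 12.0], [12.0, 0.0, 12.0]]
print(all(abs(n - a) < 1e-4
          for rn, ra in zip(numeric, analytic)
          for n, a in zip(rn, ra)))  # True
```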

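Now, let's analyze what we just did:

Basically, we observed all the operations involved in reverse order: a summation, a power of 3, and a subtraction. Then, we applied the chain rule, calculating the derivative of each operation and recursively calculating the derivative of the next operation. So, first we need to implement the derivative for different math operations:

For addition: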

    # norch/autograd/functions.py  

    class AddBackward:  
        def __init__(self, x, y):  
            self.input = [x, y]  

        def backward(self, gradient):  
            return [gradient, gradient]

For sine:

    # norch/autograd/functions.py  

    class SinBackward:  
        def __init__(self, x):  
            self.input = [x]  

        def backward(self, gradient):  
            x = self.input[0]  
            return [x.cos() * gradient]

For cosine:

    # norch/autograd/functions.py  

    class CosBackward:  
        def __init__(self, x):  
            self.input = [x]  

        def backward(self, gradient):  
            x = self.input[0]  
            return [-x.sin() * gradient]

For element-wise multiplication:

    # norch/autograd/functions.py  

    class ElementwiseMulBackward:  
        def __init__(self, x, y):  
            self.input = [x, y]  

        def backward(self, gradient):  
            x = self.input[0]  
            y = self.input[1]  
            return [y * gradient, x * gradient]

For summation:

    # norch/autograd/functions.py  

    class SumBackward:  
        def __init__(self, x):  
            self.input = [x]  

        def backward(self, gradient):  
            # Since sum reduces a tensor to a scalar, the gradient is broadcast to match the original shape.
            return [float(gradient.tensor.contents.data[0]) * self.input[0].ones_like()]

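You can access the GitHub repository linked at the end of the article to explore other operations.

Now that we have derivative expressions for each operation, we can go on to implement the recursive backward chain rule. We can set a requires_grad argument on our tensors to indicate that we want to store their gradients. If it is true, each tensor operation records the backward function needed to compute its gradient. For example: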

    # norch/tensor.py  

    def __add__(self, other):  

      if self.shape != other.shape:  
          raise ValueError("Tensors must have the same shape for addition")

      Tensor._C.add_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(CTensor)]  
      Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)  

      result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)  

      result_data = Tensor()  
      result_data.tensor = result_tensor_ptr  
      result_data.shape = self.shape.copy()  
      result_data.ndim = self.ndim  
      result_data.device = self.device  

      result_data.requires_grad = self.requires_grad or other.requires_grad  
      if result_data.requires_grad:  
          result_data.grad_fn = AddBackward(self, other)

      return result_data

Then, implement the .backward() method:

    # norch/tensor.py  

    def backward(self, gradient=None):  
        if not self.requires_grad:  
            return  

        if gradient is None:  
            if self.shape == [1]:  
                gradient = Tensor([1]) # dx/dx = 1 case
            else:
                raise RuntimeError("Gradient argument must be specified for non-scalar tensors.")

        if self.grad is None:  
            self.grad = gradient  

        else:  
            self.grad += gradient  

        if self.grad_fn is not None: # not a leaf
            grads = self.grad_fn.backward(gradient) # call the operation's backward
            for tensor, grad in zip(self.grad_fn.input, grads):
                if isinstance(tensor, Tensor):
                    tensor.backward(grad) # recursively call backward on the inputs (chain rule)
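
The recursion is easier to follow on scalars. Below is a toy sketch that mirrors the same backward() logic with a hypothetical `Scalar` class (my own illustration, not part of Norch), using only subtraction and power so it can reproduce one entry of the earlier example.

```python
class Scalar:
    """A toy scalar with the same backward() recursion as the tensor method above."""

    def __init__(self, value, requires_grad=False):
        self.value = value
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = None  # (inputs, local_backward) for non-leaf nodes

    def _make(self, value, inputs, local_backward):
        out = Scalar(value, any(t.requires_grad for t in inputs))
        if out.requires_grad:
            out.grad_fn = (inputs, local_backward)
        return out

    def __sub__(self, other):
        # d(a - b)/da = 1 and d(a - b)/db = -1
        return self._make(self.value - other.value, [self, other], lambda g: [g, -g])

    def __pow__(self, power):
        # d(a ** p)/da = p * a ** (p - 1)
        return self._make(self.value ** power, [self],
                          lambda g: [g * power * self.value ** (power - 1)])

    def backward(self, gradient=1.0):
        if not self.requires_grad:
            return
        # Accumulate, as in the tensor version
        self.grad = gradient if self.grad is None else self.grad + gradient
        if self.grad_fn is not None:  # not a leaf
            inputs, local_backward = self.grad_fn
            for tensor, grad in zip(inputs, local_backward(gradient)):
                tensor.backward(grad)  # chain rule, applied recursively


x = Scalar(1.0, requires_grad=True)
y = Scalar(3.0, requires_grad=True)
loss = (x - y) ** 3
loss.backward()
print(x.grad, y.grad)  # 12.0 -12.0
```

These are the same values as the top-left entries of x.grad and y.grad in the PyTorch example.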

Finally, just implement .zero_grad() to zero the gradient of a tensor, and .detach() to remove the tensor's autograd history:

    # norch/tensor.py

    def zero_grad(self):  
        self.grad = None  

    def detach(self):  
        self.grad = None  
        self.grad_fn = None

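Congratulations! You just created a complete tensor library with GPU support and automatic differentiation! Now we can create the nn and optim modules to train some deep learning models more easily.

4 — nn and optim Modules

nn is a module for building neural networks and deep learning models, while optim relates to the optimization algorithms used to train these models. To recreate them, the first thing to do is implement a Parameter, which is simply a trainable tensor with the same operations, but with requires_grad always set to True and with some random initialization technique.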

    # norch/nn/parameter.py  

    from norch.tensor import Tensor  
    from norch.utils import utils  
    import random  

    class Parameter(Tensor):  
        """  
        A parameter is a trainable tensor.
        """  
        def __init__(self, shape):  
            data = utils.generate_random_list(shape=shape)  
            super().__init__(data, requires_grad=True)

    # norch/utils/utils.py

    import random

    def generate_random_list(shape):
        """
        Generate a list of random numbers with shape 'shape'
        [4, 2] --> [[rand1, rand2], [rand3, rand4], [rand5, rand6], [rand7, rand8]]
        """
        if len(shape) == 0:
            return []
        else:
            inner_shape = shape[1:]
            if len(inner_shape) == 0:
                return [random.uniform(-1, 1) for _ in range(shape[0])]
            else:
                return [generate_random_list(inner_shape) for _ in range(shape[0])]

Using parameters, we can start building modules:

    # norch/nn/module.py  

    from .parameter import Parameter  
    from collections import OrderedDict  
    from abc import ABC  
    import inspect  

    class Module(ABC):  
        """  
        Abstract class for modules
        """  
        def __init__(self):  
            self._modules = OrderedDict()  
            self._params = OrderedDict()  
            self._grads = OrderedDict()  
            self.training = True  

        def forward(self, *inputs, **kwargs):  
            raise NotImplementedError  

        def __call__(self, *inputs, **kwargs):  
            return self.forward(*inputs, **kwargs)  

        def train(self):  
            self.training = True  
            for param in self.parameters():  
                param.requires_grad = True  

        def eval(self):  
            self.training = False  
            for param in self.parameters():  
                param.requires_grad = False  

        def parameters(self):  
            for name, value in inspect.getmembers(self):  
                if isinstance(value, Parameter):  
                    yield self, name, value  
                elif isinstance(value, Module):  
                    yield from value.parameters()  

        def modules(self):  
            yield from self._modules.values()  

        def gradients(self):  
            for module in self.modules():  
                yield module._grads  

        def zero_grad(self):  
            for _, _, parameter in self.parameters():  
                parameter.zero_grad()  

        def to(self, device):  
            for _, _, parameter in self.parameters():  
                parameter.to(device)  

            return self  

        def inner_repr(self):  
            return ""  

        def __repr__(self):  
            string = f"{self.get_name()}("  
            tab = "   "  
            modules = self._modules  
            if modules == {}:  
                string += f'\n{tab}(parameters): {self.inner_repr()}'  
            else:  
                for key, module in modules.items():  
                    string += f"\n{tab}({key}): {module.get_name()}({module.inner_repr()})"  
            return f'{string}\n)'  

        def get_name(self):  
            return self.__class__.__name__  

        def __setattr__(self, key, value):  
            self.__dict__[key] = value  

            if isinstance(value, Module):  
                self._modules[key] = value  
            elif isinstance(value, Parameter):  
                self._params[key] = value

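For example, we can build our own custom modules by inheriting from nn.Module, or we can use some previously created modules, such as linear, which implements the y = Wx + b operation.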

    # norch/nn/modules/linear.py  

    from ..module import Module  
    from ..parameter import Parameter  

    class Linear(Module):  
        def __init__(self, input_dim, output_dim):  
            super().__init__()  
            self.input_dim = input_dim  
            self.output_dim = output_dim  
            self.weight = Parameter(shape=[self.output_dim, self.input_dim])  
            self.bias = Parameter(shape=[self.output_dim, 1])  

        def forward(self, x):  
            z = self.weight @ x + self.bias  
            return z  

        def inner_repr(self):  
            return f"input_dim={self.input_dim}, output_dim={self.output_dim}, " \  
                   f"bias={True if self.bias is not None else False}"
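
As a rough illustration of the shapes involved, the sketch below computes y = Wx + b with plain nested lists instead of norch tensors. The helper names `matmul` and `add` are assumptions made for this example only; the weight has shape [output_dim, input_dim], the input is a column vector of shape [input_dim, 1], and the bias has shape [output_dim, 1], matching the layer above.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def add(a, b):
    """Element-wise sum of two matrices with identical shapes."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# input_dim = 2, output_dim = 3
W = [[1.0, 2.0],
     [0.0, -1.0],
     [3.0, 1.0]]            # shape [3, 2]
b = [[0.5], [0.0], [-1.0]]  # shape [3, 1]
x = [[2.0], [1.0]]          # shape [2, 1]

y = add(matmul(W, x), b)
print(y)  # [[4.5], [-1.0], [6.0]]
```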

Now we can implement some loss and activation functions, for example the mean squared error loss and the sigmoid function:

    # norch/nn/loss.py  

    from .module import Module  

    class MSELoss(Module):  
        def __init__(self):  
            pass  

        def forward(self, predictions, labels):  
            assert labels.shape == predictions.shape, \
                "Labels and predictions shape does not match: {} and {}".format(labels.shape, predictions.shape)

            return ((predictions - labels) ** 2).sum() / predictions.numel  

        def __call__(self, *inputs):  
            return self.forward(*inputs)
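
To check the formula by hand, here is a small sketch of the same loss on plain Python lists, together with its gradient with respect to the predictions. The function names `mse_loss` and `mse_grad` are hypothetical; in the framework the gradient comes from autograd rather than from an explicit function.

```python
def mse_loss(predictions, labels):
    """Mean of the squared differences between predictions and labels."""
    assert len(predictions) == len(labels), "shapes must match"
    n = len(predictions)
    return sum((p - l) ** 2 for p, l in zip(predictions, labels)) / n


def mse_grad(predictions, labels):
    """d(loss)/d(prediction_i) = 2 * (prediction_i - label_i) / n."""
    n = len(predictions)
    return [2.0 * (p - l) / n for p, l in zip(predictions, labels)]


predictions = [1.0, 2.0, 4.0]
labels = [1.0, 3.0, 2.0]

loss = mse_loss(predictions, labels)   # (0 + 1 + 4) / 3
grad = mse_grad(predictions, labels)
print(round(loss, 4))  # 1.6667
```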

    # norch/nn/activation.py  

    from .module import Module  
    import math  

    class Sigmoid(Module):  
        def __init__(self):  
            super().__init__()  

        def forward(self, x):  
            return 1.0 / (1.0 + (math.e) ** (-x))
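
The forward pass above relies on the tensor's overloaded operators, so autograd can derive the backward pass from the individual operations. As a sanity check of what that gradient should be, the scalar sketch below compares the analytic derivative s(x)(1 - s(x)) with a central finite difference. The helper names are assumptions for this example only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.e ** (-x))


def sigmoid_grad(x):
    """Analytic derivative: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


x = 0.7
analytic = sigmoid_grad(x)
numeric = numeric_grad(sigmoid, x)
print(sigmoid(0.0))                    # 0.5
print(abs(analytic - numeric) < 1e-8)  # True
```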

Finally, we create the optimizers. In my example, I will implement the Stochastic Gradient Descent (SGD) algorithm:

    # norch/optim/optimizer.py  

    from abc import ABC  
    from norch.tensor import Tensor  

    class Optimizer(ABC):
        """
        Abstract class for optimizers
        """

        def __init__(self, parameters):  
            if isinstance(parameters, Tensor):
                raise TypeError("parameters should be an iterable but got {}".format(type(parameters)))
            elif isinstance(parameters, dict):  
                parameters = parameters.values()  

            self.parameters = list(parameters)  

        def step(self):  
            raise NotImplementedError  

        def zero_grad(self):  
            for module, name, parameter in self.parameters:  
                parameter.zero_grad()  

    class SGD(Optimizer):  
        def __init__(self, parameters, lr=1e-1, momentum=0):  
            super().__init__(parameters)  
            self.lr = lr  
            self.momentum = momentum  
            self._cache = {'velocity': [p.zeros_like() for (_, _, p) in self.parameters]}  

        def step(self):  
            for i, (module, name, _) in enumerate(self.parameters):  
                parameter = getattr(module, name)  

                velocity = self._cache['velocity'][i]  

                velocity = self.momentum * velocity - self.lr * parameter.grad  

                updated_parameter = parameter + velocity  

                setattr(module, name, updated_parameter)  

                self._cache['velocity'][i] = velocity  

                parameter.detach()  
                velocity.detach()
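
The update rule in `step` can be tried in isolation on a scalar. The sketch below applies the same two lines (velocity = momentum * velocity - lr * grad, then parameter + velocity) to minimize f(w) = (w - 3)^2. The function `sgd_step` and the toy objective are assumptions for illustration, not part of norch.

```python
def sgd_step(param, grad, velocity, lr=0.1, momentum=0.0):
    """One SGD-with-momentum update, mirroring the rule used in SGD.step."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity


w, v = 0.0, 0.0
for _ in range(300):
    grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
    w, v = sgd_step(w, grad, v, lr=0.1, momentum=0.9)

print(abs(w - 3.0) < 1e-3)  # True
```

With momentum set to 0 the velocity reduces to -lr * grad, which is plain gradient descent.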

And that's it! We have just created our own deep learning framework! 🥳

Let's do some training:

    import norch  
    import norch.nn as nn  
    import norch.optim as optim  
    import random  
    import math  

    random.seed(1)  

    class MyModel(nn.Module):  
        def __init__(self):  
            super(MyModel, self).__init__()  
            self.fc1 = nn.Linear(1, 10)  
            self.sigmoid = nn.Sigmoid()  
            self.fc2 = nn.Linear(10, 1)  

        def forward(self, x):  
            out = self.fc1(x)  
            out = self.sigmoid(out)  
            out = self.fc2(out)  

            return out  

    device = "cuda"  
    epochs = 10  

    model = MyModel().to(device)  
    criterion = nn.MSELoss()  
    optimizer = optim.SGD(model.parameters(), lr=0.001)  
    loss_list = []  

    x_values = [0. , 0.4, 0.8, 1.2, 1.6, 2. , 2.4, 2.8, 3.2, 3.6, 4. ,  
            4.4, 4.8, 5.2, 5.6, 6. , 6.4, 6.8, 7.2, 7.6, 8. , 8.4,  
            8.8, 9.2, 9.6, 10. , 10.4, 10.8, 11.2, 11.6, 12. , 12.4, 12.8,  
           13.2, 13.6, 14. , 14.4, 14.8, 15.2, 15.6, 16. , 16.4, 16.8, 17.2,  
           17.6, 18. , 18.4, 18.8, 19.2, 19.6, 20.]  

    y_true = []  
    for x in x_values:  
        y_true.append(math.pow(math.sin(x), 2))  

    for epoch in range(epochs):  
        for x, target in zip(x_values, y_true):  
            x = norch.Tensor([[x]]).T  
            target = norch.Tensor([[target]]).T  

            x = x.to(device)  
            target = target.to(device)  

            outputs = model(x)  
            loss = criterion(outputs, target)  

            optimizer.zero_grad()  
            loss.backward()  
            optimizer.step()  

        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss[0]:.4f}')  
        loss_list.append(loss[0])  

    # Epoch [1/10], Loss: 1.7035  
    # Epoch [2/10], Loss: 0.7193  
    # Epoch [3/10], Loss: 0.3068  
    # Epoch [4/10], Loss: 0.1742  
    # Epoch [5/10], Loss: 0.1342  
    # Epoch [6/10], Loss: 0.1232  
    # Epoch [7/10], Loss: 0.1220  
    # Epoch [8/10], Loss: 0.1241  
    # Epoch [9/10], Loss: 0.1270  
    # Epoch [10/10], Loss: 0.1297

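Image by Author

The model was successfully created and trained using our custom deep learning framework!

You can check the complete code in the project's GitHub repository, PyNorch, listed in the references.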

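Conclusion

In this post we covered what a tensor is, how it is modeled, and more advanced topics such as the basics of CUDA and autograd. We successfully created a deep learning framework with GPU support and automatic differentiation. I hope this post gave you a brief look at how PyTorch works under the hood.

In future posts, I will try to cover more advanced topics such as distributed training (multi-node / multi-GPU) and memory management. Please let me know what you think in the comments, or tell me what you would like me to write about next! Thank you so much for reading! 😊

Also, follow me here and on my LinkedIn profile to stay updated on my latest articles!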

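References

PyNorch: GitHub repository of this project.

Tutorial CUDA: A brief tutorial on how CUDA works.

PyTorch: PyTorch documentation.

MartinLwx's blog: A tutorial on strides.

Strides tutorial: Another tutorial about strides.

PyTorch internals: A guide on how PyTorch is structured.

Nets: A PyTorch recreation using NumPy.

Autograd: A live coding of an Autograd library.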
