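Apple's M1: All Show, No Substance? Claim That the "Most Powerful Chip on Earth" Is Only Good for Video Editing Sparks Heated Debate on Zhihu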

新智元 (Xinzhiyuan)

Published 2021-10-22 11:15:27

Collected in the column: 新智元

Xinzhiyuan report

Source: Internet

Editor: 好困小咸鱼

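[Xinzhiyuan Digest] A 5nm process, 57 billion transistors, a 70% CPU performance gain, and a 4x GPU performance gain. Is the M1 Max, billed as the most powerful chip ever, really only good for "a bit of video editing"?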

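Apple recently held a launch event for its new chips.

On paper, the M1 Pro and M1 Max are seriously impressive.

The M1 Pro has a die area of 245 mm² and packs 33.7 billion transistors, more than twice as many as the M1.

The M1 Max goes further: 57 billion transistors, 70% more than the Pro, on a die of 432 mm².

Both the M1 Pro and the M1 Max combine performance and efficiency cores, with up to 10 cores in total (eight high-performance cores and two high-efficiency cores), and deliver 70% more CPU performance than the previous-generation M1.

On the GPU side, the M1 Pro has up to 16 cores and offers up to twice the GPU performance of the M1.

The M1 Max pushes the GPU core count all the way to 32, for a formidable 10.4 TFLOPS of compute, up to four times the GPU performance of the M1.

10 TFLOPS: doesn't that number ring a bell?

Readers who follow GPU performance may already have made the connection: the reference figure that Nvidia, the company whose graphics cards are perpetually out of stock, gives for the RTX 2080 is about the same.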

Chip	M1	M1 Pro	M1 Pro	M1 Max	M1 Max
GPU cores	8	14	16	24	32
TFLOPS	2.6	4.5	5.2	7.8	10.4
Comparable AMD GPU	RX 560 (2.6 TF)	RX 5500M (4.6 TF)	RX 5500 (5.2 TF)	RX 5700M (7.9 TF)	RX Vega 56 (10.5 TF)
Comparable Nvidia GPU	GTX 1650 (2.9 TF)	GTX 1650 Super (4.4 TF) / RTX 3050 75W (4.4 TF)	GTX 1660 Ti (5.4 TF)	RTX 2070 (7.4 TF)	RTX 2080 (10 TF) / RTX 3060 80W (10.94 TF)
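
The TFLOPS row grows almost exactly in step with the GPU core count. The following is a minimal sketch of that relationship, assuming throughput is simply proportional to the number of cores (clock speed, memory bandwidth, and thermals are ignored) and taking the 8-core M1 figure from the table as the baseline. The helper name `estimate_tflops` is hypothetical.

```python
# Baseline taken from the table above: the 8-core M1 GPU at 2.6 TFLOPS.
M1_GPU_CORES = 8
M1_TFLOPS = 2.6

def estimate_tflops(gpu_cores):
    """Estimate GPU throughput, assuming it scales linearly with core count."""
    return M1_TFLOPS / M1_GPU_CORES * gpu_cores

# Core counts of the five configurations listed in the table.
for cores in (8, 14, 16, 24, 32):
    print(f"{cores:2d} cores: ~{estimate_tflops(cores):.2f} TFLOPS")
```

Under this assumption every estimate lands within 0.1 TFLOPS of the figure in the table, which suggests the published numbers are peak values derived from core count rather than measured results.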

With deep learning as hot as it is, why not pit the M1 family against the RTX 2080?

M1 vs. RTX 2080 Ti

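When it comes to deep learning frameworks, the conversation is mostly about TensorFlow and PyTorch.

In practice, however, GPU acceleration in both has long meant CUDA on Nvidia GPUs, leaving Apple users to train slowly on the CPU.

After Apple launched the first M1 Macs in November 2020, though, a TensorFlow 2.4 build soon followed that supports training neural networks on the M1's GPU.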

https://machinelearning.apple.com/updates/ml-compute-training-on-mac

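"The tensorflow_macos fork of TensorFlow 2.4 uses ML Compute so that machine learning libraries can take full advantage of not only the CPU but also the GPU in both M1- and Intel-powered Macs, for dramatically faster training performance."

That sounds great, but the only way to know how it holds up is to try it.

Since the new MacBook Pro models with the M1 Pro and M1 Max are not yet on sale, their little sibling, the M1, will stand in for them. The M1's GPU peaks at 2.6 TFLOPS, roughly a quarter of what a discrete Nvidia RTX 2080 offers.

Start by training a small three-layer fully connected network on the Fashion-MNIST dataset.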

#import libraries
import tensorflow as tf
import time

#download fashion mnist dataset
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

train_set_count = len(train_labels)
test_set_count = len(test_labels)

#setup start time
t0 = time.time()

#normalize images
train_images = train_images / 255.0
test_images = test_images / 255.0

#create ML model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

#compile ML model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

#train ML model
model.fit(train_images, train_labels, epochs=10)

#evaluate ML model on test set
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

#setup stop time
t1 = time.time()
total_time = t1-t0

#print results
print('\n')
print(f'Training set contained {train_set_count} images')
print(f'Testing set contained {test_set_count} images')
print(f'Model achieved {test_acc:.2f} testing accuracy')
print(f'Training and testing took {total_time:.2f} seconds')
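
The script above measures normalization, training, and evaluation as a single interval, so the reported total does not show where the time goes. The following is a minimal sketch of one way to time each phase separately, using only the standard library. The `timed` helper and the `phase_seconds` dictionary are hypothetical names, and the list comprehension is a stand-in workload rather than the real training code.

```python
import time
from contextlib import contextmanager

phase_seconds = {}  # hypothetical registry: phase name -> elapsed seconds

@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[phase] = time.perf_counter() - start

# Stand-in workload; in the benchmark this block would wrap a single step
# such as the normalization, model.fit(...), or model.evaluate(...).
with timed("normalize"):
    scaled = [pixel / 255.0 for pixel in range(256)]

for phase, seconds in phase_seconds.items():
    print(f"{phase}: {seconds:.6f} s")
```

Wrapping each step in its own `with timed(...)` block would make it possible to compare training time alone across the two machines.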

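For the first run, the code above is executed on a Linux machine with an Intel i7-9700K, 32 GB of RAM, and a discrete Nvidia RTX 2080 Ti.

The result comes back quickly: training and testing took 7.78 seconds.

Next, the same model is trained on a Mac Mini with an M1 (8 CPU cores, 8 GPU cores, 16 Neural Engine cores) and 8 GB of RAM.

The result is striking.

Training and testing took only 6.70 seconds, about 14% less time than on the RTX 2080 Ti. That is impressive.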
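
The 14% figure can be checked from the two timings. A small sketch of the arithmetic follows; the variable names are illustrative, and the two input values are the ones reported above.

```python
gpu_seconds = 7.78  # RTX 2080 Ti run reported above
m1_seconds = 6.70   # M1 Mac Mini run reported above

# Fraction of wall-clock time saved relative to the RTX 2080 Ti run.
time_saved = (gpu_seconds - m1_seconds) / gpu_seconds

# Equivalent speedup factor (how many times faster the M1 run finished).
speedup = gpu_seconds / m1_seconds

print(f"Time saved: {time_saved:.1%}")
print(f"Speedup: {speedup:.2f}x")
```

The M1 run takes 13.9% less time, which matches the rounded 14%. Expressed as a speedup factor, the same result is about 1.16x.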

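To be fair, Fashion-MNIST classification is a rather trivial task. What about training a more powerful model on a larger dataset?

So let's give both machines a harder job: training the widely used ResNet50 classifier on the CIFAR-10 dataset, once on the M1 and once on the RTX 2080 Ti.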

#import libraries
import tensorflow as tf
from time import perf_counter

#download cifar10 dataset
cifar10 = tf.keras.datasets.cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

train_set_count = len(train_labels)
test_set_count = len(test_labels)

#setup start time
t1_start = perf_counter()

#normalize images
train_images = train_images / 255.0
test_images = test_images / 255.0

#create ML model using tensorflow provided ResNet50 model, note the [32, 32, 3] shape because that's the shape of cifar
model = tf.keras.applications.ResNet50(
    include_top=True, weights=None, input_tensor=None,
    input_shape=(32, 32, 3), pooling=None, classes=10
)

# CIFAR-10 labels are one integer per image (between 0 and 9)
# Categorical cross-entropy expects a one-hot encoded version, e.g. [0.0, 0.0, 1.0, 0.0, 0.0...]
train_labels = tf.one_hot(train_labels.reshape(-1), depth=10, axis=-1)

# Do the same thing for the test labels
test_labels = tf.one_hot(test_labels.reshape(-1), depth=10, axis=-1)

#compile ML model, use the non-sparse loss here because the labels are now one-hot encoded
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

#train ML model
model.fit(train_images,  train_labels, epochs=10)

#evaluate ML model on test set
test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

#setup stop time
t1_stop = perf_counter()
total_time = t1_stop-t1_start

#print results
print('\n')
print(f'Training set contained {train_set_count} images')
print(f'Testing set contained {test_set_count} images')
print(f'Model achieved {test_acc:.2f} testing accuracy')
print(f'Training and testing took {total_time:.2f} seconds')
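
The first script feeds integer labels to a sparse cross-entropy loss, while this one converts the labels to one-hot vectors and uses the non-sparse loss. Both describe the same quantity. The following is a plain-Python sketch of why, written without TensorFlow; the function names are illustrative and are not TensorFlow APIs, and the logits are made-up values for a 3-class example.

```python
import math

def one_hot(label, depth):
    """Encode an integer class label as a one-hot list of floats."""
    return [1.0 if i == label else 0.0 for i in range(depth)]

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_label, probs):
    """Cross-entropy against a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy against an integer class index."""
    return -math.log(probs[label])

probs = softmax([2.0, 0.5, -1.0])  # made-up logits for illustration
label = 0
loss_dense = categorical_crossentropy(one_hot(label, 3), probs)
loss_sparse = sparse_categorical_crossentropy(label, probs)
print(loss_dense, loss_sparse)
```

Because every entry of the one-hot vector except the true class is zero, the sum collapses to the single term used by the sparse version. The two scripts therefore differ in label format, not in what they optimize.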

The test begins again. On the RTX 2080 Ti, the new code performs very well.