
Your Guide to DL with MLSQL Stack (3)

用户2936994
Published on 2019-05-17 06:40:22
Column: 祝威廉

This is the third article in the Your Guide to MLSQL Stack series. We hope this series shows you how the MLSQL stack helps people do AI jobs.

As we have seen in the previous posts, the MLSQL stack gives you the power to use both the built-in algorithms and Python ML frameworks. The ability to use Python ML frameworks means you are totally free to use deep learning tools like PyTorch and TensorFlow. This time, however, we will teach you how to use the built-in DL framework, BigDL, to accomplish an image classification task.

Requirements

This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up the MLSQL stack with the following links. We recommend you deploy the MLSQL stack locally.

  1. Docker
  2. Manually Compile
  3. Prebuilt Distribution

If you meet any problem when deploying, please let me know; feel free to file an issue at this link.

Project Structure

I have created a project named store1, and there is a directory called image_classify that contains all the MLSQL scripts we will talk about today. It looks like this:

[screenshot of the project structure]

We will teach you how to build the project step by step.

Upload Image

First, download the cifar10 raw images from this URL: https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz, then gunzip it; you will end up with a tar file.

Although the MLSQL Console supports directory uploading, the huge number of files in this directory will crash the uploading component in the web page (we hope to fix this issue in the future). For now, the way to work around the crash is to package the directory as a single tar file.
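Packaging the directory can be done with the tar command line or, equivalently, with Python's standard tarfile module. A minimal sketch of the latter; the paths here are illustrative stand-ins for the extracted cifar/ directory:

```python
import os
import tarfile

# create a small stand-in directory (in practice this is the cifar/
# folder you extracted from cifar.tgz)
os.makedirs("/tmp/cifar_demo/train", exist_ok=True)
open("/tmp/cifar_demo/train/38189_frog.png", "wb").close()

# pack the whole directory into a single tar file for uploading
with tarfile.open("/tmp/cifar.tar", "w") as tar:
    tar.add("/tmp/cifar_demo", arcname="cifar_demo")

# list the archive contents to verify the packaging worked
with tarfile.open("/tmp/cifar.tar") as tar:
    print(tar.getnames())
```

The shell equivalent is simply `tar -cf cifar.tar cifar`.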

[screenshot of the upload page]

Then save the uploaded tar file to your home directory:

-- download cifar data from https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz
!fs -mkdir -p /tmp/cifar;
!saveUploadFileToHome /cifar.tar /tmp/cifar;

The console will show a real-time log indicating that the system is extracting the images.

[screenshot of the console log]

This may take a while because there are almost 60,000 pictures.

Set up some paths

We create an env.mlsql which contains the path-related variables:

set basePath="/tmp/cifar"; 
set labelMappingPath = "${basePath}/si";
set trainDataPath = "${basePath}/cifar_train_data";
set testDataPath = "${basePath}/cifar_test_data";
set modelPath = "${basePath}/bigdl";

The other scripts will include this script to get all these paths.

Resize the pictures

We want to resize the images to 28*28; you can achieve this with the ET ImageLoaderExt. Here is how we use it:

include store1.`alg.image_classify.env.mlsql`;

-- {} or {number} is used as parameter holder.
set imageResize='''
run command as ImageLoaderExt.`/tmp/cifar/cifar/{}` where
code="
    def apply(params:Map[String,String]) = {
         Resize(28, 28) ->
          MatToTensor() -> ImageFrameToSample()
      }
"
as {}
''';

-- train should be quoted because it's a keyword.
!imageResize "train" data;
!imageResize test testData;

In the above code, we need to resize both the train and test datasets. To avoid duplicate code, we wrap the resize logic as a command and then use that command to process the train and test datasets separately.
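The `{}` and `{number}` parameter holders behave like positional string substitution: bare `{}` consumes the command arguments in order, while `{0}`, `{1}` pick them by index. A rough Python sketch of the idea (this is an illustration, not the engine's actual implementation):

```python
import re

def fill(template: str, args: list) -> str:
    # bare {} consumes arguments in order; {0}, {1} pick them by position
    pos = iter(args)

    def repl(match):
        idx = match.group(1)
        return args[int(idx)] if idx else next(pos)

    return re.sub(r"\{(\d*)\}", repl, template)

print(fill("select * from {} as {}", ["train", "data"]))
# select * from train as data
```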

Extract label

For example, when we see the following path, we know that this picture contains a frog, so we should extract frog from the path.

/tmp/cifar/cifar/train/38189_frog.png

Again, we wrap the SQL as a command and process the train and test data separately.

set extractLabel='''
-- convert image path to number label
select split(split(imageName,"_")[1],"\\.")[0] as labelStr,features from {} as {}
''';

!extractLabel data newdata;
!extractLabel testData newTestData;
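The nested split in the SQL above is easier to read in plain Python; a quick sketch of the same logic, assuming imageName is the file name portion of the path:

```python
import os

def extract_label(path: str) -> str:
    # mirrors the SQL: split(split(imageName, "_")[1], "\\.")[0]
    image_name = os.path.basename(path)        # e.g. "38189_frog.png"
    return image_name.split("_")[1].split(".")[0]

print(extract_label("/tmp/cifar/cifar/train/38189_frog.png"))
# frog
```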

We will convert the label to a number and then add 1 (because BigDL requires labels to start from 1 instead of 0).

set numericLabel='''
train {0} as StringIndex.`/tmp/cifar/si` where inputCol="labelStr" and outputCol="labelIndex" as newdata1;
predict {0} as StringIndex.`/tmp/cifar/si` as newdata2;
select (cast(labelIndex as int) + 1) as label,features from newdata2 as {1}
''';

!numericLabel newdata trainData;
!numericLabel newTestData testData;
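Conceptually, the StringIndex step above builds a string-to-index mapping and the select shifts it by one. A plain-Python sketch of the idea; Spark-style string indexers order labels by descending frequency, which we imitate here, though the exact order StringIndex assigns may differ:

```python
from collections import Counter

labels = ["frog", "cat", "ship", "frog", "cat", "frog"]

# index distinct labels by descending frequency (frog=0, cat=1, ship=2)
ordered = [s for s, _ in Counter(labels).most_common()]
mapping = {s: i for i, s in enumerate(ordered)}

# BigDL needs labels starting from 1, hence the "+ 1"
numeric = [mapping[s] + 1 for s in labels]
print(numeric)
# [1, 2, 3, 1, 2, 1]
```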

Save what we have so far

We will save all this data so that we can reuse the processed data later without executing the steps above repeatedly:

save overwrite trainData as parquet.`${trainDataPath}`;
save overwrite testData as parquet.`${testDataPath}`;

Train the images with DL

We create a new script file named classify_train.mlsql. We should load the data first and convert the label to an array:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Finally, we use our algorithm to train them:

train trainData as BigDLClassifyExt.`${modelPath}` where
disableSparkLog = "true"
and fitParam.0.featureSize="[3,28,28]"
and fitParam.0.classNum="10"
and fitParam.0.maxEpoch="300"

-- print evaluate message
and fitParam.0.evaluate.trigger.everyEpoch="true"
and fitParam.0.evaluate.batchSize="1000"
and fitParam.0.evaluate.table="testData"
and fitParam.0.evaluate.methods="Loss,Top1Accuracy"
-- for unbalanced class 
-- and fitParam.0.criterion.classWeight="[......]"
and fitParam.0.code='''
                   def apply(params:Map[String,String])={
                        val model = Sequential()
                        model.add(Reshape(Array(3, 28, 28), inputShape = Shape(28, 28, 3)))
                        model.add(Convolution2D(6, 5, 5, activation = "tanh").setName("conv1_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Convolution2D(12, 5, 5, activation = "tanh").setName("conv2_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Flatten())
                        model.add(Dense(100, activation = "tanh").setName("fc1"))
                        model.add(Dense(params("classNum").toInt, activation = "softmax").setName("fc2"))
                    }
'''
;

In the code block, we use Keras-style code to build our model, and we tell the system some information, e.g., how many classes there are and what the feature size is.
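It is worth tracing how the feature size [3,28,28] flows through the layers, assuming valid padding and stride 1 for the convolutions and non-overlapping 2x2 max pooling (the Keras-style defaults). A small Python sketch of the shape arithmetic:

```python
def conv_out(size: int, kernel: int) -> int:
    # "valid" convolution with stride 1
    return size - kernel + 1

def pool_out(size: int, pool: int = 2) -> int:
    # non-overlapping max pooling
    return size // pool

h = 28
h = conv_out(h, 5)    # conv1_5x5: 28 -> 24
h = pool_out(h)       # pooling:   24 -> 12
h = conv_out(h, 5)    # conv2_5x5: 12 -> 8
h = pool_out(h)       # pooling:   8  -> 4

flatten = 12 * h * h  # 12 feature maps of 4x4 feed fc1
print(flatten)
# 192
```

So the Flatten layer hands 192 values to the first Dense layer.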

If this training stage takes too long, you can decrease fitParam.0.maxEpoch to a smaller value.
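The Top1Accuracy reported by the evaluate settings is simply the fraction of samples whose highest-probability class matches the label. A minimal sketch of that metric; remember that labels are 1-based here, and the probability values below are hypothetical:

```python
def top1_accuracy(probs: list, labels: list) -> float:
    # probs: per-sample class probability vectors
    # labels: 1-based class ids, matching the BigDL convention above
    hits = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__) + 1
        hits += predicted == y
    return hits / len(labels)

print(top1_accuracy([[0.1, 0.9], [0.8, 0.2]], [2, 2]))
# 0.5
```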

The console will print messages during training:

[screenshot of the training log]

and finally the validation result:

[screenshot of the validation result]

Use the model command to check the model's training history:

!model history /tmp/cifar/bigdl;

Here is the result:

[screenshot of the model history]

Register the model as a function

Since we have built our model, let us now learn how to predict images. First, we load some data:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Now we can register the model as a function:

register BigDLClassifyExt.`${modelPath}` as cifarPredict;

Finally, we can use the function to predict a new picture:

select
vec_argmax(cifarPredict(vec_dense(to_array_double(features)))) as predict_label,
label from testData limit 10 
as output;
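The chained functions behave like the following plain-Python sketch: vec_dense builds a vector from the feature array, cifarPredict returns the class probabilities, and vec_argmax picks the most likely class. The feature and probability values below are hypothetical placeholders, and cifar_predict is a stand-in for the registered model:

```python
features = [0.0, 0.5, 1.0]        # hypothetical feature array

def cifar_predict(vec: list) -> list:
    # stand-in for the registered model: returns class probabilities
    return [0.05, 0.70, 0.25]     # hypothetical softmax output

probs = cifar_predict(features)   # cifarPredict(vec_dense(...))
predict_label = max(range(len(probs)), key=probs.__getitem__)  # vec_argmax
print(predict_label)
# 1
```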

Of course, you can predict a table:

predict testData as BigDLClassifyExt.`${modelPath}` as predictdata;

Why BigDL

GPUs are very expensive, and normally our companies already have lots of CPUs. If we can make full use of these CPUs, it will save a lot of money.

Originally published on 2019-05-16 at the author's personal site/blog; shared via the Tencent Cloud self-media sync program.
