
Your Guide to DL with MLSQL Stack (3)

用户2936994
Published on 2019-05-17 06:40:22
Column: 祝威廉

This is the third article in the Your Guide to MLSQL Stack series. We hope this series shows you how the MLSQL stack helps people do AI jobs.

As we have seen in the previous posts, the MLSQL stack gives you the power to use both the built-in algorithms and Python ML frameworks. The ability to use Python ML frameworks means you are totally free to use deep learning tools like PyTorch and TensorFlow. This time, however, we will teach you how to use the built-in DL framework, BigDL, to accomplish an image classification task.

Requirements

This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up the MLSQL stack with the following links. We recommend you deploy the MLSQL stack locally.

  1. Docker
  2. Manually Compile
  3. Prebuilt Distribution

If you meet any problem when deploying, please let me know; feel free to file an issue at this link.

Project Structure

I have created a project named store1, and there is a directory called image_classify that contains all the MLSQL scripts we will talk about today. It looks like this:

[screenshot of the project structure]

We will teach you how to build the project step by step.

Upload Image

First, download the cifar10 raw images from this URL: https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz, then gunzip it; you will end up with a tar file.

Although the MLSQL Console supports directory uploading, the huge number of files in this directory will crash the uploading component in the web page (we hope to fix this issue in the future). For now, the way to work around the crash is to package the directory as a single tar file.
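Packaging the directory can be done with the tar command line or, equivalently, with Python's standard tarfile module. A minimal sketch of the latter; the paths here are illustrative stand-ins for the extracted cifar/ directory:

```python
import os
import tarfile

# create a small stand-in directory (in practice this is the cifar/
# folder you extracted from cifar.tgz)
os.makedirs("/tmp/cifar_demo/train", exist_ok=True)
open("/tmp/cifar_demo/train/38189_frog.png", "wb").close()

# pack the whole directory into a single tar file for uploading
with tarfile.open("/tmp/cifar.tar", "w") as tar:
    tar.add("/tmp/cifar_demo", arcname="cifar_demo")

# list the archive contents to verify the packaging worked
with tarfile.open("/tmp/cifar.tar") as tar:
    print(tar.getnames())
```

The shell equivalent is simply `tar -cf cifar.tar cifar`.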

[screenshot of the upload page]

Then save the uploaded tar file to your home directory:

-- download cifar data from https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz
!fs -mkdir -p /tmp/cifar;
!saveUploadFileToHome /cifar.tar /tmp/cifar;

The console will show a real-time log indicating that the system is extracting the images.

[screenshot of the console log]

This may take a while because there are almost 60,000 pictures.

Set up some paths

We create an env.mlsql which contains the path-related variables:

set basePath="/tmp/cifar"; 
set labelMappingPath = "${basePath}/si";
set trainDataPath = "${basePath}/cifar_train_data";
set testDataPath = "${basePath}/cifar_test_data";
set modelPath = "${basePath}/bigdl";

The other scripts will include this script to get all these paths.

Resize the pictures

We want to resize the images to 28*28; you can achieve this with the ET ImageLoaderExt. Here is how we use it:

include store1.`alg.image_classify.env.mlsql`;

-- {} or {number} is used as parameter holder.
set imageResize='''
run command as ImageLoaderExt.`/tmp/cifar/cifar/{}` where
code="
    def apply(params:Map[String,String]) = {
         Resize(28, 28) ->
          MatToTensor() -> ImageFrameToSample()
      }
"
as {}
''';

-- train should be quoted because it's a keyword.
!imageResize "train" data;
!imageResize test testData;

In the above code, we need to resize both the train and test datasets. To avoid duplicate code, we wrap the resize logic as a command and then use that command to process the train and test datasets separately.
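The `{}` and `{number}` parameter holders behave like positional string substitution: bare `{}` consumes the command arguments in order, while `{0}`, `{1}` pick them by index. A rough Python sketch of the idea (this is an illustration, not the engine's actual implementation):

```python
import re

def fill(template: str, args: list) -> str:
    # bare {} consumes arguments in order; {0}, {1} pick them by position
    pos = iter(args)

    def repl(match):
        idx = match.group(1)
        return args[int(idx)] if idx else next(pos)

    return re.sub(r"\{(\d*)\}", repl, template)

print(fill("select * from {} as {}", ["train", "data"]))
# select * from train as data
```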

Extract label

For example, when we see the following path, we know that this picture contains a frog, so we should extract frog from the path.

/tmp/cifar/cifar/train/38189_frog.png

Again, we wrap the SQL as a command and process the train and test data separately.

set extractLabel='''
-- convert image path to number label
select split(split(imageName,"_")[1],"\\.")[0] as labelStr,features from {} as {}
''';

!extractLabel data newdata;
!extractLabel testData newTestData;
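The nested split in the SQL above is easier to read in plain Python; a quick sketch of the same logic, assuming imageName is the file name portion of the path:

```python
import os

def extract_label(path: str) -> str:
    # mirrors the SQL: split(split(imageName, "_")[1], "\\.")[0]
    image_name = os.path.basename(path)        # e.g. "38189_frog.png"
    return image_name.split("_")[1].split(".")[0]

print(extract_label("/tmp/cifar/cifar/train/38189_frog.png"))
# frog
```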

We will convert the label to a number and then add 1 (because BigDL requires labels to start from 1 instead of 0).

set numericLabel='''
train {0} as StringIndex.`/tmp/cifar/si` where inputCol="labelStr" and outputCol="labelIndex" as newdata1;
predict {0} as StringIndex.`/tmp/cifar/si` as newdata2;
select (cast(labelIndex as int) + 1) as label,features from newdata2 as {1}
''';

!numericLabel newdata trainData;
!numericLabel newTestData testData;
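Conceptually, the StringIndex step above builds a string-to-index mapping and the select shifts it by one. A plain-Python sketch of the idea; Spark-style string indexers order labels by descending frequency, which we imitate here, though the exact order StringIndex assigns may differ:

```python
from collections import Counter

labels = ["frog", "cat", "ship", "frog", "cat", "frog"]

# index distinct labels by descending frequency (frog=0, cat=1, ship=2)
ordered = [s for s, _ in Counter(labels).most_common()]
mapping = {s: i for i, s in enumerate(ordered)}

# BigDL needs labels starting from 1, hence the "+ 1"
numeric = [mapping[s] + 1 for s in labels]
print(numeric)
# [1, 2, 3, 1, 2, 1]
```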

Save what we have so far

We will save all this data so that we can reuse the processed data later without executing the steps above repeatedly:

save overwrite trainData as parquet.`${trainDataPath}`;
save overwrite testData as parquet.`${testDataPath}`;

Train the images with DL

We create a new script file named classify_train.mlsql. We should load the data first and convert the label to an array:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Finally, we use our algorithm to train them:

train trainData as BigDLClassifyExt.`${modelPath}` where
disableSparkLog = "true"
and fitParam.0.featureSize="[3,28,28]"
and fitParam.0.classNum="10"
and fitParam.0.maxEpoch="300"

-- print evaluate message
and fitParam.0.evaluate.trigger.everyEpoch="true"
and fitParam.0.evaluate.batchSize="1000"
and fitParam.0.evaluate.table="testData"
and fitParam.0.evaluate.methods="Loss,Top1Accuracy"
-- for unbalanced class 
-- and fitParam.0.criterion.classWeight="[......]"
and fitParam.0.code='''
                   def apply(params:Map[String,String])={
                        val model = Sequential()
                        model.add(Reshape(Array(3, 28, 28), inputShape = Shape(28, 28, 3)))
                        model.add(Convolution2D(6, 5, 5, activation = "tanh").setName("conv1_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Convolution2D(12, 5, 5, activation = "tanh").setName("conv2_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Flatten())
                        model.add(Dense(100, activation = "tanh").setName("fc1"))
                        model.add(Dense(params("classNum").toInt, activation = "softmax").setName("fc2"))
                    }
'''
;

In the code block, we use Keras-style code to build our model, and we tell the system some information, e.g., how many classes there are and what the feature size is.
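It is worth tracing how the feature size [3,28,28] flows through the layers, assuming valid padding and stride 1 for the convolutions and non-overlapping 2x2 max pooling (the Keras-style defaults). A small Python sketch of the shape arithmetic:

```python
def conv_out(size: int, kernel: int) -> int:
    # "valid" convolution with stride 1
    return size - kernel + 1

def pool_out(size: int, pool: int = 2) -> int:
    # non-overlapping max pooling
    return size // pool

h = 28
h = conv_out(h, 5)    # conv1_5x5: 28 -> 24
h = pool_out(h)       # pooling:   24 -> 12
h = conv_out(h, 5)    # conv2_5x5: 12 -> 8
h = pool_out(h)       # pooling:   8  -> 4

flatten = 12 * h * h  # 12 feature maps of 4x4 feed fc1
print(flatten)
# 192
```

So the Flatten layer hands 192 values to the first Dense layer.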

If this training stage takes too long, you can decrease fitParam.0.maxEpoch to a smaller value.
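The Top1Accuracy reported by the evaluate settings is simply the fraction of samples whose highest-probability class matches the label. A minimal sketch of that metric; remember that labels are 1-based here, and the probability values below are hypothetical:

```python
def top1_accuracy(probs: list, labels: list) -> float:
    # probs: per-sample class probability vectors
    # labels: 1-based class ids, matching the BigDL convention above
    hits = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__) + 1
        hits += predicted == y
    return hits / len(labels)

print(top1_accuracy([[0.1, 0.9], [0.8, 0.2]], [2, 2]))
# 0.5
```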

The console will print messages during training:

[screenshot of the training log]

and finally the validation result:

[screenshot of the validation result]

Use the model command to check the model's training history:

!model history /tmp/cifar/bigdl;

Here is the result:

[screenshot of the model history]

Register the model as a function

Since we have built our model, let us now learn how to predict images. First, we load some data:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Now we can register the model as a function:

register BigDLClassifyExt.`${modelPath}` as cifarPredict;

Finally, we can use the function to predict a new picture:

select
vec_argmax(cifarPredict(vec_dense(to_array_double(features)))) as predict_label,
label from testData limit 10 
as output;
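The chained functions behave like the following plain-Python sketch: vec_dense builds a vector from the feature array, cifarPredict returns the class probabilities, and vec_argmax picks the most likely class. The feature and probability values below are hypothetical placeholders, and cifar_predict is a stand-in for the registered model:

```python
features = [0.0, 0.5, 1.0]        # hypothetical feature array

def cifar_predict(vec: list) -> list:
    # stand-in for the registered model: returns class probabilities
    return [0.05, 0.70, 0.25]     # hypothetical softmax output

probs = cifar_predict(features)   # cifarPredict(vec_dense(...))
predict_label = max(range(len(probs)), key=probs.__getitem__)  # vec_argmax
print(predict_label)
# 1
```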

Of course, you can predict a table:

predict testData as BigDLClassifyExt.`${modelPath}` as predictdata;

Why BigDL

GPUs are very expensive, and normally our companies already have lots of CPUs. If we can make full use of these CPUs, it will save a lot of money.

Originally published on 2019-05-16 at the author's personal site/blog; shared via the Tencent Cloud self-media sync program.
