Source: MetaFlow
Author: Morgan
Compiled by 机器之心
Contributors: 李亚洲, 蒋思源
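In this article, the author draws on his own experience to suggest a file and folder architecture for TensorFlow beginners. It can be very helpful when you manage your own projects.
Designing the right file architecture for machine learning is not simple. After struggling with this question on several projects, I started looking for simple patterns that would cover most of the use cases I met while reading or writing code.
In this article, I share what I found.
Disclaimer: this article is a set of suggestions rather than a definitive guide, although it has worked well for me. It is meant as a starting point for beginners and may spark some discussion. Since I had to design a file architecture for my own work anyway, I thought I could share it. If you have a better approach to file architecture, feel free to leave a comment.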
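What do we need overall?
Think about what you have to do when you work on machine learning.
When structuring files and folders, it is easy to forget some of these needs, and there may be others I have not listed. Let's look for some good practices.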
Overall folder architecture
A picture is worth a thousand words:
[Figure: folder architecture]
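Note: add a '.gitkeep' file to the results folder and add that folder to your '.gitignore' file. You probably do not want to push every experiment to GitHub, and you need to avoid the code breaking on a fresh install because the folder is missing.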
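The note above can be sketched as a small setup helper. This is only an illustration under assumptions: the folder name `results`, the function name `prepare_results_dir`, and the two ignore rules are choices made for this example, not something the original article prescribes.

```python
from pathlib import Path


def prepare_results_dir(repo_root, results_name="results"):
    """Create the results folder with a .gitkeep and ignore its contents in git."""
    root = Path(repo_root)
    results = root / results_name
    results.mkdir(parents=True, exist_ok=True)
    # The empty .gitkeep file lets git track the otherwise empty folder.
    (results / ".gitkeep").touch()

    # Ignore everything inside the folder except the .gitkeep file itself.
    rules = [f"{results_name}/*", f"!{results_name}/.gitkeep"]
    gitignore = root / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    missing = [rule for rule in rules if rule not in text.splitlines()]
    if missing:
        # Make sure the new rules start on their own line.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        gitignore.write_text(text + prefix + "\n".join(missing) + "\n")
    return results
```

Calling `prepare_results_dir(".")` from the repository root is safe to repeat: the rules are only appended when they are not already present.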
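These are the basics. You may need to add other folders, but they can all be traced back to this basic set.
With a good README and a few bash scripts as helpers, anyone who wants to use your repository can reproduce your research by following the 'Install' and 'Usage' instructions.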
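The basic model
As I said, I eventually realized that my models share patterns, and that those patterns are engineered by TF. This led me to design a very simple class that my future models can extend.
I am not a fan of class inheritance, but I am not a fan of rewriting the same piece of code over and over either. When you work on a machine learning project, your models share many similarities through the framework you use.
So I tried to find an implementation that avoids the well-known banana problem of inheritance, by keeping the inheritance hierarchy as shallow as possible.
To be completely clear, this class is meant to be the top parent of your future models, so that each model can be built in one line with one argument: the configuration.
To make this concrete, here is the commented file directly: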
import copy
import json
import os

import tensorflow as tf


class BasicAgent(object):
    # To build your model, you only need to pass a "configuration", which is a dictionary
    def __init__(self, config):
        # I like to keep the best HP found so far inside the model itself
        # This is a mechanism to load the best HP and override the configuration
        if config['best']:
            config.update(self.get_best_config(config['env_name']))

        # I make a `deepcopy` of the configuration before using it
        # to avoid any potential mutation when I iterate asynchronously over configurations
        self.config = copy.deepcopy(config)

        if config['debug']:  # This is a personal check I like to do
            print('config', self.config)

        # When working with NN, one usually initializes randomly,
        # and you want to be able to reproduce your initialization, so make sure
        # you store the random seed and actually use it in your TF graph (tf.set_random_seed() for example)
        self.random_seed = self.config['random_seed']

        # All models share some basic hyperparameters; this is the section where we
        # copy them into the model
        self.result_dir = self.config['result_dir']
        self.max_iter = self.config['max_iter']
        self.lr = self.config['lr']
        self.nb_units = self.config['nb_units']
        # etc.

        # Now the child model needs some custom parameters. To avoid any
        # inheritance hell with the __init__ function, the model
        # will override this function completely
        self.set_agent_props()

        # Again, the child model should provide its own build_graph function
        self.graph = self.build_graph(tf.Graph())

        # Any operations that should be in the graph but are common to all models
        # can be added this way, here
        with self.graph.as_default():
            self.saver = tf.train.Saver(
                max_to_keep=50,
            )

        # Add all the other common code for the initialization here
        gpu_options = tf.GPUOptions(allow_growth=True)
        sessConfig = tf.ConfigProto(gpu_options=gpu_options)
        self.sess = tf.Session(config=sessConfig, graph=self.graph)
        self.sw = tf.summary.FileWriter(self.result_dir, self.sess.graph)

        # This function is not always common to all models, that's why it's again
        # separated from the __init__ one
        self.init()

        # At the end of this function, you want your model to be ready!

    def set_agent_props(self):
        # This function is here to be overridden completely.
        # When you look at your model, you want to know exactly which custom options it needs.
        pass

    def get_best_config(self, env_name):
        # This function is here to be overridden completely.
        # It returns a dictionary used to update the initial configuration (see __init__)
        return {}

    @staticmethod
    def get_random_config(fixed_params=None):
        # Why static? Because you want to be able to pass this function to other processes
        # so they can independently generate random configurations of the current model
        raise NotImplementedError('The get_random_config function must be overridden by the agent')

    def build_graph(self, graph):
        raise NotImplementedError('The build_graph function must be overridden by the agent')

    def infer(self):
        raise NotImplementedError('The infer function must be overridden by the agent')

    def learn_from_epoch(self):
        # I like to separate the function to train per epoch and the function to train globally
        raise NotImplementedError('The learn_from_epoch function must be overridden by the agent')

    def train(self, save_every=1):
        # This function is usually common to all your models. Here is an example:
        for epoch_id in range(0, self.max_iter):
            self.learn_from_epoch()

            # If you don't want to save during training, you can just pass a negative number
            if save_every > 0 and epoch_id % save_every == 0:
                self.save()

    def save(self):
        # This function is usually common to all your models. Here is an example:
        # (self.episode_id is not set in this class; the child model is expected to define it)
        global_step_t = tf.train.get_global_step(self.graph)
        global_step, episode_id = self.sess.run([global_step_t, self.episode_id])
        if self.config['debug']:
            print('Saving to %s with global_step %d' % (self.result_dir, global_step))
        self.saver.save(self.sess, self.result_dir + '/agent-ep_' + str(episode_id), global_step)

        # Write the configuration to config.json once, if it is not there yet
        if not os.path.isfile(self.result_dir + '/config.json'):
            # Work on a shallow copy so that self.config itself is not mutated
            config = dict(self.config)
            config.pop('phi', None)
            with open(self.result_dir + '/config.json', 'w') as f:
                json.dump(config, f)

    def init(self):
        # This function is usually common to all your models,
        # but keeping it separate from __init__ allows it to be overridden cleanly.
        # This is an example of such a function
        # (self.init_op is not set in this class; the child model is expected to define it)
        checkpoint = tf.train.get_checkpoint_state(self.result_dir)
        if checkpoint is None:
            self.sess.run(self.init_op)
        else:
            if self.config['debug']:
                print('Loading the model from folder: %s' % self.result_dir)
            self.saver.restore(self.sess, checkpoint.model_checkpoint_path)
The basic model file
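The parent-class pattern above (one shallow level of inheritance, a single configuration argument, and hooks that the child overrides instead of writing its own `__init__`) can be sketched without TensorFlow. The names `BaseModel`, `TinyModel` and `set_model_props` are hypothetical and exist only for this illustration; the dictionary returned by `build_graph` merely stands in for a real graph.

```python
import copy


class BaseModel(object):
    """Framework-free stand-in for the parent class described above."""

    def __init__(self, config):
        # Copy the configuration so later changes by the caller cannot leak in.
        self.config = copy.deepcopy(config)
        self.lr = self.config["lr"]
        self.max_iter = self.config["max_iter"]
        # Hooks that every child is expected to override.
        self.set_model_props()
        self.graph = self.build_graph()

    def set_model_props(self):
        pass

    def build_graph(self):
        raise NotImplementedError("build_graph must be overridden by the child")


class TinyModel(BaseModel):  # hypothetical child model
    def set_model_props(self):
        # Child-specific options live here instead of in a custom __init__.
        self.nb_units = self.config.get("nb_units", 20)

    def build_graph(self):
        # A real model would build its TensorFlow graph here.
        return {"hidden_units": self.nb_units, "lr": self.lr}


config = {"lr": 1e-3, "max_iter": 10, "nb_units": 32}
model = TinyModel(config)  # one line, one argument
config["lr"] = 0.5  # mutating the caller's dict does not affect the model
print(model.graph, model.config["lr"])
```

Because the child never redefines `__init__`, the order of the shared setup steps stays in one place, which is the main point of the design.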
A few notes:
The __init__ script
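You can see the __init__ script in the folder structure. It has nothing to do with machine learning, but it is a simple way to make your code more readable for you and for others.
By adding a few lines, the script makes any model class directly importable from the models namespace, so anywhere in your code you can write: from models import MyModel, and the model is imported no matter how deep its folder path is.
Here is an example script that does this: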
from models.basic_model import BasicModel
from models.other_model import SomeOtherModel

__all__ = [
    "BasicModel",
    "SomeOtherModel"
]


def make_model(config):
    # Instantiate the class whose name is given in the configuration
    if config['model_name'] in __all__:
        return globals()[config['model_name']](config)
    else:
        raise Exception('The model name %s does not exist' % config['model_name'])


def get_model_class(config):
    # Return the class itself, without instantiating it
    if config['model_name'] in __all__:
        return globals()[config['model_name']]
    else:
        raise Exception('The model name %s does not exist' % config['model_name'])
It is nothing fancy, but I find this script very useful, so I included it in this article.
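The same name-to-class lookup can also be written with an explicit dictionary. The sketch below is a variation for illustration, not the author's script: it swaps `globals()` for a hypothetical `MODEL_REGISTRY` mapping, and the two classes are placeholders defined inline so the example runs on its own.

```python
class BasicModel(object):
    def __init__(self, config):
        self.config = config


class SomeOtherModel(BasicModel):
    pass


# Explicit name -> class mapping, used here in place of globals().
MODEL_REGISTRY = {cls.__name__: cls for cls in (BasicModel, SomeOtherModel)}


def get_model_class(config):
    """Return the class registered under config['model_name']."""
    name = config["model_name"]
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError("The model name %s does not exist" % name)


def make_model(config):
    """Instantiate the class registered under config['model_name']."""
    return get_model_class(config)(config)


model = make_model({"model_name": "SomeOtherModel"})
print(type(model).__name__)
```

An explicit mapping keeps the set of loadable names independent of whatever else happens to be defined in the module.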
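The shell API
We now have a consistent global folder architecture and a good base class to build our models on, and a small Python script makes our classes easy to load. Designing the 'shell API', and especially its default values, is just as important.
Because the main endpoint for interacting with your machine learning research is the shell of whatever tool you use, the shell program is the cornerstone of your experiments.
The last thing you want to do is tweak hard-coded values in your code to iterate on experiments, so you need access to all hyperparameters directly from the shell. You also need access to all the other parameters, such as the results directory or the stage (HP search/training/inferring).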
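As a framework-independent illustration of that idea, the sketch below exposes a few hyperparameters and the stage through Python's standard `argparse` module. The flag names and defaults, including `--stage`, are assumptions chosen for this example rather than an API from the article.

```python
import argparse


def parse_config(argv=None):
    """Turn command-line flags into a plain configuration dictionary."""
    parser = argparse.ArgumentParser(description="Experiment entry point (sketch)")
    parser.add_argument("--model_name", default="BasicModel")
    # Hypothetical flag selecting the stage of the experiment.
    parser.add_argument("--stage", choices=["search", "train", "infer"], default="train")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--max_iter", type=int, default=2000)
    parser.add_argument("--result_dir", default="results")
    args = parser.parse_args(argv)
    return vars(args)


# Passing a list makes the parser easy to exercise without a real shell.
config = parse_config(["--lr", "0.01", "--stage", "infer"])
print(config)
```

With sensible defaults, running the script with no flags at all still gives a complete configuration, and every experiment differs from it only by the flags typed in the shell.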
Again, to make this concrete, here is the commented file directly:
import json
import os
import random
import sys
import time

import tensorflow as tf

# See the __init__ script in the models folder
# `make_model` is a helper function to load any model you have
from models import make_model
from hpsearch import hyperband, randomsearch

# I personally always like to make my paths absolute
# to be independent from where the python binary is called
base_dir = os.path.dirname(os.path.realpath(__file__))

# I won't dig into TF interaction with the shell, feel free to explore the documentation
flags = tf.app.flags

# Hyper-parameters search configuration
flags.DEFINE_boolean('fullsearch', False, 'Perform a full search of hyperparameter space ex:(hyperband -> lr search -> hyperband with best lr)')
flags.DEFINE_boolean('dry_run', False, 'Perform a dry_run (testing purpose)')
flags.DEFINE_integer('nb_process', 4, 'Number of parallel processes to perform a HP search')

# fixed_params is a trick I use to be able to fix some parameters inside the model random function
# For example, one might want to explore different models fixing the learning rate, see the basic_model get_random_config function
flags.DEFINE_string('fixed_params', "{}", 'JSON inputs to fix some params in a HP search, ex: \'{"lr": 0.001}\'')

# Agent configuration
flags.DEFINE_string('model_name', 'DQNAgent', 'Unique name of the model')
flags.DEFINE_boolean('best', False, 'Force to use the best known configuration')
flags.DEFINE_float('initial_mean', 0., 'Initial mean for NN')
flags.DEFINE_float('initial_stddev', 1e-2, 'Initial standard deviation for NN')
flags.DEFINE_float('lr', 1e-3, 'The learning rate of SGD')
flags.DEFINE_integer('nb_units', 20, 'Number of hidden units in Deep learning agents')

# Environment configuration
flags.DEFINE_boolean('debug', False, 'Debug mode')
flags.DEFINE_integer('max_iter', 2000, 'Number of training steps')
flags.DEFINE_boolean('infer', False, 'Load an agent for playing')

# This is very important for TensorBoard:
# each model will end up in its own unique folder using the time module
# Obviously one can also choose to name the output folder
# The default is built in main(), once the flags have been parsed
flags.DEFINE_string('result_dir', '', 'Name of the directory to store/log the model (if it exists, the model will be loaded from it)')

# Another important point: you must provide access to the random seed
# to be able to fully reproduce an experiment
flags.DEFINE_integer('random_seed', random.randint(0, sys.maxsize), 'Value of random seed')


def main(_):
    # On TensorFlow releases whose flags are backed by absl,
    # flags.FLAGS.flag_values_dict() is the public way to get this dictionary
    config = flags.FLAGS.__flags.copy()
    if not config['result_dir']:
        config['result_dir'] = base_dir + '/results/' + config['model_name'] + '/' + str(int(time.time()))
    # fixed_params must be a string to be passed in the shell, let's use JSON
    config["fixed_params"] = json.loads(config["fixed_params"])

    if config['fullsearch']:
        # Some code for HP search ...
        pass
    else:
        model = make_model(config)

        if config['infer']:
            # Some code for inference ...
            pass
        else:
            # Some code for training ...
            pass


if __name__ == '__main__':
    tf.app.run()
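The fixed_params trick mentioned in the comments can be illustrated with a standalone random-configuration function. This is a minimal sketch under assumptions: the sampled ranges, the two hyperparameters, and the optional `rng` argument are invented for the example and are not taken from the author's models.

```python
import json
import random


def get_random_config(fixed_params=None, rng=random):
    """Draw a random configuration, then force the fixed values on top."""
    config = {
        # Log-uniform learning rate (hypothetical range).
        "lr": 10 ** rng.uniform(-5, -1),
        # Hidden layer size picked from a hypothetical list.
        "nb_units": rng.choice([16, 32, 64, 128]),
    }
    # Anything fixed from the shell overrides the random draw.
    config.update(fixed_params or {})
    return config


# The shell passes fixed_params as a JSON string, so it is decoded first.
fixed = json.loads('{"lr": 0.001}')
samples = [get_random_config(fixed) for _ in range(3)]
print(samples)
```

Every sample keeps the fixed learning rate while the other hyperparameters still vary, which is what makes it possible to search over architectures at a constant learning rate.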
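That is what this article set out to describe. We hope it helps newcomers with their research, and comments and questions are welcome.
At the end of the original article, the author also lists a number of other TensorFlow articles; interested readers can find them in the English original.
Original article: https://blog.metaflow.fr/tensorflow-a-proposal-of-good-practices-for-files-folders-and-models-architecture-f23171501ae3
This article was compiled by 机器之心; please contact this official account for permission to republish.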