首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >专栏 >数据库虚拟索引工具videx简单使用

数据库虚拟索引工具videx简单使用

原创
作者头像
保持热爱奔赴山海
发布于 2025-05-14 10:15:48
发布于 2025-05-14 10:15:48
1270
举报
文章被收录于专栏:数据库相关数据库相关

videx是字节最近开源的MySQL虚拟索引工具。项目地址 https://github.com/bytedance/videx

具体的可以看官方文档。

这里只列出实操演示过程。

0、演示环境

代码语言:txt
AI代码解释
复制
系统版本: centos7 
IP: 192.168.31.181

1、启动videx-server

代码语言:txt
AI代码解释
复制
docker pull kangrongme/videx:latest       
docker run -d -p 13308:13308 -p 5001:5001 --name videx kangrongme/videx:latest       

2、克隆代码

# 克隆代码

代码语言:txt
AI代码解释
复制
cd /root/
VIDEX_HOME=videx_server
git clone https://github.com/bytedance/videx.git $VIDEX_HOME
cd $VIDEX_HOME

# 创建python虚拟环境

代码语言:txt
AI代码解释
复制
conda create -n videx_py39 python=3.9
conda activate videx_py39

# 安装扩展

代码语言:txt
AI代码解释
复制
python3.9 -m pip install -e . --use-pep517

3、连接到真实MySQL 初始化测试数据集

代码语言:txt
AI代码解释
复制
cd $VIDEX_HOME
cd data/tpch_tiny
tar xf tpch_tiny.sql.tar.gz

mysql -h192.168.31.181 -P3306 -udts -p123456 -e "create database tpch_tiny;"
mysql -h192.168.31.181 -P3306 -udts -p123456 -Dtpch_tiny < tpch_tiny.sql

4、收集并导入 VIDEX 元数据

代码语言:txt
AI代码解释
复制
cd $VIDEX_HOME

-- 这里的videx用的是docker镜像,导入到videx_tpch_tiny这个库里(会自动创建),默认的账号密码是videx和password
python src/sub_platforms/sql_opt/videx/scripts/videx_build_env.py --target 192.168.31.181:3306:tpch_tiny:dts:123456 --videx 192.168.31.181:13308:videx_tpch_tiny:videx:password

日志类似如下:

代码语言:txt
AI代码解释
复制
/usr/local/miniconda3/envs/videx/lib/python3.11/site-packages/pydantic/_internal/_fields.py:192: UserWarning: Field name "schema" in "MySQLConnectionConfig" shadows an attribute in parent "BaseModel"
  warnings.warn(
logging config:  {'version': 1, 'formatters': {'default': {'format': '%(asctime)s [%(process)d:%(thread)d] %(levelname)-8s %(name)-15s [%(filename)s:%(lineno)d] %(_videx_trace_id)s %(message)s', 'datefmt': '%Y-%m-%d %H:%M:%S'}}, 'filters': {'videx_trace_filter': {'()': <class 'sub_platforms.sql_opt.videx.videx_logging.VidexTraceIdFilter'>}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'default', 'level': 'INFO', 'stream': 'ext://sys.stdout', 'filters': ['videx_trace_filter']}, 'info_log_file': {'class': 'logging.handlers.RotatingFileHandler', 'formatter': 'default', 'filename': './log/videx_app_info.log', 'maxBytes': 100000000, 'backupCount': 5, 'filters': ['videx_trace_filter']}, 'error_log_file': {'class': 'logging.handlers.RotatingFileHandler', 'formatter': 'default', 'filename': './log/videx_app_error.log', 'maxBytes': 100000000, 'backupCount': 5, 'filters': ['videx_trace_filter']}}, 'root': {'level': 'NOTSET', 'handlers': ['console', 'info_log_file'], 'propagate': False}, 'loggers': {'error_logger': {'level': 'ERROR', 'handlers': ['error_log_file'], 'propagate': True}}}
2025-05-08 14:05:02 [55657:140020204398400] INFO     root            [videx_build_env.py:145] - metadata file is videx_metadata_tpch_tiny.json
2025-05-08 14:05:02 [55657:140020204398400] INFO     root            [videx_metadata.py:594] - fetch_all_meta_for_videx. target_db='tpch_tiny' result_dir='temp_meta_1746684302' n_buckets=16 hist_force=True hist_mem_size=200000000 all_table_names=None
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:617] - fetch meta for videx: ndv_single_file = 'videx_tpch_tiny_ndv_single.json' not found in result_dir = 'temp_meta_1746684302', or exist ndv single is not enough.fetch it: sorted(ndv_single_dict.keys()) = [] miss_ndv_tables=['customer', 'lineitem', 'nation', 'orders', 'part', 'partsupp', 'region', 'sbtest1', 'sbtest2', 'supplier']
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_ACCTBAL [0/8]: SELECT COUNT(DISTINCT C_ACCTBAL) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_ADDRESS [1/8]: SELECT COUNT(DISTINCT C_ADDRESS) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_COMMENT [2/8]: SELECT COUNT(DISTINCT C_COMMENT) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_CUSTKEY [3/8]: SELECT COUNT(DISTINCT C_CUSTKEY) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_MKTSEGMENT [4/8]: SELECT COUNT(DISTINCT C_MKTSEGMENT) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_ORDERKEY [4/9]: SELECT COUNT(DISTINCT O_ORDERKEY) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_ORDERPRIORITY [5/9]: SELECT COUNT(DISTINCT O_ORDERPRIORITY) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_ORDERSTATUS [6/9]: SELECT COUNT(DISTINCT O_ORDERSTATUS) FROM `tpch_tiny`.`orders`;

2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_NAME [4/9]: SELECT COUNT(DISTINCT P_NAME) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_SUPPKEY [6/7]: SELECT COUNT(DISTINCT S_SUPPKEY) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:639] - fetch meta for videx: hist_file='videx_tpch_tiny_histogram.json' not found in result_dir='temp_meta_1746684302', or exist hist is not enough.fetch it. sorted(hist_dict.keys())=[] miss_hist_tables=['customer', 'lineitem', 'nation', 'orders', 'part', 'partsupp', 'region', 'sbtest1', 'sbtest2', 'supplier']
2025-05-08 14:05:05 [55657:140020204398400] WARNING  root            [db_variable.py:79] - version not updated, return empty str
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_ACCTBAL` with 16 n_buckets
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_ADDRESS` with 16 n_buckets
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_COMMENT` with 16 n_buckets
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_CUSTKEY` with 16 n_buckets
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:765] - Column is covered single uk, force generate: dbname='tpch_tiny', table_name='customer', col_name='C_CUSTKEY'
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_MKTSEGMENT` with 16 n_buckets
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_NAME` with 16 n_buckets

/usr/local/miniconda3/envs/videx/lib/python3.11/site-packages/pydantic/_internal/_fields.py:192: UserWarning: Field name "schema" in "MySQLConnectionConfig" shadows an attribute in parent "BaseModel"
  warnings.warn(
logging config:  {'version': 1, 'formatters': {'default': {'format': '%(asctime)s [%(process)d:%(thread)d] %(levelname)-8s %(name)-15s [%(filename)s:%(lineno)d] %(_videx_trace_id)s %(message)s', 'datefmt': '%Y-%m-%d %H:%M:%S'}}, 'filters': {'videx_trace_filter': {'()': <class 'sub_platforms.sql_opt.videx.videx_logging.VidexTraceIdFilter'>}}, 'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'default', 'level': 'INFO', 'stream': 'ext://sys.stdout', 'filters': ['videx_trace_filter']}, 'info_log_file': {'class': 'logging.handlers.RotatingFileHandler', 'formatter': 'default', 'filename': './log/videx_app_info.log', 'maxBytes': 100000000, 'backupCount': 5, 'filters': ['videx_trace_filter']}, 'error_log_file': {'class': 'logging.handlers.RotatingFileHandler', 'formatter': 'default', 'filename': './log/videx_app_error.log', 'maxBytes': 100000000, 'backupCount': 5, 'filters': ['videx_trace_filter']}}, 'root': {'level': 'NOTSET', 'handlers': ['console', 'info_log_file'], 'propagate': False}, 'loggers': {'error_logger': {'level': 'ERROR', 'handlers': ['error_log_file'], 'propagate': True}}}
2025-05-08 14:05:02 [55657:140020204398400] INFO     root            [videx_build_env.py:145] - metadata file is videx_metadata_tpch_tiny.json
2025-05-08 14:05:02 [55657:140020204398400] INFO     root            [videx_metadata.py:594] - fetch_all_meta_for_videx. target_db='tpch_tiny' result_dir='temp_meta_1746684302' n_buckets=16 hist_force=True hist_mem_size=200000000 all_table_names=None
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:617] - fetch meta for videx: ndv_single_file = 'videx_tpch_tiny_ndv_single.json' not found in result_dir = 'temp_meta_1746684302', or exist ndv single is not enough.fetch it: sorted(ndv_single_dict.keys()) = [] miss_ndv_tables=['customer', 'lineitem', 'nation', 'orders', 'part', 'partsupp', 'region', 'sbtest1', 'sbtest2', 'supplier']
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_ACCTBAL [0/8]: SELECT COUNT(DISTINCT C_ACCTBAL) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_ADDRESS [1/8]: SELECT COUNT(DISTINCT C_ADDRESS) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_COMMENT [2/8]: SELECT COUNT(DISTINCT C_COMMENT) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_CUSTKEY [3/8]: SELECT COUNT(DISTINCT C_CUSTKEY) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_MKTSEGMENT [4/8]: SELECT COUNT(DISTINCT C_MKTSEGMENT) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_NAME [5/8]: SELECT COUNT(DISTINCT C_NAME) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_NATIONKEY [6/8]: SELECT COUNT(DISTINCT C_NATIONKEY) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for customer.tpch_tiny.customer.C_PHONE [7/8]: SELECT COUNT(DISTINCT C_PHONE) FROM `tpch_tiny`.`customer`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_COMMENT [0/17]: SELECT COUNT(DISTINCT L_COMMENT) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_COMMITDATE [1/17]: SELECT COUNT(DISTINCT L_COMMITDATE) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_DISCOUNT [2/17]: SELECT COUNT(DISTINCT L_DISCOUNT) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_EXTENDEDPRICE [3/17]: SELECT COUNT(DISTINCT L_EXTENDEDPRICE) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_ID [4/17]: SELECT COUNT(DISTINCT L_ID) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_LINENUMBER [5/17]: SELECT COUNT(DISTINCT L_LINENUMBER) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_LINESTATUS [6/17]: SELECT COUNT(DISTINCT L_LINESTATUS) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_ORDERKEY [7/17]: SELECT COUNT(DISTINCT L_ORDERKEY) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_PARTKEY [8/17]: SELECT COUNT(DISTINCT L_PARTKEY) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_QUANTITY [9/17]: SELECT COUNT(DISTINCT L_QUANTITY) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_RECEIPTDATE [10/17]: SELECT COUNT(DISTINCT L_RECEIPTDATE) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_RETURNFLAG [11/17]: SELECT COUNT(DISTINCT L_RETURNFLAG) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_SHIPDATE [12/17]: SELECT COUNT(DISTINCT L_SHIPDATE) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:04 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_SHIPINSTRUCT [13/17]: SELECT COUNT(DISTINCT L_SHIPINSTRUCT) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_SHIPMODE [14/17]: SELECT COUNT(DISTINCT L_SHIPMODE) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_SUPPKEY [15/17]: SELECT COUNT(DISTINCT L_SUPPKEY) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for lineitem.tpch_tiny.lineitem.L_TAX [16/17]: SELECT COUNT(DISTINCT L_TAX) FROM `tpch_tiny`.`lineitem`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for nation.tpch_tiny.nation.N_COMMENT [0/4]: SELECT COUNT(DISTINCT N_COMMENT) FROM `tpch_tiny`.`nation`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for nation.tpch_tiny.nation.N_NAME [1/4]: SELECT COUNT(DISTINCT N_NAME) FROM `tpch_tiny`.`nation`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for nation.tpch_tiny.nation.N_NATIONKEY [2/4]: SELECT COUNT(DISTINCT N_NATIONKEY) FROM `tpch_tiny`.`nation`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for nation.tpch_tiny.nation.N_REGIONKEY [3/4]: SELECT COUNT(DISTINCT N_REGIONKEY) FROM `tpch_tiny`.`nation`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_CLERK [0/9]: SELECT COUNT(DISTINCT O_CLERK) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_COMMENT [1/9]: SELECT COUNT(DISTINCT O_COMMENT) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_CUSTKEY [2/9]: SELECT COUNT(DISTINCT O_CUSTKEY) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_ORDERDATE [3/9]: SELECT COUNT(DISTINCT O_ORDERDATE) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_ORDERKEY [4/9]: SELECT COUNT(DISTINCT O_ORDERKEY) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_ORDERPRIORITY [5/9]: SELECT COUNT(DISTINCT O_ORDERPRIORITY) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_ORDERSTATUS [6/9]: SELECT COUNT(DISTINCT O_ORDERSTATUS) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_SHIPPRIORITY [7/9]: SELECT COUNT(DISTINCT O_SHIPPRIORITY) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for orders.tpch_tiny.orders.O_TOTALPRICE [8/9]: SELECT COUNT(DISTINCT O_TOTALPRICE) FROM `tpch_tiny`.`orders`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_BRAND [0/9]: SELECT COUNT(DISTINCT P_BRAND) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_COMMENT [1/9]: SELECT COUNT(DISTINCT P_COMMENT) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_CONTAINER [2/9]: SELECT COUNT(DISTINCT P_CONTAINER) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_MFGR [3/9]: SELECT COUNT(DISTINCT P_MFGR) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_NAME [4/9]: SELECT COUNT(DISTINCT P_NAME) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_PARTKEY [5/9]: SELECT COUNT(DISTINCT P_PARTKEY) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_RETAILPRICE [6/9]: SELECT COUNT(DISTINCT P_RETAILPRICE) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_SIZE [7/9]: SELECT COUNT(DISTINCT P_SIZE) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for part.tpch_tiny.part.P_TYPE [8/9]: SELECT COUNT(DISTINCT P_TYPE) FROM `tpch_tiny`.`part`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for partsupp.tpch_tiny.partsupp.PS_AVAILQTY [0/6]: SELECT COUNT(DISTINCT PS_AVAILQTY) FROM `tpch_tiny`.`partsupp`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for partsupp.tpch_tiny.partsupp.PS_COMMENT [1/6]: SELECT COUNT(DISTINCT PS_COMMENT) FROM `tpch_tiny`.`partsupp`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for partsupp.tpch_tiny.partsupp.PS_ID [2/6]: SELECT COUNT(DISTINCT PS_ID) FROM `tpch_tiny`.`partsupp`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for partsupp.tpch_tiny.partsupp.PS_PARTKEY [3/6]: SELECT COUNT(DISTINCT PS_PARTKEY) FROM `tpch_tiny`.`partsupp`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for partsupp.tpch_tiny.partsupp.PS_SUPPKEY [4/6]: SELECT COUNT(DISTINCT PS_SUPPKEY) FROM `tpch_tiny`.`partsupp`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for partsupp.tpch_tiny.partsupp.PS_SUPPLYCOST [5/6]: SELECT COUNT(DISTINCT PS_SUPPLYCOST) FROM `tpch_tiny`.`partsupp`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for region.tpch_tiny.region.R_COMMENT [0/3]: SELECT COUNT(DISTINCT R_COMMENT) FROM `tpch_tiny`.`region`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for region.tpch_tiny.region.R_NAME [1/3]: SELECT COUNT(DISTINCT R_NAME) FROM `tpch_tiny`.`region`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for region.tpch_tiny.region.R_REGIONKEY [2/3]: SELECT COUNT(DISTINCT R_REGIONKEY) FROM `tpch_tiny`.`region`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest1.tpch_tiny.sbtest1.c [0/4]: SELECT COUNT(DISTINCT c) FROM `tpch_tiny`.`sbtest1`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest1.tpch_tiny.sbtest1.id [1/4]: SELECT COUNT(DISTINCT id) FROM `tpch_tiny`.`sbtest1`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest1.tpch_tiny.sbtest1.k [2/4]: SELECT COUNT(DISTINCT k) FROM `tpch_tiny`.`sbtest1`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest1.tpch_tiny.sbtest1.pad [3/4]: SELECT COUNT(DISTINCT pad) FROM `tpch_tiny`.`sbtest1`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest2.tpch_tiny.sbtest2.aaa [0/5]: SELECT COUNT(DISTINCT aaa) FROM `tpch_tiny`.`sbtest2`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest2.tpch_tiny.sbtest2.c [1/5]: SELECT COUNT(DISTINCT c) FROM `tpch_tiny`.`sbtest2`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest2.tpch_tiny.sbtest2.id [2/5]: SELECT COUNT(DISTINCT id) FROM `tpch_tiny`.`sbtest2`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest2.tpch_tiny.sbtest2.k [3/5]: SELECT COUNT(DISTINCT k) FROM `tpch_tiny`.`sbtest2`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for sbtest2.tpch_tiny.sbtest2.pad [4/5]: SELECT COUNT(DISTINCT pad) FROM `tpch_tiny`.`sbtest2`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_ACCTBAL [0/7]: SELECT COUNT(DISTINCT S_ACCTBAL) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_ADDRESS [1/7]: SELECT COUNT(DISTINCT S_ADDRESS) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_COMMENT [2/7]: SELECT COUNT(DISTINCT S_COMMENT) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_NAME [3/7]: SELECT COUNT(DISTINCT S_NAME) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_NATIONKEY [4/7]: SELECT COUNT(DISTINCT S_NATIONKEY) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_PHONE [5/7]: SELECT COUNT(DISTINCT S_PHONE) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:406] - Fetch NDV for supplier.tpch_tiny.supplier.S_SUPPKEY [6/7]: SELECT COUNT(DISTINCT S_SUPPKEY) FROM `tpch_tiny`.`supplier`;
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_metadata.py:639] - fetch meta for videx: hist_file='videx_tpch_tiny_histogram.json' not found in result_dir='temp_meta_1746684302', or exist hist is not enough.fetch it. sorted(hist_dict.keys())=[] miss_hist_tables=['customer', 'lineitem', 'nation', 'orders', 'part', 'partsupp', 'region', 'sbtest1', 'sbtest2', 'supplier']
2025-05-08 14:05:05 [55657:140020204398400] WARNING  root            [db_variable.py:79] - version not updated, return empty str
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_ACCTBAL` with 16 n_buckets
2025-05-08 14:05:05 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_ADDRESS` with 16 n_buckets
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_COMMENT` with 16 n_buckets
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_CUSTKEY` with 16 n_buckets
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:765] - Column is covered single uk, force generate: dbname='tpch_tiny', table_name='customer', col_name='C_CUSTKEY'
2025-05-08 14:05:06 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`customer`.`C_MKTSEGMENT` with 16 n_buckets
2025-05-08 14:05:23 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`lineitem`.`L_EXTENDEDPRICE` with 16 n_buckets
2025-05-08 14:05:28 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`lineitem`.`L_ID` with 16 n_buckets
2025-05-08 14:05:28 [55657:140020204398400] INFO     root            [videx_histogram.py:765] - Column is covered single uk, force generate: dbname='tpch_tiny', table_name='lineitem', col_name='L_ID'
2025-05-08 14:05:55 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`lineitem`.`L_RECEIPTDATE` with 16 n_buckets
2025-05-08 14:06:00 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`lineitem`.`L_RETURNFLAG` with 16 n_buckets
2025-05-08 14:06:05 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`lineitem`.`L_SHIPDATE` with 16 n_buckets
2025-05-08 14:06:11 [55657:140020204398400] INFO     root            [videx_histogram.py:813] - Generating Histogram for `tpch_tiny`.`lineitem`.`L_SHIPINSTRUCT` with 16 n_buckets

在执行上面的python命令的时候,我提前开启了mysql的general_log日志,发现它底层执行了这些命令(这里只摘录了部分命令):

代码语言:txt
AI代码解释
复制
    SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, ENGINE, 
                   VERSION, ROW_FORMAT, TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH, 
                   MAX_DATA_LENGTH, INDEX_LENGTH, DATA_FREE, AUTO_INCREMENT, 
                   CREATE_TIME, UPDATE_TIME, CHECK_TIME, TABLE_COLLATION, 
                   CHECKSUM, CREATE_OPTIONS, TABLE_COMMENT 
            FROM information_schema.TABLES 
            WHERE table_schema = 'tpch_tiny' and ENGINE = 'InnoDB';
            
    show table status in `tpch_tiny` like 'customer';

    select table_schema, table_name, column_name, ordinal_position, is_nullable,
                    data_type, character_maximum_length, character_octet_length, numeric_precision,
                    numeric_scale, datetime_precision, character_set_name, collation_name,
                    column_type, column_key, extra 
                from information_schema.columns 
                where table_schema='tpch_tiny' and table_name='customer';
                
    select table_schema as dbname, table_name as table_name, index_name as index_name, 
                                non_unique as non_unique, seq_in_index as seq_in_index,
                                column_name as column_name, cardinality as cardinality,
                                sub_part as sub_part, is_visible as is_visible,
                                expression as expression, collation as collation, index_type as index_type
                    from information_schema.statistics
                    where table_schema = 'tpch_tiny' and table_name='customer';

    select TABLE_NAME, N_ROWS,CLUSTERED_INDEX_SIZE, SUM_OF_OTHER_INDEX_SIZES from `mysql`.`innodb_table_stats` where database_name='tpch_tiny';
    
    select n_rows, clustered_index_size, sum_of_other_index_sizes from mysql.innodb_table_stats where database_name='tpch_tiny' and table_name='customer';

    show create table `tpch_tiny`.`customer`;

    SELECT COUNT(DISTINCT P_COMMENT) FROM `tpch_tiny`.`part`;
    
    -- 直方图信息
    ANALYZE TABLE `tpch_tiny`.`customer` UPDATE HISTOGRAM ON C_ACCTBAL WITH 16 BUCKETS;
    ANALYZE TABLE `tpch_tiny`.`customer` DROP HISTOGRAM ON C_ACCTBAL;

    -- 数据抽样 
    SELECT MIN(C_CUSTKEY) as min, MAX(C_CUSTKEY) as max FROM tpch_tiny.customer;
    SELECT COUNT(1) FROM tpch_tiny.customer WHERE C_CUSTKEY IS NULL;
    SELECT COUNT(1) FROM tpch_tiny.customer;    
    SELECT COUNT(1) as bucket_count, COUNT(DISTINCT C_CUSTKEY) as bucket_ndv,
            MIN(C_CUSTKEY) as actual_min, MAX(C_CUSTKEY) as actual_max
            FROM tpch_tiny.customer
            WHERE C_CUSTKEY >= 9390 AND C_CUSTKEY < 18755;

5、开始验证

登录到docker部署的videx-server的mysql中,查看执行计划:

代码语言:txt
AI代码解释
复制
    mysql -h192.168.31.181 -P13308 -uvidex -ppassword -Dvidex_tpch_tiny
    -- SET @VIDEX_SERVER='127.0.0.1:5001'; -- Not needed for Docker deployment
    -- Connect VIDEX-Optimizer: mysql -h127.0.0.1 -P13308 -uvidex -ppassword -Dvidex_tpch_tiny

    USE videx_tpch_tiny;
    EXPLAIN 
    FORMAT = JSON
    SELECT s_name, count(*) AS numwait
    FROM supplier,
         lineitem l1,
         orders,
         nation
    WHERE s_suppkey = l1.l_suppkey
      AND o_orderkey = l1.l_orderkey
      AND o_orderstatus = 'F'
      AND l1.l_receiptdate > l1.l_commitdate
      AND EXISTS (SELECT *
                  FROM lineitem l2
                  WHERE l2.l_orderkey = l1.l_orderkey
                    AND l2.l_suppkey <> l1.l_suppkey)
      AND NOT EXISTS (SELECT *
                      FROM lineitem l3
                      WHERE l3.l_orderkey = l1.l_orderkey
                        AND l3.l_suppkey <> l1.l_suppkey
                        AND l3.l_receiptdate > l3.l_commitdate)
      AND s_nationkey = n_nationkey
      AND n_name = 'IRAQ'
    GROUP BY s_name
    ORDER BY numwait DESC, s_name;

然后,再把上面的这个explain的sql放到真实的3306的mysql里面执行下,对比二者的explain开销情况,可以看到差别并不大。

此外,可以再做如下的实验:

代码语言:txt
AI代码解释
复制
   1、采集完数据后,在 13308这个Videx数据库实例里,执行一个加索引后查看执行计划的操作:
    > mysql -h127.0.0.1 -P13308 -uvidex -ppassword -Dvidex_tpch_tiny 
    [videx_tpch_tiny] > alter table lineitem add index idx_2222(L_LINENUMBER);
    Query OK, 0 rows affected (0.04 sec)
    Records: 0  Duplicates: 0  Warnings: 0

    [videx_tpch_tiny] > explain select * from lineitem where L_LINENUMBER=3 ;
    +----+-------------+----------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
    | id | select_type | table    | partitions | type | possible_keys | key      | key_len | ref   | rows  | filtered | Extra |
    +----+-------------+----------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
    |  1 | SIMPLE      | lineitem | NULL       | ref  | idx_2222      | idx_2222 | 4       | const | 10724 |   100.00 | NULL  |
    +----+-------------+----------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
    1 row in set, 1 warning (0.02 sec)

    [videx_tpch_tiny] > select * from lineitem where L_LINENUMBER=3 ;  -- videx不支持真实的查询操作
    ERROR 1031 (HY000): Table storage engine for 'lineitem' doesn't have this option
    
    2、在真实MySQL上执行上述操作
    [tpch_tiny] > explain select * from lineitem where L_LINENUMBER=3 ;
    +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
    | id | select_type | table    | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
    +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
    |  1 | SIMPLE      | lineitem | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 59231 |    10.00 | Using where |
    +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
    1 row in set, 1 warning (0.00 sec)

    [tpch_tiny] > alter table lineitem add index idx_2222(L_LINENUMBER);
    Query OK, 0 rows affected (0.26 sec)
    Records: 0  Duplicates: 0  Warnings: 0

    [tpch_tiny] > explain select * from lineitem where L_LINENUMBER=3 ;
    +----+-------------+----------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
    | id | select_type | table    | partitions | type | possible_keys | key      | key_len | ref   | rows  | filtered | Extra |
    +----+-------------+----------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
    |  1 | SIMPLE      | lineitem | NULL       | ref  | idx_2222      | idx_2222 | 4       | const | 10795 |   100.00 | NULL  |
    +----+-------------+----------+------------+------+---------------+----------+---------+-------+-------+----------+-------+
    1 row in set, 1 warning (0.00 sec)


    3、可以看到videx的仿真效果还是不错的。 

6、总结

  1. 从实验上看,videx的explain的还是不错的,比我之前自己写代码抽样真实数据的简单并且开销低很多
  2. 在执行python采样脚本的时候,看了下虽然是单线程执行的,但是ANALYZE TABLE 更新HISTOGRAM的时候会持MDL锁的
  3. 在执行python采样脚本的时候,如果源库里面执行count时候,遇到很多大表问题可能造成磁盘IO过高的情况
  4. 如果在源库里后加的表,执行python采样脚本不会执行再次采样命令,需要先删除 videx_metadata_tpch_tiny.json 这个文件后重新采集统计信息,目前看还没支持增量采样(后续可能会支持这个功能,不然开销会比较大)
  5. 执行脚本会到数据库执行analyze操作,这个操作是会记录到binlog里面的。如果我们连接到从库去执行脚本,会导致主从binlog不一样,造成主从切换失败。但是如果连接到主库,又会对主库造成压力,影响主库服务稳定性。这个改下代码应该比较好解决。
  6. 注意:这个videx项目只是数据采样,不包括sql的索引推荐,索引推荐还是需要自己写代码去枚举可能的索引清单。拿到可能的索引清单后,就可以直接到videx去执行加索引操作,然后再次explain查询评估最新的sql cost情况。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 0、演示环境
  • 1、启动videx-server
  • 2、克隆代码
  • 3、连接到真实MySQL 初始化测试数据集
  • 4、收集并导入 VIDEX 元数据
  • 5、开始验证
  • 6、总结
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档