One day this code monkey was going about work as usual: sipping a glass of water at just the right temperature, a ray of sunlight sneaking through the not-quite-closed curtain onto my desk, the work group chat quiet, me skimming the news and wandering around the dev communities. Slacking-off time at its finest... And then a message from a customer-service colleague shattered the calm: time to go fix a BUG!!!
After some digging, I pinned the root cause down to an overly complex query that joins several large tables, making it extremely slow (the SQL statements do all hit indexes).
One of the SQL statements looks like this:
SELECT
COUNT( DISTINCT shop_buyer_table.shop_buyer_id )
FROM
shop_buyer_two_1 shop_buyer_table
INNER JOIN ( SELECT shop_buyer_id FROM trade_two_2 WHERE seller_id = 1 AND tid IN ( '567447201041519047', '676218113338647025', '571344929052519047', '1293487669586767982' ) ) trade_table ON trade_table.shop_buyer_id = shop_buyer_table.shop_buyer_id
INNER JOIN (
SELECT
shop_buyer_id
FROM
short_link_mobile_two_5
WHERE
1 = 1
AND sms_batch_id IN ( '50000003', '50000023', '50000033', '50135887', '50159568', '50168572', '50174063', '50188130', '50188133' )
AND last_click_time >= '2018-10-23T16:00:00.000Z'
AND last_click_time <= '2019-11-18T16:00:00.000Z'
) short_link_table ON short_link_table.shop_buyer_id = shop_buyer_table.shop_buyer_id
INNER JOIN (
    (
        SELECT
            shop_buyer_id
        FROM
            (
                SELECT DISTINCT shop_buyer_id
                FROM trade_two_2
                JOIN ( SELECT shop_buyer_id FROM shop_buyer_two_1 WHERE seller_id = 1 AND buyer_last_time >= '2019-06-18 00:00:00' AND buyer_last_time <= '2019-11-11 23:59:59' ) trade1 USING ( shop_buyer_id )
                WHERE seller_id = 1
                  AND created >= '2019-06-18 00:00:00'
                  AND created <= '2019-11-11 23:59:59'
            ) trade5
        JOIN (
                SELECT DISTINCT shop_buyer_id
                FROM trade_two_2
                WHERE seller_id = 1
                  AND created >= '2019-09-09 00:00:00'
                  AND created <= '2019-10-01 00:00:00'
            ) trade6 USING ( shop_buyer_id )
        WHERE shop_buyer_id NOT IN (
                SELECT DISTINCT shop_buyer_id
                FROM trade_two_2
                WHERE seller_id = 1
                  AND created >= '2018-11-01 00:00:00'
                  AND created <= '2019-11-11 00:00:00'
            )
    )
    UNION
    (
        SELECT DISTINCT shop_buyer_id
        FROM trade_two_2
        WHERE seller_id = 1
          AND created >= '2020-11-10 00:00:00'
          AND created <= '2020-11-11 00:00:00'
    )
) a ON a.shop_buyer_id = shop_buyer_table.shop_buyer_id
WHERE
seller_id = 1
AND ( sms_marketing_count = '2' )
AND '1_6f822712-3d34-45bb-b25d-3aa5228ae85b_9' = '1_6f822712-3d34-45bb-b25d-3aa5228ae85b_9'
shop_buyer_two_1, trade_two_2, and short_link_mobile_two_5 each hold more than 3 million rows.
The feature lets a seller filter out members for marketing campaigns from the trade table trade_xxx, the member table shop_buyer_xxx, the sub-order table order_, and the short-link table short_link_mobile_xxx.
The rough requirement is shown in the figure below:
From the figure we can tell:
1. For INNER JOIN, prefer the smaller table as the driving table.
2. Join as few tables as possible.
But under the current requirement, joining these large tables is still far too slow.
From the above I could be fairly sure the requirement's real goal is to filter out members by applying intersection, union, and difference operations to the trade and member data.
Since it boils down to intersection/union/difference, why not skip MySQL's table joins altogether and hand those set operations to Redis?
Redis intersection/union/difference: sinterstore, sunionstore, sdiffstore
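As a sketch of what those three commands compute, here are pure-Java stand-ins for the sinterstore / sunionstore / sdiffstore semantics; the class and method names are my own illustration, not a Redis client API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: mirrors the semantics of Redis's sinterstore /
// sunionstore / sdiffstore on plain Java sets, so the idea can be tried
// without a Redis server.
public class RedisSetSemantics {

    // sinterstore dest a b  ->  dest = a ∩ b
    public static <T> Set<T> sinterstore(Set<T> a, Set<T> b) {
        Set<T> dest = new HashSet<>(a);
        dest.retainAll(b);
        return dest;
    }

    // sunionstore dest a b  ->  dest = a ∪ b
    public static <T> Set<T> sunionstore(Set<T> a, Set<T> b) {
        Set<T> dest = new HashSet<>(a);
        dest.addAll(b);
        return dest;
    }

    // sdiffstore dest a b  ->  dest = a \ b
    public static <T> Set<T> sdiffstore(Set<T> a, Set<T> b) {
        Set<T> dest = new HashSet<>(a);
        dest.removeAll(b);
        return dest;
    }

    public static void main(String[] args) {
        Set<String> tradeBuyers = new HashSet<>(Arrays.asList("1001", "1002", "1003"));
        Set<String> clickBuyers = new HashSet<>(Arrays.asList("1002", "1003", "1004"));
        System.out.println(sinterstore(tradeBuyers, clickBuyers)); // members in both
        System.out.println(sunionstore(tradeBuyers, clickBuyers)); // members in either
        System.out.println(sdiffstore(tradeBuyers, clickBuyers));  // traded but never clicked
    }
}
```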
Solution:
Break the query above into several single-table queries.
After verification:
Intersection/union/difference on Redis Set structures takes only a few seconds (versus joins that can't finish in minutes), so it does meet the requirement, but the Set structure's memory footprint is very high.
Storing 10 million 20-character member IDs in a Set eats around 850 MB of memory, which really burns RAM. The same data in a List structure takes only about 90 MB, but Lists offer no intersection/union/difference APIs.
Log in with a Redis client, check memory usage with the info command, then run the test code:
public static void testRedis() {
    String host = "localhost";
    int port = 6379;
    Jedis jedis = null;
    String key = "@Test:mySet";
    try {
        jedis = new Jedis(host, port);
        // jedis.auth("123456");
        jedis.select(1);
        // 2. store a value
        jedis.set("name", "imooc");
        // 3. read it back
        String value = jedis.get("name");
        System.out.println(value);
        jedis.del(key);
        jedis.flushDB();
        jedis.flushAll();
        // pipeline the writes so 10 million inserts don't pay a round trip each
        Pipeline pl = jedis.pipelined();
        for (int i = 0; i < 10000000; i++) {
            // pl.sadd(key, 10000000 + i + "123456789"); // Set variant: ~850 MB
            pl.rpush(key, 10000000 + i + "123456789");   // List variant: ~90 MB
        }
        pl.sync();
        // Long scard = jedis.scard(key);
        // System.out.println("scard:" + scard);
        Long llen = jedis.llen(key);
        System.out.println("llen:" + llen);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (null != jedis) {
            try {
                jedis.close();
            } catch (Exception e) {
                System.out.println("failed to close the Redis connection");
                e.printStackTrace();
            }
        }
    }
}
Redis can do the job, but the memory cost is too high. If a dozen or so sellers each bring 10 million members, that starts to burn real money.
After some searching, we found another way to do intersection/union/difference: Java's own Set collections.
Java Set collections:
In Java, 20 million member IDs take only about 60 MB and roughly 4 s:
public class SetOptUtils {

    /**
     * Intersection of two sets; the result is left in resultSet.
     * Note: tmpSet may be mutated when it is the smaller set.
     */
    public static <T> void intersect(Set<T> resultSet, Set<T> tmpSet) {
        if (resultSet.size() <= tmpSet.size()) {
            resultSet.retainAll(tmpSet);
        } else {
            // retainAll walks the set it is called on, so run it on the smaller set
            tmpSet.retainAll(resultSet);
            resultSet.clear();
            resultSet.addAll(tmpSet);
        }
    }

    /**
     * Union of two sets; the result is left in resultSet.
     */
    public static <T> void union(Set<T> resultSet, Set<T> tmpSet) {
        resultSet.addAll(tmpSet);
    }

    /**
     * Difference of two sets (resultSet minus tmpSet).
     */
    public static <T> void diff(Set<T> resultSet, Set<T> tmpSet) {
        resultSet.removeAll(tmpSet);
    }

    public static void main(String[] args) {
        Runtime r = Runtime.getRuntime();
        r.gc();
        long startRAM = r.freeMemory();
        Set<String> list1 = new HashSet<>();
        Set<String> list2 = new HashSet<>();
        int size = 20000000;
        for (int i = 0; i <= size; i++) {
            list1.add("a" + i);
            if (list2.size() <= 10000000) {
                list2.add("a" + i);
            }
        }
        long endRAM = r.freeMemory();
        String result = "RAM test done, approximate memory used: " + (startRAM - endRAM);
        System.out.println(result);
        System.out.println("SetOptUtils.............");
        long start2 = System.currentTimeMillis();
        SetOptUtils.intersect(list1, list2);
        long end2 = System.currentTimeMillis();
        System.out.println(list1.size());
        System.out.println(list2.size());
        System.out.println("SetOptUtils elapsed (size: " + size + "): " + (end2 - start2));
    }
}
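For a sense of how plain set operations replace the joins, here is a toy walk-through of the original query's shape (the member IDs and set contents are made up purely for illustration; in the real flow each set would be loaded by one of the single-table queries):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy walk-through: the original query's three INNER JOINs on shop_buyer_id
// are just intersections of member-ID sets. All data here is invented.
public class MemberFilterSketch {

    public static Set<String> filter(Set<String> tradeBuyers,      // trade_table subquery
                                     Set<String> shortLinkBuyers,  // short_link_table subquery
                                     Set<String> unionedBuyers) {  // the UNION subquery "a"
        Set<String> result = new HashSet<>(tradeBuyers);
        result.retainAll(shortLinkBuyers);
        result.retainAll(unionedBuyers);
        return result;
    }

    public static void main(String[] args) {
        Set<String> trade = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
        Set<String> clicks = new HashSet<>(Arrays.asList("B", "C", "D", "E"));
        Set<String> unioned = new HashSet<>(Arrays.asList("C", "D", "F"));
        // COUNT(DISTINCT shop_buyer_id) becomes the size of the final set
        System.out.println(filter(trade, clicks, unioned).size()); // 2 (C and D)
    }
}
```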
As shown above, Java's Set collections fit both the memory budget and the time budget of this requirement.
Final solution: run each single-table query and cache its result in Redis as a List structure; each round, pull two collections out of Redis, do the intersection/union/difference in Java, and finally write the result back to Redis.
Admittedly this approach is a bit slow in terms of real-time behavior. If strict freshness isn't required, it is still worth borrowing; if you need something closer to real time, consider Elasticsearch.
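A minimal sketch of that pipeline, with a plain HashMap standing in for Redis (key → List payload) so it runs without a server; in the real system the store/fetch calls would be Jedis rpush/lrange round trips:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the final pipeline. A HashMap stands in for Redis here; in
// production each entry would be a Redis List written by one single-table
// query (RPUSH) and read back in full (LRANGE 0 -1).
public class PipelineSketch {

    private final Map<String, List<String>> redis = new HashMap<>();

    public void store(String key, List<String> memberIds) {
        redis.put(key, new ArrayList<>(memberIds)); // stand-in for RPUSH
    }

    // Pull two cached results, intersect them in a Java Set, cache the outcome.
    public void intersectInto(String destKey, String keyA, String keyB) {
        Set<String> result = new HashSet<>(redis.get(keyA)); // stand-in for LRANGE
        result.retainAll(new HashSet<>(redis.get(keyB)));
        store(destKey, new ArrayList<>(result));
    }

    public List<String> fetch(String key) {
        return redis.get(key);
    }

    public static void main(String[] args) {
        PipelineSketch p = new PipelineSketch();
        p.store("trade", Arrays.asList("1001", "1002", "1003"));
        p.store("shortlink", Arrays.asList("1002", "1003", "1004"));
        p.intersectInto("result", "trade", "shortlink");
        System.out.println(p.fetch("result").size()); // 2 matching members
    }
}
```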