Doris 作业案例 电脑版发表于:2025/4/2 0:03  >#Doris 作业案例 [TOC] 打地鼠案例 ------------ tn2>需求:连续4次命中的人 seq:第几次打地鼠 m:是否命中,1-> 命中,0 -> 未命中 下面是`hit_mouse.txt`数据文件: ```sql uid,seq,m u01,1,1 u01,2,0 u01,3,1 u01,6,1 u02,5,1 u02,6,0 u02,7,0 u02,1,1 u02,2,1 u03,4,1 u03,5,1 u03,6,0 u02,3,0 u02,4,1 u02,8,1 u01,4,1 u01,5,0 u02,9,1 u03,1,1 u03,2,1 u03,3,1 ``` tn2>执行命令进行创建表。 ```sql create table hit_mouse ( user_id varchar(50), seq int, m int ) DUPLICATE KEY(user_id) DISTRIBUTED BY HASH(user_id) BUCKETS 1 PROPERTIES("replication_num" = "1"); ```  tn2>使用命令添加文件数据。 ```sql curl -u root: -H "label:hit_mouse" -H "column_separator:," -T "D:\dockerdata\Doris\hit_mouse.txt" http://127.0.0.1:8040/api/test/hit_mouse/_stream_load ```  ### 方案一 tn2>1.首先排除没有命中的数据 ```sql select user_id,seq,m, row_number()over(partition by user_id order by seq asc) as num from hit_mouse where m != 0 ; ```  tn2>2.将上面的语句改变下,现在需要的是编号减去行号,可否直接拿编号减行号呢? ```sql select user_id,seq,m, seq -row_number()over(partition by user_id order by seq asc) as num from hit_mouse where m != 0 ; ```  tn2>看num重复的个数(在同一个user_id中) ```sql select user_id, count(1) as cnt from ( select user_id,seq,m, seq - row_number()over(partition by user_id order by seq asc) as num from hit_mouse where m != 0 ) as t group by user_id,num having cnt>=4 ```  ### 方案二 tn2>方案二:在不需要返回具体连续命中多少次,只需要返回user_id的情况下,还可以这么做 1.在去掉了未命中数据后 2.开窗,拿当前行下面的第三行数据,如果说该用户是连续命中的, 必然下面第三行的序号等于第一行的序号加3,如果结果不等于3,他们必然是不连续的, 并且结果只可能大于3,中间有为没中的被过滤了 3.最后查看结果等于3的用户并返回即可 ```sql select user_id,seq,m, (lead(seq,3,-4) over(partition by user_id order by seq asc) - seq) as diff from hit_mouse where m != 0 ```  tn2>最后判断diff的差值是否=3,然后返回对应的user_id(有可能会有多条相同的数据,去重) ```sql select user_id from ( select user_id,seq,m, (lead(seq,3,-4)over(partition by user_id order by seq asc) - seq) as res from hit_mouse where m != 0 ) as t where res = 3 -- group by 去重 group by user_id ```  连续购买案例 ------------ tn2>创建`shop_sale.txt`数据,数据内容: ```bash a,2017-02-05,100 a,2017-02-06,300 a,2017-02-07,800 a,2017-02-08,500 a,2017-02-10,700 b,2017-02-05,200 b,2017-02-06,400 b,2017-02-08,100 b,2017-02-09,400 b,2017-02-10,600 c,2017-01-31,200 c,2017-02-01,600 c,2017-02-02,600 c,2017-02-03,600 c,2017-02-10,700 a,2017-03-01,400 a,2017-03-02,300 a,2017-03-03,700 a,2017-03-04,400 ``` tn2>创建表 ```bash create table shop_sale ( shop_id varchar(50), dt date, amount double ) DUPLICATE KEY(shop_id) DISTRIBUTED BY HASH(shop_id) BUCKETS 1 PROPERTIES("replication_num" = "1"); ``` tn2>通过本地文件的方式导入数据。 ```bash curl -u root: -H "label:shop_sale" -H "column_separator:," -T "C:\Users\bob.he\OneDrive - Oerlikon Group\Desktop\project\Doris\shop_sale.txt" http://127.0.0.1:8040/api/test/shop_sale/_stream_load ```   ### 方案一 tn2>这样的连续销售记录问题(连续登录问题)和上面打地鼠的需求是一样的 1.按照店铺分组,对日期排序后打行号 2.用日期减去行号,得到的新的日期值,如果新的日期相同的话就代表是连续的 3.统计相同新日期的个数,来判断连续登录了几天 ```bash with t as ( --按照店铺分组,对日期排序后打行号 select * ,row_number() over (partition by shop_id order by dt ) xuhao from shop_sale ),t2 as ( --用日期减去行号,得到的新的日期值,如果新的日期相同的话就代表是连续的 select * ,date_sub(dt,INTERVAL xuhao DAY) dt2 from t ),t3 as ( -- 3. 统计相同新日期的个数,来判断连续登录了几天 select shop_id from t2 group by shop_id,dt2 having count(1) >=3 ) --返回满足条件的店铺名称 select distinct shop_id from t3; ```  ### 方案二 tn2>需要求连续三天的,我们取下面的第二行日期,拿取过来的下面的日期对当前行的日期相减,取间隔几天 如果他们的值 = 2 就代表是连续的 ```bash -- 1. 使用窗口函数 lead() 获取当前行的下两行日期 with t as ( select * ,datediff(lead(dt,2,'9999-12-31') over(partition by shop_id order by dt asc),dt) diffdays from shop_sale -- 计算当前日期与下两行日期的间隔天数 ) select shop_id from t where diffdays = 2 group by shop_id; ```  分组topn案例 ------------ tn2>需求: 基于上面的表,求每个店铺金额最大的前三条订单 (row_number over) 求每个店铺销售金额前三名的订单 ```bash select shop_id,dt,amount from ( select shop_id,dt,amount, row_number()over(partition by shop_id order by amount desc) as num from shop_sale ) as tmp where num <=3; ``` 