|
本帖最后由 〇〇 于 2024-12-23 09:08 编辑
我把我的等值连接的代码和newkid的在duckdb做了对比,都能hash join
1000车100w人
〇〇 3.108
newkid1 1.815s
newkid2 1.287s
10000车1000w人
〇〇 180s没出来
newkid2 107.867s
4G内存好像有些地址都坏了
- with p as (
- select passenger.*
- ,row_number() over(partition by departure_station,arrival_station order by passenger_id) rn ---- 为每组起终点的乘客编一个序号
- from passenger)
- ,tr as (
- select tr1.*
- ,s2-chunk_count+lvl as chunk_id ---- 把每趟车复制成多行,每个分块一行,并进行编号
- ,s2*20-chunk_count*20 as s1 ----- 本车次座位序号的下限,上限是s2
- from (select train_id,departure_station,arrival_station,flag
- ,case when flag=0 then seat_count/20 else trunc(seat_count*0.1/20) end as chunk_count ---- flag=0为原车次记录,flag=1为无座票,容量只有10%, 这样排序时原车优先于无座
- ,sum(case when flag=0 then seat_count/20 else trunc(seat_count*0.1/20) end) over(partition by departure_station,arrival_station order by flag,train_id) as s2 ---- 把座位数按20分块,然后做滚动累计
- from train t
- cross join (select 0 flag union all select 1 ) ---和两条记录的小集合做笛卡尔积,生成额外的车次记录
- ) tr1
- ,lateral(select i lvl from range(1,chunk_count::int+1)t(i)) ---- lateral view会把每一行数据复制成多行
- )
- select p.passenger_id,p.departure_station,p.arrival_station
- ,tr.train_id
- ,case when tr.flag=0 then ceil((p.rn-tr.s1)/100)::int end as coach_number ---- 算出车厢号
- ,case when tr.flag=0 then
- -- mod(p.rn-tr.s1-1,100)+1 ---得到1-100的编号
- ceil((mod(p.rn-tr.s1-1,100)+1)/5)::int||SUBSTR('FABCE',mod(mod(p.rn-tr.s1-1,100)+1,5)::int+1,1)
- when tr.flag=1 then ' no seat'
- end as seat_number
- from p left join tr on p.departure_station =tr.departure_station and p.arrival_station =tr.arrival_station
- and ceil(p.rn/20) = tr.chunk_id ----- 建立乘客序号和座位序号的映射关系,只需将乘客号除以20并座取整
- order by passenger_id;
- │ P00121772? │ 杭州 │ 杭州96 │ G86? │ 3 │ 8C │
- │ P00121772¢ │ 杭州 │ 杭州97 │ G86¢ │ 3 │ 8C │
- │ P00121772£ │ 杭州 │ 杭州98 │ G86£ │ 3 │ 8C │
- │ P00121772¤ │ 杭州 │ 杭州99 │ G86¤ │ 3 │ 8C │
- ├──────────────┴───────────────────┴─────────────────┴──────────┴──────────────┴─────────────┤
- │ 10000000 rows (40 shown) 6 columns │
- └────────────────────────────────────────────────────────────────────────────────────────────┘
- Run Time (s): real 107.867 user 43.265625 sys 19.109375
复制代码
补充,速度慢是因为机器差,内存没有坏,是我生成数据用chr()的参数超过127了
在麒麟990上(8核)
newkid方法2 ,4线程 2.341s
newkid方法2 ,8线程 2.034s
|
|