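This post documents a Kaggle competition. The task is to use MLB players' historical performance data, social media data, and team factors such as market size to predict how fans will engage with players' digital content (social media interaction) in the future. The model predicts four digital engagement indices for each MLB player (target1 to target4).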
MLB Player Digital Engagement Forecasting
Competition background and task:
A player hits a walk-off home run. A pitcher throws a no-hitter. A team gets red hot going into the Postseason. We know some of the catalysts that increase baseball fan interest. Now Major League Baseball (MLB) and Google Cloud want the Kaggle community’s help to identify the many other factors which pique supporter engagement and create deeper relationships between players and fans.
The sport has a long history of being numbers-driven. Nearly every day from at least April through October, baseball fans watch, read, and search for information about players. Which individuals they seek can depend on player performance, team standings, popularity, among other, currently unknown factors—which could be better understood thanks to data science.
Since at least the early 1990s, MLB has led the sports world in the use of data, showing fans, players, coaches, and media what’s possible when you combine data with human performance. MLB continues its leadership using technology to engage fans and provide new fans innovative ways to experience America’s Favorite Pastime.
MLB has teamed up with Google Cloud to transform the fan experience through data. Google Cloud proudly supports this Kaggle contest to celebrate the launch of Vertex AI: Google Cloud’s new platform to unify your ML workflows.
In this competition, you’ll predict how fans engage with MLB players’ digital content on a daily basis for a future date range. You’ll have access to player performance data, social media data, and team factors like market size. Successful models will provide new insights into what signals most strongly correlate with and influence engagement.
Imagine if you could predict MLB All Stars all season long or when each of a team’s 25 players has his moment in the spotlight. These insights are possible when you dive deeper into the fandom of America’s pastime. Be part of the first method of its kind to try to understand digital engagement at the player level in this granular, day-to-day fashion. Simultaneously help MLB build innovation more easily using Google Cloud’s data analytics, Vertex AI and MLOps tools. You could play a part in shaping the future of MLB fan and player engagement.
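This is a time-series task: MLB players' historical performance data, social media data, and team factors such as market size are used to predict future fan engagement with the players' digital content (social media interaction). The model predicts four digital engagement indices per player (target1 to target4). The aim is to extract value from the future social media engagement between MLB fans and players.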
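In short, MLB has led the sports world in the use of data since at least the early 1990s, and it keeps using new technology to engage fans and to give new fans innovative ways to experience America's favorite pastime.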
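Evaluation metric: MCMAE (mean column-wise mean absolute error). The mean absolute error is computed for each of the four target variables, and the score is the mean of those four MAE values.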
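The metric can be sketched in a few lines of NumPy. This is a minimal illustration of the definition above, not the official scoring code; the function name `mcmae` and the sample values are made up for the example.

```python
import numpy as np

def mcmae(y_true, y_pred):
    """Mean column-wise MAE: one MAE per target column, then their mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_target_mae = np.abs(y_true - y_pred).mean(axis=0)  # shape: (n_targets,)
    return per_target_mae.mean()

# Two player-days by four targets (target1 to target4); values are invented.
y_true = np.array([[1.0, 2.0, 3.0, 4.0],
                   [2.0, 2.0, 2.0, 2.0]])
y_pred = np.array([[1.0, 3.0, 3.0, 2.0],
                   [2.0, 2.0, 4.0, 2.0]])

score = mcmae(y_true, y_pred)  # per-target MAEs are 0, 0.5, 1, 1
```

Because every target contributes equally to the mean, a target with a larger typical error dominates the score unless the model handles it well.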
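Solution overview
The competition provides four digital engagement measures (target1 to target4) for the 2,055 MLB players active in the 2021 season, together with about 7.9 GB of historical data on each player's team, past games, past scores, awards, and game events (January to April 2021). This data is combined with machine learning to build a model that predicts the players' future (May 2021) digital engagement indices. After seasonal EDA and summary statistics of the players' historical information, features were engineered, and an ANN (artificial neural network) was blended with LightGBM and CatBoost (ensemble learning). Each model's hyperparameters were tuned with grid search. The solution earned a bronze medal on the leaderboard.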
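Solution pipeline:
- mlb-ann-training: ANN model training code
- mlb-lightgbm-training: LightGBM model training code
- mlb-catboost-training: CatBoost model training code
- End-to-end inference code (feature extraction, hyperparameter tuning, model blending)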
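The write-up does not say how the three models were blended. A weighted average of their predictions is one common choice, sketched below under that assumption; the function `blend`, the weights, and the prediction values are all hypothetical.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of per-model predictions, each shaped (n_rows, 4)."""
    names = sorted(predictions)
    w = np.array([weights[name] for name in names], dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    stacked = np.stack([predictions[name] for name in names])  # (n_models, n_rows, 4)
    return np.tensordot(w, stacked, axes=1)  # contracts the model axis -> (n_rows, 4)

# Placeholder predictions for two rows and four targets.
preds = {
    "ann": np.full((2, 4), 1.0),
    "lightgbm": np.full((2, 4), 2.0),
    "catboost": np.full((2, 4), 3.0),
}
weights = {"ann": 0.5, "lightgbm": 0.25, "catboost": 0.25}  # hypothetical weights

blended = blend(preds, weights)
```

In practice the weights would be chosen on a validation set, for example by minimising the MCMAE of the blended predictions.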
Features
I used about 440 features. In addition to joins and as-of merges of the basic tables, the following features were used:
- lag features per player
  - Average of the last 7/28/70/360/720 days
  - Average over the on-seasons
  - Average for the same period in the previous year
  - Average of days with/without a game
- number of events, pitch events, action events
- days from last rosters, awards, transactions and box scores
- sum of box scores in the last 7/30/90 days
- number of games and events in the day
- event-level meta feature
  - aggregation of predictions of model trained on event table
  - group by (date, playerId), (date, teamId) and (date)
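As an illustration of the per-player lag features in the list above, here is a minimal pandas sketch of trailing averages. It assumes one row per player per day and columns named `playerId`, `date` and `target1`; the helper `add_lag_means` and the demo values are hypothetical, and the window counts rows rather than calendar days.

```python
import pandas as pd

def add_lag_means(df, windows=(7, 28)):
    """Add per-player trailing means of target1 that exclude the current day."""
    df = df.sort_values(["playerId", "date"]).copy()
    for w in windows:
        # shift(1) keeps the current day's target out of its own feature.
        df[f"target1_mean_{w}d"] = df.groupby("playerId")["target1"].transform(
            lambda s: s.shift(1).rolling(w, min_periods=1).mean()
        )
    return df

# Tiny invented example: two players, three days each.
demo = pd.DataFrame({
    "playerId": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2021-04-01", "2021-04-02", "2021-04-03"] * 2),
    "target1": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
out = add_lag_means(demo, windows=(2,))
```

The first day of each player has no history, so its feature is NaN. Tree models such as LightGBM and CatBoost can consume NaN directly, while a neural network would need an imputed value.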
Cumcount Leakage
There is a strange correlation between the cumcount of the dataframe retrieved from the Time-Series API and the target.
I noticed this problem 3 days before the competition ended. I did not post it in the discussion as it might confuse the participants, but contacted the host immediately.
Adding this cumcount to the features improves the CV only slightly, so it is probably some kind of artifact. Even so, it would be better to shuffle the test data, because the order of the rows should not carry any information.
I did not end up using this leak for the final submission.
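A rough sketch of how such a row-order signal could be checked, and how shuffling removes it. The data here is synthetic and deliberately ordered, so it does not reproduce the actual leak; the column names and helper functions are assumptions for the example.

```python
import numpy as np
import pandas as pd

def add_row_order(df):
    """Add each row's 0-based position within its date, in delivered order."""
    df = df.copy()
    df["cumcount"] = df.groupby("date").cumcount()
    return df

def shuffle_within_date(df, seed=0):
    """Randomly reorder rows inside each date so the order carries no signal."""
    rng = np.random.default_rng(seed)
    shuffled = df.iloc[rng.permutation(len(df))]
    # A stable sort by date keeps the random order inside each date.
    return shuffled.sort_values("date", kind="stable").reset_index(drop=True)

# Synthetic frame where earlier rows happen to have larger targets.
demo = pd.DataFrame({
    "date": ["2021-05-01"] * 4 + ["2021-05-02"] * 4,
    "playerId": [1, 2, 3, 4, 1, 2, 3, 4],
    "target1": [9.0, 7.0, 4.0, 1.0, 8.0, 6.0, 3.0, 2.0],
})

ordered = add_row_order(demo)
corr = ordered["cumcount"].corr(ordered["target1"])  # strongly negative here
shuffled = shuffle_within_date(demo)
```

A correlation far from zero between `cumcount` and a target suggests that the delivery order leaks information.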
Implementation Note
Building a complex data pipeline in Jupyter Notebook with the Time Series API can be a big pain. I’ll share some of my efforts.
- Maintain the source code on GitHub and paste the BASE64-encoded code into the Jupyter notebook
- The inference notebook is also maintained on GitHub and automatically uploaded as the Kaggle Kernel through GitHub Actions
- Avoid the use of pandas and instead use a dictionary of numpy arrays to manage state updates
- Use the same feature generation function for training data and inference
  - Both training and test data are treated as streaming data, and features are generated in a for-loop.
  - This is the most important point for getting a stable and bug-free data pipeline
  - see: https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/196942
- Debug code locally using the API emulator
- Test the robustness of my inference pipeline by dropping out some of the data returned by the emulator (a kind of “Chaos Engineering”)
- Catch exceptions in various functions and convert them to appropriate “default” values
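Several of the points above can be combined in one small sketch: state kept in a dictionary of NumPy arrays, a single streaming loop shared by training and inference, exceptions converted to default values, and a "dropout" switch for robustness testing. This is not the author's code; every name here (`safe`, `recent_mean`, `update_state`, `run_stream`, `drop_prob`) is hypothetical, and dropping state updates is only one way to simulate missing data.

```python
import numpy as np

def safe(default):
    """Decorator: return `default` instead of raising, so inference never crashes."""
    def wrap(fn):
        def inner(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return default
        return inner
    return wrap

@safe(default=0.0)
def recent_mean(state, player_id):
    """Mean of the player's stored history; unseen players raise KeyError."""
    return float(state[player_id].mean())

def update_state(state, player_id, value, window=3):
    """Append today's value and keep only the last `window` values."""
    history = state.get(player_id, np.empty(0))
    state[player_id] = np.append(history, value)[-window:]

def run_stream(days, drop_prob=0.0, seed=0):
    """Process days in order; the same loop serves training and inference."""
    rng = np.random.default_rng(seed)
    state, features = {}, []
    for day in days:  # each day is a list of (playerId, value) pairs
        for player_id, _ in day:
            features.append(recent_mean(state, player_id))  # uses past data only
        for player_id, value in day:
            if rng.random() >= drop_prob:  # "dropout": simulate missing data
                update_state(state, player_id, value)
    return features

days = [
    [(1, 2.0), (2, 10.0)],
    [(1, 4.0), (2, 20.0)],
    [(1, 6.0), (3, 5.0)],
]
features = run_stream(days)
features_all_dropped = run_stream(days, drop_prob=1.0)
```

Features for a day are computed before that day's values enter the state, which mirrors how the Time Series API reveals data and keeps training features consistent with inference.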