昨天在 Hacker News 首頁上看到「Tracking the Fake GitHub Star Black Market (dagster.io)」這篇,原文在「Tracking the Fake GitHub Star Black Market with Dagster, dbt and BigQuery」這邊。
作者群想要偵測 GitHub 上面 fake star 的行為,所以就跑去找黑市買,然後找到了兩家,Baddhi Shop (1000 個 $64) 與 GitHub24 (每個 €0.85,大約是 $0.91),價錢差異很大,「品質」差異也很大:貴的 star 在一個月後還是存在,而便宜的看起來有一些有被 GitHub 偵測到而清除掉:
A month later, all 100 GitHub24 stars still stood, but only three-quarters of the fake Baddhi Shop stars remained. We suspect the rest were purged by GitHub’s integrity teams.
接下來就是想要系統化分析,切入點是 GH Archive 這個服務,可以直接下載 GitHub 全站上的 public evnets 資料:
GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
想要偵測兩種不同的 fake account,第一種是 obvious fake account,定義成這樣:
- Created in 2022 or later
- Followers <=1
- Following <= 1
- Public gists == 0
- Public repos <=4
- Email, hireable, bio, blog, and twitter username are empty
- Star date == account creation date == account updated date
從定義就可以看出來完全就是灌水帳號,開出來就沒在動的。從 screenshot 可以看出這種帳號長的都一樣:
另外一種則是透過演算法去分析,這邊拿 unsupervised clustering 類的演算法分析出來的結果,可以看到抓到很多:
最近 NN 類的 machine learning 演算法太多,看到這些傳統的 machine learning 演算法還是覺得頗新鮮的…