The Word2Vec tool comes from the paper "Efficient Estimation of Word Representations in Vector Space".

The paper gives a link to the open-source code, but the page returned a 404 when I followed it, so I installed Word2Vec a different way.

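The author's open-source release includes the source code for the paper. Unzip the source archive and open it in PyCharm, then switch PyCharm to the Word2Vec environment (if unzipping fails, first create a new folder named Word2Vec).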

Screenshot 2023-07-03 095919

Once the project is open, find the README file written by the author:

Tools for computing distributed representtion of words
------------------------------------------------------

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous
Bag-of-Words or the Skip-Gram neural network architectures. The user should to specify the following:
- desired vector dimensionality
- the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
- training algorithm: hierarchical softmax and / or negative sampling
- threshold for downsampling the frequent words
- number of threads to use
- the format of the output word vector file (text or binary)

Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets.

The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training
is finished, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/

In short: word2vec ships implementations of the Continuous Bag-of-Words (CBOW) and Skip-gram (SG) models plus several demo scripts, and it learns one vector for every word in the vocabulary of a text corpus. The user chooses the vector dimensionality, the context window size, the training algorithm (hierarchical softmax and/or negative sampling), the downsampling threshold for frequent words, the number of threads, and the output format (text or binary). Other hyper-parameters such as the learning rate usually do not need tuning. The demo-word.sh script downloads a small (100MB) corpus, trains a small word vector model, and then lets the user explore word similarity interactively.
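To make the README's option list concrete, here is a minimal sketch (in Python, not part of the word2vec distribution) that pairs each option with the command-line flag demo-word.sh uses for it. The helper name build_word2vec_args and the DEMO_OPTIONS dictionary are made up for illustration; the flag names and values are the ones the demo script passes.

```python
def build_word2vec_args(train_file, output_file, options):
    """Assemble a word2vec argument list from a dict of flag -> value."""
    args = ["word2vec", "-train", train_file, "-output", output_file]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args


# Values taken from demo-word.sh; comments tie each flag to the README item.
DEMO_OPTIONS = {
    "cbow": 1,         # architecture: 1 = CBOW, 0 = Skip-gram
    "size": 200,       # desired vector dimensionality
    "window": 8,       # size of the context window
    "negative": 25,    # negative sampling with 25 negative examples
    "hs": 0,           # hierarchical softmax switched off
    "sample": "1e-4",  # threshold for downsampling frequent words
    "threads": 20,     # number of threads to use
    "binary": 1,       # output format: 1 = binary, 0 = text
    "iter": 15,        # number of training iterations
}

print(" ".join(build_word2vec_args("text8", "vectors.bin", DEMO_OPTIONS)))
```

Printing the joined list gives the training command from the demo script without its leading time, which makes it easy to change one option at a time.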

So let's try running demo-word.sh first. To start, I had the almighty ChatGPT convert it into a demo-word.bat script:

@echo off
make

if not exist text8 (
powershell -Command "& { Invoke-WebRequest http://mattmahoney.net/dc/text8.zip -OutFile text8.gz }"
powershell -Command "& { Expand-Archive -Path text8.gz -DestinationPath . -Force }"
)

time word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

distance vectors.bin

Let's run it:

image-20230703100224888

Let's change it as the message suggests:

@echo off

if not exist text8 (
powershell -Command "& { Invoke-WebRequest http://mattmahoney.net/dc/text8.zip -OutFile text8.zip }"
powershell -Command "& { Expand-Archive -Path text8.zip -DestinationPath . -Force }"
)

time word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

distance vectors.bin

Run it again:

image-20230703100810359

It looks like distance vectors.bin is the problem, so, following the original script, I changed it to ./distance vectors.bin.

It still failed, so I spent my one-per-day ChatGPT-4 request on converting the script to see what it would give:

@echo off
make

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.gz'"
gzip -d text8.gz -f
)

echo Calculating word vectors...
powershell -Command "Measure-Command { .\word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 } | Select-Object TotalSeconds"

echo Running distance...
.\distance vectors.bin

Then I tried changing .gz to zip:

@echo off
make

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.zip'"
gzip -d text8.zip -f
)

echo Calculating word vectors...
powershell -Command "Measure-Command { .\word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 } | Select-Object TotalSeconds"

echo Running distance...
.\distance vectors.bin

image-20230703101439557

Surprisingly, this produced a strange result. Time for more editing and deleting:

@echo off

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.zip'"
zip -d text8.zip -f
)

echo Calculating word vectors...
powershell -Command "Measure-Command { .\word2vec\truck -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 } | Select-Object TotalSeconds"

echo Running distance...
.\distance vectors.bin

Well, still the same error as above.
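A note on the extraction step: zip -d removes entries from an archive instead of unpacking it, so with that line the text8 file is never created. One alternative that needs no external unzip tool is Python's standard zipfile module. The sketch below is only an illustration: fetch_corpus is a hypothetical helper, and it assumes the archive contains a single file named text8.

```python
import urllib.request
import zipfile
from pathlib import Path

TEXT8_URL = "http://mattmahoney.net/dc/text8.zip"  # URL used by the demo script


def fetch_corpus(url=TEXT8_URL, workdir=".", name="text8"):
    """Download <name>.zip if needed and extract it; return the corpus path."""
    workdir = Path(workdir)
    corpus = workdir / name
    if corpus.exists():
        return corpus  # same guard as "if not exist text8" in the .bat file
    archive = workdir / (name + ".zip")
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir)  # assumed to produce a file called <name>
    return corpus
```

Calling fetch_corpus() in the project folder should leave a text8 file next to the sources, which is what the training command expects.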

I changed it again:

@echo off

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.zip'"
zip -d text8.zip -f
)

echo Calculating word vectors...
powershell -Command "Measure-Command { .\word2vec.c -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 } | Select-Object TotalSeconds"

echo Running distance...
.\distance.c vectors.bin

Run it:

image-20230703102332133

On top of that, VS Code suddenly popped up. Why does ./distance keep failing?

Time to ask the almighty ChatGPT again:

image-20230703102608238

But I don't seem to have a distance file. Let me search for it:

image-20230703125901033

OK, it is there after all. That makes things simple; I edited the script once more:

@echo off

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.zip'"
zip -d text8.zip -f
)

echo Calculating word vectors...
powershell -Command "Measure-Command { .\word2vec.c -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 } | Select-Object TotalSeconds"

echo Running distance...
.\distance.c vectors.bin

Run it once more:

image-20230703130041541

Ohhh, did it actually finish? Let me check the README's description of this demo again:

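The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.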
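For context, the distance tool ranks vocabulary words by cosine similarity to a query word. The sketch below does the same thing in Python for a vectors.bin written with -binary 1. It assumes the file layout is a text header line "vocab_size dim" followed, for each word, by the word, a space, dim 32-bit little-endian floats, and a newline. read_vectors, cosine and most_similar are illustrative names, not part of word2vec.

```python
import math
import struct


def read_vectors(path):
    """Load {word: tuple_of_floats} from a word2vec-style binary file."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            chars = bytearray()
            while True:
                c = f.read(1)
                if c in (b" ", b""):
                    break          # a space ends the word; b"" means end of file
                if c != b"\n":
                    chars += c     # skip the newline left by the previous entry
            values = struct.unpack(f"<{dim}f", f.read(4 * dim))
            vectors[chars.decode("utf-8", errors="replace")] = values
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=5):
    """Return the topn (similarity, word) pairs closest to the query word."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return sorted(scored, reverse=True)[:topn]
```

With a trained model in place, most_similar(read_vectors("vectors.bin"), "france") would list the nearest words, provided "france" is in the vocabulary. This pure-Python loop is slow on a full model and is meant only to show the idea.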

There should be at least some output. My guess is that the .c files, which were supposed to be run in cmd, were simply opened in VS Code instead.

Let me see whether the script can be changed to handle that; back to ChatGPT for help:

@echo off
gcc -o your_program your_program.c
your_program.exe

Note that your_program should be replaced with the name of your C program, and your_program.c with the name of your C source file.
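Applied to this project, the compile-then-run pattern might look like the sketch below, written in Python so that every step is explicit. compile_command and compile_and_run are hypothetical helpers. The -lm, -pthread and -O3 options are an assumption about what the sources need; word2vec.c relies on POSIX threads, so whether it builds unmodified depends on the gcc toolchain installed on Windows.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def compile_command(source, compiler="gcc"):
    """Return the compiler invocation that turns e.g. distance.c into distance(.exe)."""
    exe = Path(source).stem + (".exe" if sys.platform == "win32" else "")
    return [compiler, source, "-o", exe, "-lm", "-pthread", "-O3"]


def compile_and_run(source, args=(), compiler="gcc"):
    """Compile a C source file, then run the resulting executable with args."""
    if shutil.which(compiler) is None:
        raise FileNotFoundError(f"{compiler} is not on PATH")
    cmd = compile_command(source, compiler)
    subprocess.run(cmd, check=True)             # stop here if compilation fails
    exe_path = str(Path(".") / cmd[3])
    return subprocess.run([exe_path, *args]).returncode
```

The point of the sketch is the order of operations: the .c file is only ever an input to the compiler, and the arguments such as vectors.bin go to the compiled executable, for example compile_and_run("distance.c", ["vectors.bin"]).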

Following that pattern, I tried adapting the script myself:

@echo off

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.zip'"
zip -d text8.zip -f
)

echo Calculating word vectors...
powershell -Command "Measure-Command { word2vec.c -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 } | Select-Object TotalSeconds"
word2vec.exe

echo Running distance...
gcc -o distance distance.c
vectors.bin
distance.exe

image-20230703130956790

Hmm, as expected, I still can't fix it on my own. Back to having ChatGPT revise it line by line:

@echo off

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.zip'"
zip -d text8.zip -f
)

echo Calculating word vectors...
gcc -o word2vec word2vec.c
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

echo Running distance...
gcc -o distance distance.c
./distance vectors.bin

Run it:

image-20230703132032334

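It looks like gcc did compile and run, but now the time prompt behaves differently and things are wrong again. Let me try combining the two versions myself: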

@echo off

if not exist text8 (
powershell -Command "Invoke-WebRequest -Uri 'http://mattmahoney.net/dc/text8.zip' -OutFile 'text8.zip'"
zip -d text8.zip -f
)

echo Calculating word vectors...
"Measure-Command {
gcc -o word2vec word2vec.c
--train text8 ^
--output vectors.bin ^
--cbow 1 ^
--size 200 ^
--window 8 ^
--negative 2 ^
--hs 0 ^
--sample 1e-4 ^
--threads 20 ^
--binary 1 ^
--iter 15 ^
}"

echo Running distance...
gcc -o distance distance.c
--output vectors.bin
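
Two details in these attempts are worth spelling out. In cmd.exe, time is the built-in that shows the system clock and prompts for a new time, not the Unix command that times a program, and Measure-Command { ... } is PowerShell syntax that a .bat file cannot execute directly. A minimal sketch of timing a child process in Python instead, where timed_run is a hypothetical helper:

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run cmd (a list of arguments) and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd)
    return result.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # training command once a compiled word2vec executable exists.
    code, seconds = timed_run([sys.executable, "-c", "print('training...')"])
    print(f"exit code {code}, {seconds:.2f} s")
```

This plays the role of time in the original shell script: it wraps the training command and reports how long it took.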

Well, it crashed again. In that case I'll just try running the code directly in Visual Studio.

PS: if Visual Studio reports that it cannot find a startup item, you can try this:

image-20230703152429717

Never mind... I'll go find a video instead; I remember "Bilibili University" has a tutorial on using this tool.