Author : Harsh Mehta
Date : Sun 22 May 2022
Subject : Markov Chains
Dataset : Quora Dataset (available on Kaggle)
import pandas as pd
import markovify
df = pd.read_csv("train.csv")
df
| | qid | question_text | target |
---|---|---|---|
2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |
5 | 00004f9a462a357c33be | Is Gaza slowly becoming Auschwitz, Dachau or T... | 0 |
6 | 00005059a06ee19e11ad | Why does Quora automatically ban conservative ... | 0 |
... | ... | ... | ... |
1306117 | ffffcc4e2331aaf1e41e | What other technical skills do you need as a c... | 0 |
1306118 | ffffd431801e5a2f4861 | Does MS in ECE have good job prospects in USA ... | 0 |
1306119 | ffffd48fb36b63db010c | Is foam insulation toxic? | 0 |
1306120 | ffffec519fa37cf60c78 | How can one start a research project based on ... | 0 |
1306121 | ffffed09fedb5088744a | Who wins in a battle between a Wolverine and a... | 0 |
1306120 rows × 3 columns
# dropna() returns a new DataFrame, so assign the result to keep it.
# The frame still has 1306120 rows × 3 columns afterwards, i.e. no rows contained missing values.
df = df.dropna()
n = df.question_text.nunique()
print(n, "unique questions")
1306120 unique questions
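# Note: all three models are built from the same text with the same state_size,
# so combining them with equal weights should behave like a single model.
model = markovify.NewlineText(df.question_text, state_size=2)
model2 = markovify.NewlineText(df.question_text, state_size=2)
model3 = markovify.NewlineText(df.question_text, state_size=2)
ensemble_model = markovify.combine([model, model2, model3], [1, 1, 1])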
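To see what an equal-weight combination of identical models amounts to, here is a small sketch in plain Python. It assumes that combining models means summing their follower counts scaled by the weights, which is my understanding of `markovify.combine` rather than something shown in this notebook. The helper names `transition_probs` and `combine_counts` and the toy counts are hypothetical.

```python
from collections import Counter


def transition_probs(counts):
    """Turn raw follower counts for one state into probabilities."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def combine_counts(tables, weights):
    """Sum follower counts from several tables, each scaled by its weight."""
    combined = Counter()
    for table, weight in zip(tables, weights):
        for word, count in table.items():
            combined[word] += weight * count
    return combined


# Hypothetical follower counts for a single state, e.g. ("How", "do")
followers = Counter({"I": 3, "you": 1})

# Three identical tables with equal weights, mirroring the cell above
combined = combine_counts([followers, followers, followers], [1, 1, 1])

print(transition_probs(followers))
print(transition_probs(combined))
```

Under that assumption every count is simply tripled, so the normalized probabilities do not change. Combining tends to be more useful when the models come from different corpora or are given different weights.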
for i in range(10):
    # make_sentence() may return None if no sufficiently original sentence is found
    print(ensemble_model.make_sentence())
How do I tell my relatives on the hub?
How can I pay a bulk transcoder tool for A/B testing?
What is the smartest?
Can Afghan cricket players and teams?
Do you have cancer?
What should I choose business school?
Is it bad to use North Korean leader two-faced?
Why is United Nations and the United States?
Should I buy more than Democrat voters?
Can the killing of another essay using original designs, would it be, and why?
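The sentences above are locally plausible but globally incoherent because each word is chosen by looking only at the two words before it (`state_size=2`). The following is a minimal sketch of that idea in plain Python, not markovify's actual implementation. The names `build_chain`, `generate`, `BEGIN`, `END` and the toy corpus are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"  # hypothetical sentinel tokens


def build_chain(lines, state_size=2):
    """Map each state (a tuple of `state_size` words) to counts of the words that follow it."""
    chain = defaultdict(lambda: defaultdict(int))
    for line in lines:
        words = line.split()
        if not words:
            continue
        padded = [BEGIN] * state_size + words + [END]
        for i in range(len(words) + 1):
            state = tuple(padded[i:i + state_size])
            chain[state][padded[i + state_size]] += 1
    # Convert to plain dicts so lookups never create new entries
    return {state: dict(followers) for state, followers in chain.items()}


def generate(chain, state_size=2, rng=None, max_words=50):
    """Walk the chain from the start state, sampling each next word by its count."""
    rng = rng or random.Random()
    state = (BEGIN,) * state_size
    words = []
    for _ in range(max_words):
        followers = chain.get(state)
        if not followers:
            break
        choices = list(followers)
        weights = [followers[w] for w in choices]
        next_word = rng.choices(choices, weights=weights)[0]
        if next_word == END:
            break
        words.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(words)


# Toy corpus standing in for the question_text column
toy_corpus = [
    "How do I learn Python?",
    "How do I learn chess?",
    "Why is the sky blue?",
]
toy_chain = build_chain(toy_corpus)
print(generate(toy_chain, rng=random.Random(0)))
```

With a corpus this small the chain can only reproduce its inputs. On a large corpus many states have several possible followers, which is where the mixed-up questions come from.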