To explore the HC Corpora dataset (Blogs, News, and Twitter) and establish a foundation for a high-performance next-word prediction algorithm.
- **Dataset**: HC Corpora (EN-US), containing over 4M lines of text.
- **Sampling**: A 1% random sample is used for computational efficiency and reproducibility.
- **Tools**: `tidytext` for tokenization and `ggplot2` for frequency visualization.
- **Scale**: The raw text files exceed 550MB, necessitating an optimized N-gram dictionary.
- **Patterns**: N-gram distributions follow Zipf's Law, allowing for significant data pruning without sacrificing accuracy.
- **Strategy**: A Katz Back-off model is identified as the most efficient approach for the final Shiny application.

Before cleaning, I analyzed the raw files to understand their scale. The volume of data requires an efficient sampling strategy to maintain sub-second prediction latency in the final product.
```r
library(stringi)  # stri_count_words()
library(knitr)    # kable()

# File sizes in megabytes
file_size_blogs   <- file.info("en_US.blogs.txt")$size / 1024^2
file_size_news    <- file.info("en_US.news.txt")$size / 1024^2
file_size_twitter <- file.info("en_US.twitter.txt")$size / 1024^2

blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = c(file_size_blogs, file_size_news, file_size_twitter),
  Line_Count = c(length(blogs), length(news), length(twitter)),
  Word_Count = c(sum(stri_count_words(blogs)),
                 sum(stri_count_words(news)),
                 sum(stri_count_words(twitter)))
)

kable(summary_table, caption = "Summary of Raw Data Files (English US)", digits = 2)
```

| File | Size_MB | Line_Count | Word_Count |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546806 |
| News | 196.28 | 77259 | 2674561 |
| Twitter | 159.36 | 2360148 | 30096690 |
To ensure the algorithm remains fast and responsive, I extracted a 1% random sample. The cleaning pipeline included: conversion to lowercase, removal of punctuation and numbers, and filtering of excess white space.
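The sampling and cleaning step can be sketched as below. The function name `clean_sample`, the default seed, and the exact regex-based cleaning are illustrative choices, not taken from the original code; only the 1% rate and the cleaning operations themselves come from the description above.

```r
library(dplyr)

# Illustrative sketch: sample a fraction of the raw lines, then lowercase,
# strip punctuation and numbers, and collapse excess white space.
clean_sample <- function(lines, rate = 0.01, seed = 1234) {
  set.seed(seed)                                   # reproducibility
  sampled <- sample(lines, max(1, floor(length(lines) * rate)))
  tibble(text = sampled) %>%
    mutate(
      text = tolower(text),                        # lowercase
      text = gsub("[[:punct:]]", "", text),        # remove punctuation
      text = gsub("[0-9]+", "", text),             # remove numbers
      text = gsub("\\s+", " ", trimws(text))       # collapse white space
    )
}

# Applied to the combined corpus loaded earlier:
# sample_df <- clean_sample(c(blogs, news, twitter))
```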
I analyzed N-grams (contiguous sequences of n words), which capture the probability of word transitions, the core logic of my predictor.
Bigrams are the foundation of my next-word prediction. For example, the model identifies “of the” and “in the” as highly frequent anchors.
```r
# Top 15 bigrams in the cleaned 1% sample
bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  top_n(15, n)

ggplot(bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "#e74c3c") +
  coord_flip() +
  labs(title = "Top 15 Most Common English Bigrams", x = "Bigram", y = "Frequency") +
  theme_minimal()
```

I will implement a Katz Back-off model.
- **Trigram Match**: The system first looks for a match based on the last two words typed.
- **Bigram Match**: If no trigram is found, it "backs off" to a match on the last word alone.
- **Unigram Match**: Fall back to the most frequent word in the corpus.
- **Pruning**: Remove N-grams that appear only once (singletons).
- **Data Structure**: Use `.rds` files for fast loading and a low memory footprint.

The full source code for this project and the final Shiny application is available on GitHub: View Source on GitHub
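As an illustration of the back-off lookup described above: the table names (`trigram_df`, `bigram_df`, `unigram_df`), their column layout, and the simple highest-count tie-break are assumptions for this sketch, not the final smoothed Katz implementation.

```r
# Sketch of the back-off lookup over pruned frequency tables.
# Assumed columns: trigram_df(w1, w2, w3, n), bigram_df(w1, w2, n),
# unigram_df(word, n).
predict_next <- function(w1, w2, trigram_df, bigram_df, unigram_df) {
  # 1. Trigram match on the last two words typed
  hit <- trigram_df[trigram_df$w1 == w1 & trigram_df$w2 == w2, ]
  if (nrow(hit) > 0) return(hit$w3[which.max(hit$n)])
  # 2. Back off to a bigram match on the last word alone
  hit <- bigram_df[bigram_df$w1 == w2, ]
  if (nrow(hit) > 0) return(hit$w2[which.max(hit$n)])
  # 3. Fall back to the most frequent word in the corpus
  unigram_df$word[which.max(unigram_df$n)]
}

# Tables are pruned of singletons and stored as .rds, e.g.:
# saveRDS(trigram_df[trigram_df$n > 1, ], "trigram.rds")
# trigram_df <- readRDS("trigram.rds")
```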