<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[James Thorne]]></title><description><![CDATA[James Thorne]]></description><link>https://jamesthorne.com/</link><image><url>https://jamesthorne.com/favicon.png</url><title>James Thorne</title><link>https://jamesthorne.com/</link></image><generator>Ghost 5.31</generator><lastBuildDate>Tue, 14 Oct 2025 22:50:55 GMT</lastBuildDate><atom:link href="https://jamesthorne.com/blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Next Big Ideas in NLP]]></title><description><![CDATA[<p>This post is a summary of the &quot;Next Big Ideas&quot; session at ACL: established members of the NLP community formed a panel presenting their view of the future of the field.</p><p>The session featured a series of talks and a panel from Heng Ji, Mirella Lapata, Dan Roth,</p>]]></description><link>https://jamesthorne.com/blog/next-big-ideas-in-nlp/</link><guid isPermaLink="false">62de07e2b44bbb0001fd3905</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Fri, 05 Aug 2022 09:43:24 GMT</pubDate><content:encoded><![CDATA[<p>This post is a summary of the &quot;Next Big Ideas&quot; session at ACL: established members of the NLP community formed a panel presenting their view of the future of the field.</p><p>The session featured a series of talks and a panel from Heng Ji, Mirella Lapata, Dan Roth, Thamar Solorio, Marco Baroni, Hang Li, and Eduard Hovy. The panel was moderated by Iryna Gurevych. 
</p><h2 id="talk-synopses">Talk Synopses </h2><h3 id="heng-ji-falling-in-love-again-with-structures">Heng Ji: Falling in Love (again) with Structures.</h3><p>This big idea revisited one previously explored in the early 2000s: <a href="https://www.cs.cmu.edu/~ref/mlim/chapter3.html">cross-lingual information extraction and automated text summarization.</a> Information Extraction and Text Summarization share the same goal: delivering the most important information to users. However, they have different underlying assumptions about structure.</p><p>NLP today is sequential and flat: implicit representations, sentence-level input. Large-scale language models work well on multiple tasks and languages without task-specific annotation. But at these sizes, models can only be trained by large research groups and corporations with the appropriate computing resources.</p><p>While previous research aimed to extract information from an entire corpus, contemporary machine-learning methods have mostly been tested and evaluated on sentence-level information extraction. However, with the availability of new model types that can capture long-range dependencies, the talk points to a potential future where both document-level and corpus-level information extraction could be performed again.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/07/Screenshot-2022-07-25-at-12.19.21.png" class="kg-image" alt loading="lazy" width="556" height="307"></figure><p>However, what most large-scale methods (such as BERT-based IE) don&apos;t take advantage of is the structure in the inputs or the outputs. Yet structure is a feature of many tasks (e.g. 
structured knowledge bases, chemical formulae, event schemas, parse trees, etc.) and is beneficial to some tasks in low-resource settings.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/07/Screenshot-2022-07-25-at-12.20.42.png" class="kg-image" alt loading="lazy" width="565" height="314"></figure><p>There are two issues: how to acquire knowledge and how to encode it. </p><p>Structured data is widely available in natural structures (e.g. the web) or curated forms (such as knowledge graphs). Some of these structures can be learned by large-scale models from unstructured data to further support extraction.</p><p>The advantages of structured knowledge are that it is <em>compositional</em> and can be used for cross-sentence reasoning, <em>generalizable</em> and can be used for synthesis and updating to unseen tasks, <em>transferable</em>, serving as a bridge to unseen tasks or languages, and <em>explainable</em>, giving an intermediate representation that expresses some fact; it also has <em>utility</em>, as users can provide feedback and influence the results. </p><p>There are many ways to encode this structure, too: pre-trained language models provide a mechanism for type discovery and data augmentation, graph neural networks could be used to update embeddings based on network structure, constraints could enforce structure during inference, and self-supervision could be used to align to structures.</p><p>The concluding remarks of the talk were that &quot;flat&quot; models in NLP will come and go and that flatness is a property imposed by the models that we are using. In the natural world, however, structure exists and will remain. </p><h3 id="mirella-lapata-paying-attention-to-stories"><strong>Mirella Lapata: Paying Attention to Stories</strong></h3><p>The way children learn about society, how culture is kept alive, and how values are instilled is through stories. Stories are everywhere: movies, podcasts, and more. 
Understanding the science of stories is well studied and has been the subject of hundreds of books. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/07/Screenshot-2022-07-25-at-12.27.47.png" class="kg-image" alt loading="lazy" width="538" height="204"></figure><p>Stories have shapes and structure (e.g. orientation, rising action, climax, falling action, and resolution) and also recurrent themes: man-vs-beast, for example, appears in Jaws, Jurassic Park, and Beowulf. Mirella&apos;s argument is that if we crack the problem of stories, we crack the problem of natural language understanding: it allows us to ask contextual questions over very long passages of text. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-10.46.46.png" class="kg-image" alt loading="lazy" width="562" height="233"></figure><p>To make progress, we must consider challenges of data, modelling, and evaluation.</p><p>Each book, movie, podcast, or script contains a lot of data but is also a single datapoint. It is not feasible to annotate large collections of stories for the myriad questions that humans would want to ask. As a community we must use automated and self-/semi-supervised methods to make use of the data with minimal human intervention.</p><p>Stories exceed the memory capacity of today&apos;s conventional modelling choices. For example, models such as transformers have fixed input sizes and do not scale well to inputs beyond hundreds of tokens. Stories, in contrast, contain thousands of words and have dependencies spanning many sections. There are many interdependent plot lines, character interactions, and causal actions in stories that must be understood. We need to think about new architectures with models that interact: some would be general purpose, some would be specialist. 
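</p><p>To make the length problem concrete, here is a minimal sketch of the usual workaround for a fixed-context model: splitting a long story into overlapping windows. The window and stride sizes are my own illustration, and chunking is exactly the kind of fix that loses the cross-section dependencies the talk highlights.</p>

```python
# Sketch of the fixed-context limitation: a model that reads at most
# `window` tokens at a time must see a long story as overlapping
# chunks, losing any dependency that spans chunk boundaries.
# Window/stride values are illustrative, not from the talk.
def chunk(tokens, window=512, stride=384):
    chunks = []
    for start in range(0, max(1, len(tokens) - window + stride), stride):
        chunks.append(tokens[start:start + window])
    return chunks

story = [f"tok{i}" for i in range(2000)]
pieces = chunk(story)
print(len(pieces), len(pieces[0]))  # 5 512
```

<p>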
</p><p>The talk then concludes by pointing to resources in the NLP community, including workshops on storytelling, narrative understanding, and automatic summarisation for creative writing.</p><h3 id="dan-roth-its-time-to-reason">Dan Roth: It&apos;s time to reason</h3><p>A lot of machine-learning-based approaches to question answering exploit patterns and regularities in the inputs for a learning signal. Asking a question such as &quot;what day of the year is the longest in Boston?&quot; would give the same answer as the same question about New York. But if you change the input to ask about Melbourne, for example, the expected answer should be different. Learning these connections from a corpus of textual facts only allows a model to memorize existing facts; it cannot generalize to unseen cities or places, where the latitude/longitude is needed to condition the generation of the answer. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-14.03.25.png" class="kg-image" alt loading="lazy" width="744" height="471" srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-05-at-14.03.25.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-14.03.25.png 744w" sizes="(min-width: 720px) 720px"></figure><p>At the heart of this is a planning process: reasoning about which information is needed, finding the correct information, and generating the answer from a set of ground-truth facts. There are a lot of reasoning questions that depend on combining multiple pieces of world knowledge, both knowledge that is explicitly written and tacit knowledge that is not always expressed. An agent must devise a plan and access external resources to reason about an answer. 
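</p><p>The Boston/Melbourne example can be sketched in a few lines: condition the answer on a grounded attribute (latitude) rather than a memorised per-city string. The <code>City</code> type and the latitude values here are my own illustration, not from the talk.</p>

```python
from dataclasses import dataclass

@dataclass
class City:
    name: str
    latitude: float  # degrees; positive means northern hemisphere

def longest_day(city: City) -> str:
    # Reason from the attribute: the longest day falls at the summer
    # solstice of the city's hemisphere, so unseen cities generalize.
    return "around June 21" if city.latitude >= 0 else "around December 21"

print(longest_day(City("Boston", 42.4)))      # around June 21
print(longest_day(City("Melbourne", -37.8)))  # around December 21
```

<p>A model that merely memorises question-answer pairs has no way to produce the second answer for a city absent from its training corpus.</p><p>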
We need to extract, contextualize, and scope the answers, reason about quantities, and provide a reason for why we gave an answer (reasoning is about the reason).</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-14.01.59.png" class="kg-image" alt loading="lazy" width="866" height="369" srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-05-at-14.01.59.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-14.01.59.png 866w" sizes="(min-width: 720px) 720px"></figure><p>The second point of Dan Roth&apos;s talk was to consider how we supervise model training. Data for most tasks is collected by annotators performing collection and labelling for a single, narrow objective. However, in childhood, humans learn by correlating different expressions of the same signal or event. Dan introduced a potential form of supervision for models called &quot;Incidental Supervision&quot;. There can be many forms of incidental supervision; one example he gave was learning from texts with the same meaning. One event may be expressed in multiple sources with variation. </p><p>What we do today is not sufficient to deal with the sparse world, either for reasoning or for supervising training.</p><h3 id="thamar-solorio-human-level-multilinguality">Thamar Solorio: Human Level Multilinguality</h3><p>Half of the world&apos;s population uses two or more languages in everyday life, and speakers code-switch between languages in the same utterance. However, NLP technology caters to monolingual speakers. 
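</p><p>As a toy illustration of what code-switched input looks like to a system, here is a word-level language-identification pass over a mixed Spanish-English utterance. The tiny lexicons are invented for this sketch; real systems use trained classifiers and must handle ambiguous and out-of-vocabulary tokens.</p>

```python
# Toy word-level language ID for a code-switched utterance.
# The lexicons below are illustrative only.
ES = {"pero", "tengo", "tiempo", "hoy"}
EN = {"but", "i", "have", "a", "meeting", "today"}

def tag_tokens(utterance):
    tags = []
    for tok in utterance.lower().split():
        if tok in ES:
            tags.append((tok, "es"))
        elif tok in EN:
            tags.append((tok, "en"))
        else:
            tags.append((tok, "unk"))
    return tags

print(tag_tokens("pero I have a meeting hoy"))
```

<p>Even this trivial pass exposes the modelling problem: monolingual pipelines assume one language per utterance, while real utterances switch mid-sentence.</p><p>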
When NLP discusses &quot;multilingual&quot; models, we assume that the input is one language per utterance; the community could further strive for truly multilingual NLP technology that handles more than one language per utterance.</p><p>Multilingual settings are increasingly relevant for applications of NLP in the community, for example voice assistants, social media, and chatbots. In formal settings such as healthcare, this is equally important. There are several challenges, though, compounded by limited resources, noise in the data, and issues around transliteration and informal mappings between different scripts. Even if we know which languages are going to be mixed, we don&apos;t know when and how they will be mixed. The diversity of switching depends on the context, the speakers, and the power dynamics of the languages.</p><p>There are many related disciplines that computational linguistics and NLP researchers should be collaborating with: expert linguists, sociolinguists, and language-acquisition experts. The takeaway is that users shouldn&apos;t have to leave their linguistic identities behind when using NLP systems.</p><h3 id="marco-baroni-unnatural-language-processing">Marco Baroni: Unnatural language processing </h3><p>In natural language processing, we have moved through several methods in recent years: from training on task-specific data, to fine-tuning pre-trained models, to massive pre-trained language models used without fine-tuning, via prompt engineering alone. 
</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-14.54.02.png" class="kg-image" alt loading="lazy" width="613" height="384" srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-05-at-14.54.02.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-14.54.02.png 613w"></figure><p>Marco&apos;s big idea was that machine-to-machine bridging models could be used to allow coordination between multiple services that have inputs and outputs in a natural language. The idea seems similar to a message bus or middleware layer coordinating many 3rd-party services (e.g. LM1 could be a voice assistant and LM2 could be a shopping service). The LMs would be frozen and the M2M models would be learned in-situ.</p><p>This is a new task, so it would require new methods for training and for collecting data for supervision. There are challenges around understanding whether an interlingua emerges and whether these middle languages are interpretable and compositional. The findings could yield further insights about properties that are not revealed by natural languages alone.</p><p>This idea seems quite futuristic but builds upon research in many related NLP and ML domains, including Socratic Models, Adapters, Prompt Engineering, and Deep Net Emergent Communication.</p><h3 id="eduard-hovy-rediscovering-the-need-for-representation-and-knowledge">Eduard Hovy: Rediscovering the need for representation and knowledge</h3><p>Natural language processing can be viewed as a transduction task: converting one string or structure to another. If you did the task by hand, it would be useful to have some features to look out for that make the job easier. Neural networks do that in an automated way: learning which combinations of the input are useful features for the downstream task. 
Just as rule-based systems didn&apos;t solve problems completely, neural networks with billions of parameters haven&apos;t solved all of NLP&apos;s problems either, despite being trained on and exposed to a large portion of the web. There are many implicit or tacit facts that must be accounted for that aren&apos;t stated clearly in writing. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-17.33.45.png" class="kg-image" alt loading="lazy" width="481" height="273"></figure><p>There are many implicit facts needed to answer these questions or perform these inferences, drawing on rare knowledge that doesn&apos;t appear in the training data. If the knowledge is not in the language model, the model cannot learn a transformation to generate the answer. In contrast, people have lived in the world and can perform these inferences intuitively because they&apos;ve experienced a wide range of day-to-day activities. </p><p>There are gaps in the models&apos; reasoning, and we must be aware of these and identify strategies to fill them. However, we cannot know well in advance what these gaps actually are. </p><p>To build systems that understand users&apos; goals, research must be grounded in psychological frameworks, for example Maslow&apos;s hierarchy of needs or a goal hierarchy. You must find a sensible set of types that provides a rationale or reason behind the sentence. 
But this is not something that a machine can easily do.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-17.55.19.png" class="kg-image" alt loading="lazy" width="819" height="268" srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-05-at-17.55.19.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-17.55.19.png 819w" sizes="(min-width: 720px) 720px"></figure><p>Similarly, for common-sense knowledge schemas, Eduard gave the example of the DARPA KAIROS program: building a set of schemas that describe the evolution of events. These schema structures are very difficult to learn from large language models, and a lot of the structures have to be manually added based on human intuition and experience.</p><p>And finally, in groups, people play different roles in communities. For example, in Wikipedia, editors perform different tasks such as checking grammar, verifying facts, and removing vandalism. To identify these sorts of groupings, we need human intervention to build a schema defining these interactions. </p><p>For interesting NLP, where information is convoluted or difficult to access, we have to rely on that human intuition to decide what kind of knowledge we care about, what data contains it, how to make inferences, and how to evaluate it. If we can&apos;t find ways to automatically elicit this information, we will reach the same asymptotes that we have seen with rule-based systems, supervised learning, and now with large language models. We need to start thinking again about the structures and knowledge representations we need. NLP must stop being lazy &#x2013; there&apos;s more to understanding language than corpus data alone.</p><h3 id="hang-li-neural-symbolic-architecture-for-nlu">Hang Li: Neural Symbolic Architecture for NLU</h3><p>In psychology, researchers have proposed a dual system for how we think. 
Hang Li&apos;s proposal was to replicate the System 1 and System 2 duality in Natural Language Processing: combining logical reasoning with programs and analogical reasoning with neural representations, and combining the outputs for a final prediction.</p><p>In a task such as natural language inference, we perform entailment prediction using neural networks that have no formal, well-grounded method for numerical reasoning. Neural approaches don&apos;t work well for all types of numerical problems and cannot always generate an accurate answer. Similar architectures are used for tasks such as question answering. Again, seq2seq neural models do not explicitly model numerical reasoning. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-18.34.05.png" class="kg-image" alt loading="lazy" width="703" height="330" srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-05-at-18.34.05.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-18.34.05.png 703w"></figure><p>As an alternative to the analogical reasoning that is common in NLI and QA models, we could exploit program-based methods for explicit reasoning over numerical values: translating texts into programs. The encoder and decoder act as a translation from input text into programs. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-18.37.56.png" class="kg-image" alt loading="lazy" width="761" height="399" srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-05-at-18.37.56.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-05-at-18.37.56.png 761w" sizes="(min-width: 720px) 720px"></figure><p>A mixture of experts can combine both systems. Experimental results show improvements for combinations of models over either approach alone. 
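</p><p>The text-to-program idea can be illustrated with a toy rule-based parser standing in for the learned encoder-decoder: map a numerical question to a small executable program, then run it for an exact answer. The patterns below are invented for this sketch, not from the talk.</p>

```python
import re

# Toy stand-in for a seq2seq "text to program" translator: map a
# numerical question to a Python expression and execute it, giving
# exact arithmetic that a purely neural decoder cannot guarantee.
# The regex patterns are illustrative only.
def text_to_program(question: str) -> str:
    m = re.search(r"(\d+)\s+(?:plus|more than)\s+(\d+)", question)
    if m:
        return f"{m.group(1)} + {m.group(2)}"
    m = re.search(r"(\d+)\s+times\s+(\d+)", question)
    if m:
        return f"{m.group(1)} * {m.group(2)}"
    raise ValueError("unsupported question")

program = text_to_program("What is 3 more than 39?")
print(program, "=", eval(program))  # 3 + 39 = 42
```

<p>In the neural-symbolic architecture, a learned model would emit the program and a symbolic executor would evaluate it, with the mixture of experts combining this output with the purely neural prediction.</p><p>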
However, there are a lot of challenges around data generation and labelling that need to be addressed.</p><p><em>Slides are content copyright of the respective authors and are not subject to the creative-commons license of this website content. </em></p>]]></content:encoded></item><item><title><![CDATA[Learned Incremental Representations for Parsing: Studying the ACL 2022 Best Paper]]></title><description><![CDATA[<p>This year&apos;s ACL 2022 best paper was on the topic of incremental parsing. Parsing is something that has been well studied in the NLP and Computational Linguistics fields before. So why does this year&apos;s best paper award laud work on a topic that some</p>]]></description><link>https://jamesthorne.com/blog/acl-2022-best-papers/</link><guid isPermaLink="false">62eb918ab44bbb0001fd398c</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Thu, 04 Aug 2022 13:19:05 GMT</pubDate><content:encoded><![CDATA[<p>This year&apos;s ACL 2022 best paper was on the topic of incremental parsing. Parsing is something that has been well studied in the NLP and Computational Linguistics fields before. So why does this year&apos;s best paper award laud work on a topic that some might have considered out of fashion since deep learning and transformers have taken over the majority of NLP tasks? This blog post describes the approach taken by the authors, highlights some of my thoughts on its strengths and weaknesses, and suggests what other authors submitting to *CL venues could adopt to improve their papers in the future.</p><h2 id="avoiding-speculative-parsing">Avoiding speculative parsing</h2><p>The contribution of this paper is a mechanism that avoids speculation when ambiguity is encountered in a sentence. In the example in the paper and talk, the authors demonstrate how left-to-right parsing of a sentence introduces two valid branches which must both be considered. 
One path is valid, but the other would lead an incremental parser to a dead-end. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-04-at-18.41.03.png" class="kg-image" alt loading="lazy" width="458" height="329"></figure><p>The goal of the paper is to obviate the need for speculation in incremental parsing. Speculation is computationally expensive and requires committing to a decision with incomplete evidence.</p><p>Conventional incremental parsing uses a tree-equivalent schema. Each token in the sentence is assigned a label, and that sequence of labels yields a set of instructions that can be directly compiled into a tree. The issue with this approach is that in some cases the model must commit to one of these labels in the presence of incomplete evidence &#x2013; which may yield an incorrect tree. </p><p>While in previous work the schema is hand-defined, the novelty of this paper is to learn a latent set of actions that is tree-equivalent. The bulk of the computation in previous work is performed in the representation encoder that outputs the tag sequence for the sentence. 
In contrast, this work learns a tree-equivalent representation that is converted to a tree in an end-to-end manner.</p><figure class="kg-card kg-gallery-card kg-width-wide"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-04-at-19.09.52.png" width="772" height="504" loading="lazy" alt srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-04-at-19.09.52.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-04-at-19.09.52.png 772w" sizes="(min-width: 720px) 720px"></div><div class="kg-gallery-image"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-04-at-19.14.49.png" width="550" height="358" loading="lazy" alt></div></div></div></figure><p>Each token is encoded as a vector that is then quantized at the discretization layer of the encoder module yielding a sequence of numeric tags which are then converted into a tree structure. The encoder is computationally heavy whereas the neural tree converter is lightweight. Despite ambiguity in the sentence, the tag sequence up to and including the ambiguous tag is the same. The different tree structure is only resolved after the final tag is observed. 
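</p><p>The discretization step can be sketched as nearest-neighbour assignment against a learned codebook: each token vector is snapped to its closest codebook entry, and the entry&apos;s index becomes the token&apos;s tag. The codebook size and dimensions below are invented for illustration, not the paper&apos;s configuration.</p>

```python
import numpy as np

# Nearest-codebook quantization: one integer tag per token vector.
# Sizes are illustrative; the paper tunes the number of tags.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # 16 candidate tags, dim 8

def quantize(token_vectors: np.ndarray) -> np.ndarray:
    # distance from every token vector to every codebook entry
    dists = np.linalg.norm(
        token_vectors[:, None, :] - codebook[None, :, :], axis=-1
    )
    return dists.argmin(axis=1)       # index of the nearest entry

tokens = rng.normal(size=(5, 8))      # 5 encoded tokens
tags = quantize(tokens)
print(tags.shape)  # (5,)
```

<p>The lightweight tree converter then consumes these integer tags, which is why identical sentence prefixes yield identical tag prefixes and ambiguity can be deferred until later tags arrive.</p><p>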
The tags act as representations of the tokens as they arrive, rather than concrete actions.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-04-at-19.20.31.png" class="kg-image" alt loading="lazy" width="735" height="306" srcset="https://jamesthorne.com/content/images/size/w600/2022/08/Screenshot-2022-08-04-at-19.20.31.png 600w, https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-04-at-19.20.31.png 735w" sizes="(min-width: 720px) 720px"></figure><h2 id="the-model-doesnt-exceed-state-of-the-art-accuracy">The model doesn&apos;t exceed state of the art accuracy</h2><p>One fault I find with some papers and the reviewing process at ACL conferences is the need to chase state of the art. The model performs on par with contemporary approaches, but the accuracy doesn&apos;t exceed previous work. Is this a problem? No. While Span Classification (Kitaev et al., 2019) and Attach-Juxtapose (Yang and Deng, 2020) have marginally higher F1 scores in tree construction, both of these approaches exploit bidirectional encodings with a BERT model. </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/08/Screenshot-2022-08-04-at-19.41.04.png" class="kg-image" alt loading="lazy" width="457" height="305"></figure><p>What&apos;s actually quite exciting is that this paper achieves an on-par F1 score despite using only a unidirectional encoder. This means that the parse tree can be constructed while the token sequence is being streamed in, allowing real-time applications such as voice assistants. What is also interesting is that without the bidirectional encoders, the &quot;state of the art&quot; systems perform less well than the proposed method, highlighting a dependency on fixed encoder schemes to &quot;look ahead&quot; rather than incrementally parse the sentence. 
Other experiments in the paper evaluate the number of tags to use (showing diminishing returns up to 256 tags).</p><h2 id="why-is-this-paper-worthy-of-a-best-paper-award">Why is this paper worthy of a best paper award?</h2><p>I&apos;ve never received an award, nor been on a judging panel. But there are some positive parts of the paper I don&apos;t normally see in other works. If I were on the judging panel, this would have influenced my decision.</p><ul><li><strong>&quot;Undoing research&quot;</strong>: the task targets an established dataset with decades of research in building linearized incremental parsers. Many state-of-the-art methods are a result of manual choices of label schemas as well as advancements in modelling with bidirectional representations. From shift-reduce and arc-eager parsing to Attach-Juxtapose, researchers have devised hand-crafted tagsets and schemas to linearize the tree-construction process. This paper indicates that better schemas can be learned.</li><li><strong>&quot;Faster decoding&quot;</strong>: without beam search or the need for backtracking, the real-time runtime of this algorithm can be much faster. Performing greedy search instead of 5-beam search gives a 5x speedup. The model appears to be trained to induce the optimal path through the tagset and doesn&apos;t require beam search to find this. Similarly, because there is no speculation at the &quot;heavy&quot; encoder phase, there is no backtracking.</li><li><strong>Analysis</strong>: not all papers perform a rich analysis of why the method works and what can be learned from the technique. This paper has several studies that help the reader better understand why this method is interesting and show patterns and regularities that help justify the design choices in the paper. 
&#xA0;</li></ul>]]></content:encoded></item><item><title><![CDATA[Why I stopped using Mendeley as a reference manager]]></title><description><![CDATA[<p>As any student or academic would tell you, keeping track of references is essential. That&apos;s why tools such as <a href="https://mendeley.com/">Mendeley</a> and <a href="https://www.zotero.org/">Zotero</a> are helpful. They are a place to dump PDFs, collect thoughts, annotate documents, and export BibTeX files for the next paper you&apos;re working on.</p>]]></description><link>https://jamesthorne.com/blog/why-i-stopped-using-mendeley-as-a-reference-manger/</link><guid isPermaLink="false">628682d7b44bbb0001fd37a5</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Thu, 19 May 2022 18:19:20 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1524578271613-d550eacf6090?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEzfHxib29rc3xlbnwwfHx8fDE2NTI5ODM4ODI&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1524578271613-d550eacf6090?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEzfHxib29rc3xlbnwwfHx8fDE2NTI5ODM4ODI&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Why I stopped using Mendeley as a reference manager"><p>As any student or academic would tell you, keeping track of references is essential. That&apos;s why tools such as <a href="https://mendeley.com/">Mendeley</a> and <a href="https://www.zotero.org/">Zotero</a> are helpful. They are a place to dump PDFs, collect thoughts, annotate documents, and export BibTeX files for the next paper you&apos;re working on.</p><p>I started using Mendeley when writing my master&apos;s thesis. This was shortly after the publishing giant Elsevier got its hands on it, but before the wave of breaking changes started hitting. 
I valued having a place to put all my content and notes, and the reference manager became a more central part of my academic life. When writing my thesis, the integration with <a href="https://www.overleaf.com/learn/how-to/How_to_link_your_Overleaf_account_to_Mendeley_and_Zotero">Overleaf</a> was a huge time saver, automatically updating the bibliography file when I added new references and saving me the manual process of keeping those files in sync. </p><p>I stuck with version 1 well after version 2 was released. But then I got a new laptop, and everything changed when I had no option but to use version 2.</p><p>Mendeley used to work well as a standalone app. When writing, I could very quickly find a reference, use the search function, and cite it without giving it a second thought. But the latest changes and &quot;appification&quot; of the tool have made it unusable.</p><p>Now, Mendeley doesn&apos;t work offline. This makes it impossible to use when I&apos;m on a train or on a flight to a conference. If I try to view my library without Wi-Fi, I get nothing. This is not desirable, but it isn&apos;t a deal breaker.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://jamesthorne.com/content/images/2022/05/Screenshot-2022-05-19-at-18.48.52.png" class="kg-image" alt="Why I stopped using Mendeley as a reference manager" loading="lazy" width="1365" height="910" srcset="https://jamesthorne.com/content/images/size/w600/2022/05/Screenshot-2022-05-19-at-18.48.52.png 600w, https://jamesthorne.com/content/images/size/w1000/2022/05/Screenshot-2022-05-19-at-18.48.52.png 1000w, https://jamesthorne.com/content/images/2022/05/Screenshot-2022-05-19-at-18.48.52.png 1365w" sizes="(min-width: 720px) 720px"><figcaption>Without an internet connection, Mendeley shows you nothing.</figcaption></figure><p>My pain point comes from the need to sync automatically every time the app opens. 
Syncing is great: it keeps my library up to date (if I add documents from other places or my tablet). But search and filtering are disabled until the sync is finished. This can take a few minutes, significantly interrupting my writing flow: I&apos;m left twiddling my thumbs or going to Google Scholar rather than using the tool where all my references are stored.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://jamesthorne.com/content/images/2022/05/Screenshot-2022-05-19-at-18.40.50.png" class="kg-image" alt="Why I stopped using Mendeley as a reference manager" loading="lazy" width="1296" height="921" srcset="https://jamesthorne.com/content/images/size/w600/2022/05/Screenshot-2022-05-19-at-18.40.50.png 600w, https://jamesthorne.com/content/images/size/w1000/2022/05/Screenshot-2022-05-19-at-18.40.50.png 1000w, https://jamesthorne.com/content/images/2022/05/Screenshot-2022-05-19-at-18.40.50.png 1296w" sizes="(min-width: 720px) 720px"><figcaption>When first opening Mendeley, you have to wait minutes for a sync before you can search.</figcaption></figure><p>So, what do you do? Just wait for the sync to finish? Even after syncing, the search box will be frozen for an undefined period of time. What&apos;s happening here? </p><h2 id="finding-another-tool">Finding another tool</h2><p>Thankfully, there are a large number of reference managers available. Exploring <a href="https://alternativeto.net/software/mendeley/">AlternativeTo</a> shows many alternatives such as Zotero, Qiqqa, and more! I look forward to exploring these tools in more depth and seeing how they can support, rather than hinder, my writing style.</p><p>I&apos;ll start by trying out Zotero, which I&apos;ll document fully in another blog post. So far, my first impressions have been pretty positive. I could import my library from Mendeley, and I didn&apos;t need to create an account before I started using it. 
This is a great start.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/05/Screenshot-2022-05-21-at-18.42.52.png" class="kg-image" alt="Why I stopped using Mendeley as a reference manger" loading="lazy" width="1001" height="608" srcset="https://jamesthorne.com/content/images/size/w600/2022/05/Screenshot-2022-05-21-at-18.42.52.png 600w, https://jamesthorne.com/content/images/size/w1000/2022/05/Screenshot-2022-05-21-at-18.42.52.png 1000w, https://jamesthorne.com/content/images/2022/05/Screenshot-2022-05-21-at-18.42.52.png 1001w" sizes="(min-width: 720px) 720px"></figure><p>My priorities for my reference manager are:</p><ul><li>Good search</li><li>Quickly exporting references</li><li>Syncing to multiple devices</li><li>Robust / exportable annotations / highlights on PDF files</li></ul><p>I&apos;ll write up my feedback on these after the import has finished.</p>]]></content:encoded></item><item><title><![CDATA[Cambridge OU12 Module: HPC for deep learning guidebook]]></title><description><![CDATA[<figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/_cLFrxZJsMo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://docs.google.com/presentation/d/1kbLp5wVr5CCcGSoOl6pmQXdtABYVwKKtX3ZvVo4JHYs/edit#slide=id.p"><div class="kg-bookmark-content"><div class="kg-bookmark-title">OU12 HPC RDP module</div><div class="kg-bookmark-description">HPC best practices for deep learning How not to waste time, money, or resources James Thorne</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://ssl.gstatic.com/docs/presentations/images/favicon5.ico" alt><span class="kg-bookmark-author">Google Docs</span></div></div><div class="kg-bookmark-thumbnail"><img 
src="https://lh4.googleusercontent.com/zqSGDUdIOeAER6MEoqxIXKuOBRo_6_0DHNAwP1RfZ7wvhZc5s0wHa7mPw3lXu5Y8VZAhiP9cD3BacA=w1200-h630-p" alt></div></a></figure><p>This guidebook accompanies the short video tutorial for the OU12 module. It contains all the scripts and examples that support the material and learning points in the video.</p>]]></description><link>https://jamesthorne.com/blog/hpc-tutorial/</link><guid isPermaLink="false">62422c01b7840d0001f34b35</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Mon, 28 Mar 2022 21:47:23 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1558494949-ef010cbdcc31?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHNlcnZlcnxlbnwwfHx8fDE2NTI5ODQ0NDU&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/_cLFrxZJsMo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://docs.google.com/presentation/d/1kbLp5wVr5CCcGSoOl6pmQXdtABYVwKKtX3ZvVo4JHYs/edit#slide=id.p"><div class="kg-bookmark-content"><div class="kg-bookmark-title">OU12 HPC RDP module</div><div class="kg-bookmark-description">HPC best practices for deep learning How not to waste time, money, or resources James Thorne</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://ssl.gstatic.com/docs/presentations/images/favicon5.ico" alt="Cambridge OU12 Module: HPC for deep learning guidebook"><span class="kg-bookmark-author">Google Docs</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://lh4.googleusercontent.com/zqSGDUdIOeAER6MEoqxIXKuOBRo_6_0DHNAwP1RfZ7wvhZc5s0wHa7mPw3lXu5Y8VZAhiP9cD3BacA=w1200-h630-p" alt="Cambridge OU12 Module: HPC for 
deep learning guidebook"></div></a></figure><img src="https://images.unsplash.com/photo-1558494949-ef010cbdcc31?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fHNlcnZlcnxlbnwwfHx8fDE2NTI5ODQ0NDU&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" alt="Cambridge OU12 Module: HPC for deep learning guidebook"><p>This guidebook accompanies the short video tutorial for the OU12 module. It contains all the scripts and examples that support the material and learning points in the video. It assumes that you already have basic familiarity with the cluster, which you can learn about here: <a href="https://docs.hpc.cam.ac.uk/hpc/user-guide/quickstart.html">https://docs.hpc.cam.ac.uk/hpc/user-guide/quickstart.html</a></p><h2 id="tip-1-quick-experimentation-without-using-the-head-nodes">Tip 1: Quick experimentation, without using the head nodes</h2><p>The HPC cluster has two classes of nodes: worker nodes and login (or head) nodes. Generally, it is not advisable to run any long-running or resource-intensive scripts on the login nodes.</p><p>For testing and evaluating your scripts, the interactive running mode will provide near-instant access to the resources you need (including GPUs) on a worker node for up to one hour.</p><p>Whether submitting a batch script or using the interactive terminal, the <code>--qos</code> flag will change the job priority to allow for testing.</p><pre><code class="language-bash">sintr --qos=INTR [args] # For interactive shell
sbatch --qos=INTR [args] # To submit batch job 
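# For example, an interactive session with one GPU for an hour might look
# like the line below (the partition and GPU flags are illustrative;
# check the cluster documentation for the current partition names):
#   sintr --qos=INTR -p ampere --gres=gpu:1 -t 1:00:00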
</code></pre><h2 id="tip-2-dont-duplicate-your-scripts-use-the-same-scripts-for-your-machine-and-the-cluster">Tip 2: Don&apos;t duplicate your scripts. Use the same scripts for your machine and the cluster</h2><p>Making copies of scripts, keeping different versions for different experiments, and hard-coding lots of parameters is an easy way to make mistakes. As you add more and more scripts to your project, testing different configurations, the number of scripts that you need to update and maintain can become a problem.</p><p>Think back to the separation-of-concerns principle. By separating out experiment configuration and platform-specific (e.g. laptop vs HPC cluster) behaviours, most of the headaches you&apos;d expect to encounter can be mitigated. When you submit jobs to <code>sbatch</code>, the script file is just a bash script with some metadata in a comment block at the top. In principle, this should run on your machine in exactly the same way. In my experiments, I set platform-specific behaviours (cluster vs laptop), the project, and the configuration in separate files, allowing for greater portability.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2021/12/Screenshot-2021-12-04-at-06.58.16.png" class="kg-image" alt="Cambridge OU12 Module: HPC for deep learning guidebook" loading="lazy"></figure><p>Specific configuration for the cluster may involve:</p><ul><li>Path to Python executables / virtual environment</li><li>Where to load data from</li><li>Where to store results and log files</li><li>Network port for DDP communication</li></ul><p>Most of these, however, shouldn&apos;t be hard-coded in your bash file. 
And if they are, relative paths and directories allow for greater portability.</p><h3 id="setting-paths-before-submitting-job">Setting paths before submitting a job</h3><p>The Python path can be set using the <code>$PATH</code> environment variable, which could be set in your <code>~/.bashrc</code> file on the cluster at login. Or, if you have a project-specific installation, it could be set in a shell script used to configure the environment. Similarly, any working folders or data directories can be set as environment variables.</p><p>For example:</p><pre><code class="language-bash">#!/bin/bash
# project_setup.sh
export PATH=$PATH:~/path/to/my/env 
export DATA_DIR=~/rds/hpc-work/project_data
</code></pre><p>Then running the following when submitting jobs for your project.</p><pre><code>source ./project_setup.sh
</code></pre><p>Unless set otherwise, SLURM will preserve your environment when running jobs, meaning that any environment variables or paths set before submitting the job will be passed through. It is also possible to set environment variables manually with SLURM, but this can be a bit of a pain.</p><h3 id="passing-arguments-to-sbatch">Passing arguments to sbatch</h3><p>Just as arguments can be passed to conventional bash scripts, the same behaviour can be used with sbatch. For example, create a bash script called &apos;test.sh&apos;:</p><pre><code>#!/bin/bash
echo $1 
echo $2
</code></pre><p>And running it as <code>bash test.sh hello world</code> will output the following:</p><pre><code>hello
world
</code></pre><p>Similarly, submitting the same bash script as a job with sbatch (assuming the accounting info, runtime, etc. are provided) will output the same result. For example: <code>sbatch [...args...] test.sh hello world</code>. This allows you to easily switch between config files for experiments, or manually override parameters, without having to duplicate your script.</p><h3 id="setting-default-behaviour-within-job-script">Setting default behaviour within the job script</h3><p>Optional parameters in bash can be given default values for when an environment variable or argument is not set, using the pattern <code>${VAR_NAME:-DEFAULT_VALUE}</code>. In the first line below, the first argument to the script sets the config file (<code>bash myscript.sh custom_config.json</code>); if it is not given, the default value of config.json is used. In the second line, the batch size is read from an environment variable (<code>BATCH_SIZE=4 bash myscript.sh</code>): this might be useful when testing on a GPU with less memory than is available on the cluster.</p><pre><code class="language-bash">CONFIG_FILE=${1:-config.json}
BATCH_SIZE=${BATCH_SIZE:-16}
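# illustrative: echo the effective settings so each job log records which
# config and batch size the run actually used
echo Using config $CONFIG_FILE with batch size $BATCH_SIZE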
</code></pre><h2 id="tip-3-run-end-to-end-on-a-small-dataset-to-identify-errors-early">Tip 3: Run end-to-end on a small dataset to identify errors early</h2><p>Sometimes small changes in how data is loaded or pre-processed can cause failures that don&apos;t manifest until after training has finished (for example, not saving the weights file to the correct directory), wasting countless hours of both your time (waiting for results) and the resources on the cluster (such as quota, priority, or limits). It&apos;s essential to run the scripts end-to-end, including training and evaluation, on a small subset of the data before training on a large dataset. To ensure that every experiment is tested, this should (ideally) be performed as part of your main bash or Python script.</p><p>The starting point is to make sure that things work locally before you even submit your job. Debugging on the cluster can be a bit of a pain, so if you fix most of the faults locally before running your script, the faults you&#x2019;re likely to encounter on the cluster should be simple to diagnose and recover from. Following tip #2 can help with this.</p><p>But, on the cluster, faults always happen when you least expect them.<br>They can occur in any part of your code: whether it&apos;s running out of GPU memory, a missing file, not being able to parse your dataset, or checkpoints not saving or being corrupted upon saving. These behaviours may be different on the cluster and it&apos;s important to make sure all aspects of the script work on the cluster before spending a lot of money / credit training on the full datasets.</p><p>There are two ways to check: editing your Python code to test these functions, or simply doing a dry run with exactly the same code. Some libraries perform basic smoke tests with the first option, for example, running an evaluation pass on a sample of data before training. But this does not capture all aspects of the dataset. 
The second option is easier: train and evaluate on a small sample of your data. A small sketch of the idea is provided below.</p><pre><code class="language-bash">head -n 100 data/my_dataset.jsonl &gt; data/sanity_check_data.jsonl
python train.py data/sanity_check_data.jsonl output/sanity_check
python evaluate.py data/sanity_check_data.jsonl output/sanity_check
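# stop before the expensive full run if the sanity check failed; $? holds
# the exit status of the last command (zero means success)
if [ $? -ne 0 ]; then echo Sanity check failed; exit 1; fi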
python train.py data/my_dataset.jsonl output/full_model
</code></pre><h2 id="tip-4-checking-for-failure">Tip 4: Checking for failure</h2><p>There are a number of ways your script can fail: whether it&apos;s your script crashing or being terminated by the scheduler (such as running out of time or being preempted). It&apos;s best practice to listen for these events so that you can debug your script if it crashes, or resubmit the job if it runs out of time.</p><p>Regularly checkpointing helps you limit the damage of unscheduled termination of your job. There are a lot of things on the cluster that may be outside of your control and that can go wrong too, such as running out of disk space or having the network file system go down. All of these can be quite disruptive and mean that if your script isn&#x2019;t able to handle those events, you could lose a few hours of work and struggle to work out which jobs completed OK and which ones need to be resubmitted or re-run.</p><p>In other facilities, though not at Cambridge, higher-priority jobs submitted to the queue can cause low-priority jobs to terminate. Right before conference deadlines, the last thing you&#x2019;d want is to struggle to re-train your entire model. Being able to save to and load from checkpoints is a great timesaver that would allow you to re-submit your job and resume from where you left off, assuming your code allows for it.</p><h3 id="checking-for-crashes">Checking for crashes</h3><p>There are a couple of easy ways to check whether your code finished successfully: the simplest is to check for a non-zero exit code in bash, log it, and use it to mark your job as failed. In Linux, processes return a non-zero exit code if something went wrong. This is exposed as a special shell variable, denoted by a question mark, which will be zero if the process was OK and something else otherwise.</p><pre><code class="language-bash">ERROR_CODE=$?
if [ $ERROR_CODE -ne 0 ]
then
 echo &quot;Non-zero exit code: $ERROR_CODE&quot;
 exit $ERROR_CODE
fi
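
# Alternatively, write a marker file once everything has finished, and
# check for its existence later (the path here is illustrative):
touch $OUTPUT_DIR/COMPLETED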
</code></pre><p>The other option is to save a marker file, along with the results, after your script has completed everything it needs to (such as training or testing a model), and check whether this file exists.</p><h3 id="listening-for-messages-from-the-scheduler">Listening for messages from the scheduler</h3><p>The scheduler can terminate your job if it is running out of time. Sadly, on the HPC cluster at Cambridge, jobs have a fairly small time limit of 12 or 36 hours depending on the service level. Your script can be set up to listen for a message when it is about to be terminated, and use it to save a checkpoint or even requeue the job. These messages can be handled by your Python code or by your bash script.</p><p>The scheduler can be configured to send a signal T seconds before your job runs out of time. This makes it really useful for jobs which need to run beyond the time limits that have been configured here at Cambridge. You can also send a signal manually with the scancel command. 
The argument <code>--signal XXX@TTT</code> sends signal XXX, TTT seconds before termination. For example, <code>--signal 10@60</code> sends signal 10 (SIGUSR1) 60 seconds before termination so that your script can save the weights and resubmit itself.</p><p>More information about handling signals in bash can be found <a href="https://www.linuxjournal.com/content/bash-trap-command">here</a> and more information about handling signals in Python can be found <a href="https://docs.python.org/3/library/signal.html">here</a>.</p><p>Jobs can be resubmitted in bash by calling <code>scontrol requeue $SLURM_JOB_ID</code>.</p><h2 id="tip-5-dont-be-greedy">Tip 5: Don&apos;t be greedy</h2><p>To manage queue time, you need to choose the right service level and ask for the right resources.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2021/12/Screenshot-2021-12-04-at-07.58.22.png" class="kg-image" alt="Cambridge OU12 Module: HPC for deep learning guidebook" loading="lazy"></figure><p>Most of the jobs you&#x2019;ll submit will be submitted as either SL2 or SL3. SL3 is a free service level which offers a few thousand hours of HPC time per quarter. If you exceed this allowance, you get demoted to SL4, which has a lower priority. If you want a higher priority, you can pay a few pence per hour to use service level 2. SL2 jobs allow more resources and longer runtimes. If you exceed the limits set by the cluster, your job may not be accepted, or may wait forever.</p><p>Even though SL2 allows jobs to run for up to 36 hours, smaller jobs can be better. The scheduler has to reserve time for the entire length of your job. If you choose a shorter runtime, the scheduler will try to fit these smaller jobs in and around larger jobs. 
If you request only what you need, your job may be run more quickly as the scheduler backfills the queue.</p><p>Another aspect of the cluster is that SLURM has a <a href="https://slurm.schedmd.com/fair_tree.html">fairshare</a> formula that prioritises jobs based on how much you&#x2019;ve used before. So if you&#x2019;re a heavier user, you may be waiting longer. The moral of the story is: don&#x2019;t be greedy. If your job isn&#x2019;t going anywhere, there probably isn&#x2019;t much point running it to completion. If you can terminate early, you&#x2019;ll be saving credits as well as reducing your share of CPU time on the cluster, meaning that you may get higher priority later if you need it.</p><p>If you can manage checkpointing, resuming and job rescheduling, you might not need to submit 36-hour-long jobs. But don&#x2019;t submit jobs that are too small: there&#x2019;s a lot of scheduling overhead that could slow the cluster for other users. It&#x2019;s probably a bit pointless to run lots of small jobs if they each take less than an hour.</p><h2 id="tip-6-disk-and-io">Tip 6: Disk and I/O</h2><p>There are many storage tiers on the cluster with different quotas, backup options, and access speeds / latency.</p><p>The primary storage for models and work is the research data store, which is mounted at <code>~/rds/hpc-work/</code> and comes with 1 TB of space. You do get a 10% grace over the storage quota for a few days, but don&#x2019;t rely on this to excuse poor management of your working files. This network file system storage persists after your job is finished. But it has high latency, and isn&apos;t backed up.<br>The <code>~/</code> home directory is regularly backed up, but has a much more limited quota. 
The <code>/local/</code> file system on each node provides much lower latency, but doesn&apos;t persist after the job.</p><p>You can purchase additional storage <a href="https://docs.hpc.cam.ac.uk/hpc/user-guide/io_management.html">from this page</a>.</p><h3 id="save-space">Save space</h3><p>Firstly, don&#x2019;t save more than you need. For large models with millions or billions of parameters, each checkpoint that you save can be a few gigabytes. So, if you&#x2019;ve got a lot of experiments or are running your models for a long time, you can easily eat up all of your disk quota.</p><p>The first trick is to remove checkpoints after you&#x2019;ve finished training and testing the model. If you really do need to keep checkpoints, it may only be worth keeping one or two. You probably don&#x2019;t need to resume training or make predictions from every epoch of the model.</p><h3 id="save-time">Save time</h3><p>If you are regularly saving checkpoints, don&#x2019;t save them too frequently. Disk access can be quite slow, and the time spent saving model state can eat into the time you spend training, meaning that the HPC balance you spend on GPUs isn&#x2019;t well used. Choosing an appropriate checkpoint frequency is a balancing act, depending on the size of your model and the length of training.</p><p>Caching your features can save time if you have to do a lot of pre-processing for all your instances.</p><h3 id="save-files">Save files</h3><p>A common pitfall is to overwrite your old experiments if you re-run your script. A good way to stop clobbering files is to use the SLURM job number in the directory name, limiting the blast radius to just one experiment.</p><p>After your experiments have finished training, you need to think about how to manage your assets. 
The hpc-work folder you are given has fast access and 1 TB of storage, but it isn&apos;t backed up.</p><h2 id="tip-7-only-use-versioned-code">Tip 7: Only use versioned code</h2><p>It is important that all experiments and results can be reproduced: small changes in code can lead to substantial changes in model performance and these need to be identified. There&apos;s nothing worse than having different results and not knowing what you&apos;ve changed that caused the performance to change.</p><p>A good way to prevent this from happening is to only run your experiments with code that is managed in Git. This way, you can always connect your results with an exact version of your code. You can use a library such as <a href="https://gitpython.readthedocs.io/en/stable/">gitpython</a> to save the commit hash or tag into your experiments folder. Some libraries, such as HuggingFace, have support for this.</p><p>When uploading your code to the HPC, it&apos;s important to ensure that you don&apos;t have any files that are changed but not committed. Part of this requires personal discipline, such as only pulling files from GitHub into your repository. But, if you want to be strict about it, you could check for uncommitted changes before running.</p><p>You should also make a log of which libraries you are using, saving these in a requirements file. You can dump everything you&apos;re using by running <code>pip freeze</code> and saving the output to a text file.</p><h2 id="tip-8-log-more-than-you-need">Tip 8: Log more than you need</h2><p>In the same vein, logging everything can help you identify what caused errors and what contributes to your results. One of the easiest things you can change is to call bash with the <code>-x</code> flag, which will echo all commands back; these will get picked up by the log file that SLURM saves. 
Also, save all the parameters used in the model into a file which you can use later, and log things like GPU status, just in case there&apos;s any problem that you need to raise with the HPC admins.</p><pre><code>bash -x test.sh hello world
+ echo hello
hello
+ echo world
world
</code></pre><p>To understand whether your parameters are appropriate, and whether your model is diverging, logging the loss and accuracy helps. Tools such as <a href="https://www.tensorflow.org/tensorboard/">TensorBoard</a> and <a href="https://pypi.org/project/visdom/">Visdom</a> can help visualise this. These change quite frequently and there are plenty of tutorials to help you get started with them.</p><p>Another sanity check which can save you a lot of time when debugging is to print a sample of the features, validating that the preprocessing is working as you expect.</p><h2 id="tip-9-dont-spam-the-scheduler">Tip 9: Don&apos;t spam the scheduler</h2><p>Don&apos;t submit one job for each hyperparameter choice. SLURM, and most other schedulers, support array jobs. This is one job, but with many parts. For example, calling sbatch with <code>--array 1-1000</code> will create a job with 1000 parts; you can access the ID of each part with the environment variable <code>$SLURM_ARRAY_TASK_ID</code>. This means that you don&apos;t need to call the scheduler programmatically. As a rule of thumb, if you&apos;re programmatically calling the scheduler to submit jobs, you&apos;re probably doing it wrong.</p><p>You can find more info about array jobs on the SLURM website: <a href="https://slurm.schedmd.com/job_array.html">https://slurm.schedmd.com/job_array.html</a></p><h2 id="tip-10-automate-hyperparamter-search-model-selection-and-reporting">Tip 10: Automate hyperparameter search, model selection, and reporting</h2><p>The final point is: don&apos;t manually search for hyperparameters. Do it programmatically; this helps reduce your bias and saves you a bunch of time waiting for jobs to finish. This isn&apos;t a new problem and, if you look around, there are countless frameworks for model optimization available. 
And it&apos;s not too difficult to write your own either.</p><p>Check out:</p><ul><li><a href="https://www.ray.io/ray-tune">Ray Tune</a></li><li><a href="https://optuna.org/">Optuna</a></li><li><a href="https://wandb.ai/site">Weights &amp; Biases</a></li><li><a href="https://neptune.ai/">Neptune</a></li></ul><p>To prevent mistakes when copying the results into your thesis or paper, see if you can dump a CSV file or print a pre-formatted LaTeX table as part of your model output.</p>]]></content:encoded></item><item><title><![CDATA[Travelling to Korea and the Quarantine Arrangements]]></title><description><![CDATA[<p>I&apos;ve recently travelled to Korea. While most of Europe has allowed quarantine-free travel, most visitors to Korea must undergo mandatory quarantine. And, without a long-term visa or residency status, this must be done in a government facility. I found it difficult to find all the information I</p>]]></description><link>https://jamesthorne.com/blog/arriving-in-korea-under-quarantine-regulations/</link><guid isPermaLink="false">62025ac59583da0001f97302</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Tue, 08 Feb 2022 13:06:15 GMT</pubDate><media:content url="https://jamesthorne.com/content/images/2022/02/qrt.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://jamesthorne.com/content/images/2022/02/qrt.jpg" alt="Travelling to Korea and the Quarantine Arrangements"><p>I&apos;ve recently travelled to Korea. While most of Europe has allowed quarantine-free travel, most visitors to Korea must undergo mandatory quarantine. And, without a long-term visa or residency status, this must be done in a government facility. 
I found it difficult to find all the information I needed to fly because it was spread across a lot of different places and people publishing blogs had different experiences from mine, so here&apos;s my take.</p><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x203C;&#xFE0F;</div><div class="kg-callout-text">The guidance for arrival in Korea has changed recently. This post describes my travel at a specific time (December 2021). There are now different systems in place (Q-Code) that should be followed.</div></div><h2 id="before-flying">Before Flying</h2><p>I was travelling from the UK and, luckily, the UK is under a visa waiver programme with Korea. However, I did need to register for an electronic travel authorisation (K-ETA) before I flew. This had a small cost and was processed in less than an hour from the government&apos;s <a href="https://www.k-eta.go.kr/">website</a>. </p><p>The second thing I needed before flying, like most travel these days, was a negative PCR test result. Korea is very particular about the test types and the certificate they require, and this is all set out in <a href="https://overseas.mofa.go.kr/gb-en/brd/m_23721/view.do?seq=5&amp;page=1">this information page</a>. I got my test on a Monday and flew out on a Thursday. Although I was worried about there being delays to my PCR test processing, it was performed very quickly and I had my result in less than 4 hours using the Southampton Airport PCR test service (the nearest test centre to where I was living before flying from London Heathrow). </p><p>When you fly to Korea, you <u>do not</u> need to register or book the quarantine in advance of flying. I turned up and went straight to the government-managed hotel quarantine centre with the shuttle bus service provided from the airport. I&apos;ll give more information about this later.</p><p>On arrival you will need to provide contact details of where you will be staying (after the quarantine). 
If you have a Korean friend, you can use their number (but check with them first). Otherwise, you can purchase a pre-paid SIM card, which you can collect from the airport. But this needs to be purchased 24 hours before arrival.</p><p>My primary piece of advice would be to print a couple of copies of all documentation (including the PCR test and K-ETA form). Paper copies are mandatory on arrival in Korea and you don&apos;t want to be stuck! Bring a pen too!</p><h2 id="taking-off-from-london-heathrow">Taking off from London Heathrow</h2><p>When checking in to my flight, a boarding pass could not be issued by my airline until I had undergone a document check. I flew with Lufthansa and, while they do have an electronic document verification system, I didn&apos;t have all the information I needed before my flight. Instead, I had to go to a check-in counter instead of the self bag-drop, and the check-in assistant went through all my paperwork before the boarding pass was issued.</p><h2 id="before-landing">Before Landing</h2><p>On the flight, the attendants provided all the required paper forms needed for arrival, including the health declaration, travel declaration, customs declaration and landing card. 
There are plenty of forms to fill out, which will keep you busy in the last hour of your flight.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_4554@0.5x.jpg" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" width="1512" height="2016" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_4554@0.5x.jpg 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/IMG_4554@0.5x.jpg 1000w, https://jamesthorne.com/content/images/2022/02/IMG_4554@0.5x.jpg 1512w" sizes="(min-width: 720px) 720px"></figure><h2 id="landing-at-incheon-airport">Landing at Incheon Airport</h2><p>After landing at Incheon airport from an international flight, all passengers have to queue to go through an initial health screening. There&apos;s a pretty large queue for this and, depending on how many flights arrive at the same time, there could be some wait. For me this took 45 minutes. I wasn&apos;t really worried, though, as I was about to spend the next 10 days in a quarantine hotel and this time meant there was one less hour I needed to spend in solitary confinement.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_4571.JPG" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" width="1080" height="1920" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_4571.JPG 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/IMG_4571.JPG 1000w, https://jamesthorne.com/content/images/2022/02/IMG_4571.JPG 1080w" sizes="(min-width: 720px) 720px"></figure><p>Near the front of the queue, there are a number of signs describing the quarantine procedure and copies of </p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_4576.jpg" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" 
width="2000" height="2667" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_4576.jpg 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/IMG_4576.jpg 1000w, https://jamesthorne.com/content/images/size/w1600/2022/02/IMG_4576.jpg 1600w, https://jamesthorne.com/content/images/size/w2400/2022/02/IMG_4576.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><p>At the end of the queue is an initial document screening. You submit some of the paper forms and the health declaration, as well as a copy of your paper PCR test certificate. Once you&apos;ve passed this health screening, your passport will be given a sticker and you can then proceed to the main arrivals area of the terminal.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_5825.jpg" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" width="2000" height="1771" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_5825.jpg 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/IMG_5825.jpg 1000w, https://jamesthorne.com/content/images/size/w1600/2022/02/IMG_5825.jpg 1600w, https://jamesthorne.com/content/images/size/w2400/2022/02/IMG_5825.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><p>In the main arrivals area of the terminal, I picked up the SIM card I ordered and used the bathroom before continuing to the immigration desks. At the immigration desks, I gave my passport and the printed copy of the K-ETA to the immigration assistant, who then gave me a tag to wear for the next station. Immigration is completed in three steps. The first step required me to install a quarantine tracking app on my phone, and they checked that the contact number I had provided was valid by calling my contact in front of me. The second step, at the immigration desk, was the confirmation of visa status, and the final step was about my quarantine status. 
These were handled at three separate desks.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_4580-2.jpeg" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" width="1080" height="1920" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_4580-2.jpeg 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/IMG_4580-2.jpeg 1000w, https://jamesthorne.com/content/images/2022/02/IMG_4580-2.jpeg 1080w" sizes="(min-width: 720px) 720px"></figure><p>For the tag, it seemed like there were two colours: yellow and red. I was the red type as I arrived on a tourist visa exemption (B1 type) and had to go to a secondary area to sign some forms consenting to undergo government quarantine and receive an official order from the health office to quarantine. From what I saw, the people with the yellow tags had other arrangements and weren&apos;t going to the government quarantine centre. &#xA0;In the secondary area, the staff were very friendly and things were resolved fairly quickly. </p><h2 id="after-immigration">After Immigration</h2><p>After immigration was completed, I could collect my bag and then proceed through the customs area. I still had to wear this tag until I got on the bus to the quarantine centre. After customs, I was able to access some facilities in the arrivals area and go to a small convenience store to buy snacks. Some of the snacks I bought needed a microwave, which meant that I couldn&apos;t eat them at my quarantine hotel. Choose wisely!</p><p>I had to wait about 15 minutes for enough people to come through the airport and once there was a group of 6 or 7 of us, we were paraded through the airport to a secondary area preparing us for the bus. There was a very strong surveillance and police presence in the airport and all the doors were guarded. You only have one choice, and that&apos;s to get on the officially provided transport. 
</p><h2 id="on-the-bus">On the bus</h2><p>You don&apos;t know which quarantine centre you&apos;ll be going to until you&apos;re on the bus. My quarantine centre was in Gimpo, about 30 minutes away from Incheon airport and closer to Seoul. When we got to the quarantine centre, we had to wait a short time before being allowed in the hotel. &#xA0;</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_4586-2.jpg" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" width="2000" height="2667" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_4586-2.jpg 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/IMG_4586-2.jpg 1000w, https://jamesthorne.com/content/images/size/w1600/2022/02/IMG_4586-2.jpg 1600w, https://jamesthorne.com/content/images/size/w2400/2022/02/IMG_4586-2.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><p>Check-in at the hotel was simple: we all arrived in one large room where we had an initial health screening, installed an app, and completed a form. We took the form to a doctor who then allowed us to proceed to the next area of the hotel. </p><p>In the next area of the hotel, we had to pay for the cost of the quarantine. This varies from hotel to hotel and you don&apos;t know how much it will cost until you arrive. For me, it cost 120,000 KRW per night: 1,200,000 KRW (&#xA3;770) in total. I could pay using my British debit card. They also gave us a cup ramen noodle snack and the key to our room after paying.</p><h2 id="in-the-room">In the room</h2><p>This is where 10 days get condensed into a couple of paragraphs. The routine at the hotel is fairly simple: you get three meals a day, take a temperature test every afternoon and have a PCR test on the day after you arrive and the day before you leave. I arrived on December 2nd (12/2, mid-morning) and I was able to check out at midnight on the 12th of December (12/12 00:00). 
This meant that I spent just under 10 days in quarantine, as the day of your arrival counts towards this time.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_4592-2.JPG" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" width="1080" height="1920" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_4592-2.JPG 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/IMG_4592-2.JPG 1000w, https://jamesthorne.com/content/images/2022/02/IMG_4592-2.JPG 1080w" sizes="(min-width: 720px) 720px"></figure><p>If you order deliveries to the hotel (not hot food), they will be delivered to your room at the next meal time. You can call the front desk to get other items delivered to the room such as water bottles. The hotel will give you 2L per day and 4L on your first day. There&apos;s a kettle for boiling water too. </p><p>The main downside of the hotel is that the food may be cold by the time you have your meal. While this is OK for some types of food, it can be pretty monotonous and make some food taste bad. By the end of the experience, I was craving something warm. While they do give cup ramens as a snack with some meals, I didn&apos;t eat these.</p><p>When checking out, you have a few options. You can either be picked up by a friend or taxi at a specified time, or use the shuttle provided by the government. 
This is the option I had on my check out form.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/IMG_4657-2.JPG" class="kg-image" alt="Travelling to Korea and the Quarantine Arrangements" loading="lazy" width="810" height="536" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/IMG_4657-2.JPG 600w, https://jamesthorne.com/content/images/2022/02/IMG_4657-2.JPG 810w" sizes="(min-width: 720px) 720px"></figure><h2 id="enjoying-my-time-in-korea">Enjoying my time in Korea</h2><p>After checking out of the quarantine, I was able to enjoy my time in Korea. While I had a lot of questions and anxiety before flying to Korea because of all the regulations and different information, it wasn&apos;t as hard as I thought it would be. I hope that if you plan to visit, this guide will help you with any questions you may have. Please feel free to message me on Twitter @<a href="https://twitter.com/j6mes">j6mes</a> or Instagram @<a href="https://instagram.com/jp.thorne">jp.thorne</a> if you have any questions or things I should add.</p>]]></content:encoded></item><item><title><![CDATA[Adding binding offsets for printing in Latex, the easy way]]></title><description><![CDATA[<p>After completing my thesis, I needed to add binding offsets for printing. When I wrote my thesis, it was examined electronically and submitted as a PDF file. 
When you are preparing documents in this way, the margins are typically balanced on the left and right pages so that the content</p>]]></description><link>https://jamesthorne.com/blog/adding-binding-offsets-in-latex-the-easy-way/</link><guid isPermaLink="false">61ff3f0e9583da0001f972c8</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Sun, 06 Feb 2022 03:33:54 GMT</pubDate><media:content url="https://jamesthorne.com/content/images/2022/02/Screenshot-2022-02-08-at-22.12.12.png" medium="image"/><content:encoded><![CDATA[<img src="https://jamesthorne.com/content/images/2022/02/Screenshot-2022-02-08-at-22.12.12.png" alt="Adding binding offsets for printing in Latex, the easy way"><p>After completing my thesis, I needed to add binding offsets for printing. When I wrote my thesis, it was examined electronically and submitted as a PDF file. When you are preparing documents in this way, the margins are typically balanced on the left and right pages so that the content appears centrally on the page. However, when you print the document, about 5mm of the page is lost in the spine, so to remedy this, a binding offset is added to account for this loss.</p><p>In Latex, the geometry package allows this to be added really easily:</p><!--kg-card-begin: markdown--><pre><code>\usepackage[bindingoffset=5mm]{geometry}
</code></pre>
<!--kg-card-end: markdown--><p>However, adding this binding offset changes the width of the content pane (reducing it by 5mm), which causes things like my tikz figures and carefully crafted line endings (avoiding runts and widows etc) to break, adding a number of spurious line breaks and page overflows.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2022/02/Screenshot-2022-02-06-at-12.34.52.png" class="kg-image" alt="Adding binding offsets for printing in Latex, the easy way" loading="lazy" width="1440" height="892" srcset="https://jamesthorne.com/content/images/size/w600/2022/02/Screenshot-2022-02-06-at-12.34.52.png 600w, https://jamesthorne.com/content/images/size/w1000/2022/02/Screenshot-2022-02-06-at-12.34.52.png 1000w, https://jamesthorne.com/content/images/2022/02/Screenshot-2022-02-06-at-12.34.52.png 1440w" sizes="(min-width: 720px) 720px"></figure><p>Instead, the easiest thing to do is to take the compiled PDF (thesis.pdf) and insert it into a new document with a 5mm offset. Use <code>frame=true</code> to show where the page border sits while checking the offset. This fixes things without changing the width of the content frame.</p><!--kg-card-begin: markdown--><pre><code>\documentclass[a4paper, twoside]{article}
\usepackage[utf8]{inputenc}
\usepackage{geometry}

\usepackage{pdfpages}
\begin{document}
% offset shifts each page 5mm horizontally; with the twoside class option,
% pdfpages should mirror the shift on even pages, moving both sides away from the spine
\includepdf[pages=-,fitpaper=true, offset=5mm 0mm,frame=false]{thesis.pdf}
\end{document}
</code></pre>
<!--kg-card-end: markdown--><p>Easy</p>]]></content:encoded></item><item><title><![CDATA[Managing environments in Docker for artificial intelligence research]]></title><description><![CDATA[<p>Docker is a great resource for virtualizing environments. It makes deploying code and work to any environment super simple. In research, it also helps make projects repeatable and shareable through pre-defined environments.</p><p>In research, it&apos;s important to keep the following parts of your project separate:</p><ul><li>Environment - libraries,</li></ul>]]></description><link>https://jamesthorne.com/blog/docker-for-ai-experiments/</link><guid isPermaLink="false">61fe169522c5950001ff5926</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Sat, 09 May 2020 20:28:38 GMT</pubDate><media:content url="https://jamesthorne.com/content/images/2022/02/Moby-logo.png" medium="image"/><content:encoded><![CDATA[<img src="https://jamesthorne.com/content/images/2022/02/Moby-logo.png" alt="Managing environments in Docker for artificial intelligence research"><p>Docker is a great resource for virtualizing environments. It makes deploying code and work to any environment super simple. In research, it also helps make projects repeatable and shareable through pre-defined environments.</p><p>In research, it&apos;s important to keep the following parts of your project separate:</p><ul><li>Environment - libraries, packages and datasets that your research depends on</li><li>Code - controls what your research is project is going to do</li><li>Configuration - parameters and design choices you control for experiments</li><li>Results - the logs, outcomes, pre-trained models and performance metrics of your experiments </li></ul><p>This tutorial will focus on preparation of an environment that easily enables development, debugging and large scale experimentation. </p><p>Docker is a virtual machine that contains a single application. 
Unlike the full virtual machines created by tools like VirtualBox and VMware, containers are defined through simple configuration scripts called Dockerfiles that describe what files, libraries and software are present. Containers can bind files and ports on your local machine or storage network to read and write files. </p><p>Installing Docker is easy and can be done by following <a href="https://docs.docker.com/get-docker/">these instructions</a>. There are extensions from NVIDIA that enable GPU passthrough <a href="https://github.com/NVIDIA/nvidia-docker">here</a>.</p><p>On HPC clusters, <a href="https://sylabs.io/docs/">Singularity</a> may be installed instead of Docker. It enables (almost) seamless integration so that your code can be scaled up and deployed without much hassle. </p><p>This tutorial will focus on getting your project running in a Docker environment, covering 5 key tips:</p><ol><li>Build a common base image if you have lots of projects</li><li>Keep data separate</li><li>Don&apos;t forget to mount paths for your cache directories</li><li>Keep your images debuggable</li><li>Keep your images archived and repeatable</li></ol><h2 id="base-environment">Base environment</h2><p>Depending on the project, it&apos;s useful to build your code on top of a predefined instance. This could be the AllenNLP docker image, Tensorflow image or a Miniconda instance.</p><p>In my research, I use the Miniconda template as a base image; it&apos;s a raw Python3 environment on top of which I install all my packages and resources. </p><p>Create a new directory and add something similar to this in a file named <code>Dockerfile</code></p><pre><code class="language-Dockerfile">FROM continuumio/miniconda3

# read by the NVIDIA container runtime to expose all GPUs inside the container
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

# point framework caches at the /local/cache volume so downloads persist between runs
ENV TORCH_HOME=/local/cache
ENV ALLENNLP_CACHE_ROOT=/local/cache
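# (assumption) if you also use HuggingFace libraries, pointing HF_HOME at the
# same volume keeps those downloads on the mounted cache too; harmless otherwise
ENV HF_HOME=/local/cache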

RUN apt-get update
RUN apt-get install -y --no-install-recommends \
    zip \
    gzip \
    make \
    automake \
    gcc \
    build-essential \
    g++ \
    cpp \
    libc6-dev \
    unzip \
    nano \
    rsync
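
# (optional) clear the apt package lists to keep the base image smaller
RUN rm -rf /var/lib/apt/lists/*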

RUN conda update -q conda

ADD requirements.txt /tmp
RUN pip install -r /tmp/requirements.txt</code></pre><p>My base image contains some common programs and Python packages that I use in every project. The top line, beginning with <code>FROM</code>, imports the miniconda image released on Dockerhub.</p><p>I also install some utils for zipping and managing files and building software using apt before installing my Python packages. These dependencies are useful for debugging and exporting results from your projects and save some headaches down the line.</p><p>My <code>requirements.txt</code> contains some packages that I use regularly and that take a long time to install, such as PyTorch and AllenNLP. You could adapt this specifically for your project. But the key thing here is that these are common to pretty much all my experiments and I set this up once. Every project will still have its own requirements file and dockerfile too!</p><pre><code>tqdm
torch
torchvision
allennlp&gt;=0.9</code></pre><p>The next step is to build your base image. &#xA0;This is a single one-liner that you can run from your project folder which will build the docker image with the name <code>my_base_image</code> ready for use!</p><pre><code>docker build -t my_base_image .</code></pre><p>For versioning it&apos;s recommended to use some kind of software configuration management like SVN or Git. GitHub offers a number of free private repositories to help get you started.</p><p>Rather than building the docker image manually every time, you can auto-build it every time you push to Git using a continuous integration service like <a href="https://travis-ci.org">Travis</a> or <a href="https://circleci.com/">CircleCI</a>. You can also push the newly-constructed image to Dockerhub so that you can download it on your HPC cluster, cloud server or local machine. </p><p>Add the following file to your project called <code>.travis.yml</code> to enable builds after setting up an account on Travis. &#xA0;This will push your base image to Dockerhub if a commit is made to the <code>master</code> branch on GitHub. Dockerhub gives hosting for one free private docker image. I&apos;d recommend using this for your project instance, not the base image (which is basically just a blank template).</p><pre><code>language: python
services:
  - docker
python:
  - &quot;3.6&quot;
stages:
    - before_install
    - install
    - name: after_success
      if: branch = master
before_install:
  - docker login -u $DOCKER_USER -p $DOCKER_PASS
script:
  - echo &quot;No script&quot;
after_success:
  - docker build -t $DOCKER_ACCT/docker-base-image .
  - docker tag $DOCKER_ACCT/docker-base-image $DOCKER_ACCT/docker-base-image:build-$TRAVIS_BUILD_NUMBER
  - docker push $DOCKER_ACCT/docker-base-image:build-$TRAVIS_BUILD_NUMBER
  - docker tag $DOCKER_ACCT/docker-base-image $DOCKER_ACCT/docker-base-image:latest
  - docker push $DOCKER_ACCT/docker-base-image:latest
  - echo &quot;Done&quot;</code></pre><p>Set the following secret environment variables in the project settings and you&apos;re ready to build!</p><pre><code>DOCKER_USER=&lt;name of docker user for login&gt;
DOCKER_PASS=&lt;password for login on docker&gt;
DOCKER_ACCT=&lt;name of docker account&gt;</code></pre><p><code>DOCKER_ACCT</code> should be the same as your username unless you are using a project/team on Dockerhub.</p><h2 id="project-image">Project Image</h2><p>Your project image can now extend the base image you&apos;ve just published to Dockerhub using the <code>FROM</code> directive in the <code>Dockerfile</code>.</p><p>The Dockerfile does 4 things: first we create directories for source, scripts and configs; we then create volumes for work (where output from models will go) and cache (where pre-trained models and partially computed results will go); then we install our requirements; and finally copy the files from our project into these directories. Some of these steps are specific to my projects, such as the SpaCy download.</p><pre><code>FROM mydockerhubaccount/docker-base-image

RUN mkdir -pv /local/src
RUN mkdir -pv /local/configs
RUN mkdir -pv /local/scripts

VOLUME /local/work
VOLUME /local/cache
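
# (assumption) declare the dataset directory as a volume too, matching the
# -v data mount used when running the image
VOLUME /local/data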

WORKDIR /local/
ADD requirements.txt /local/
RUN pip install -r requirements.txt
RUN python -m spacy download en

ADD src /local/src
ADD configs /local/configs
ADD scripts /local/scripts

ENV PYTHONPATH=/local/src</code></pre><p>It&apos;s important to order the steps in the Dockerfile correctly. Docker builds will cache the build &#x2013; so if you&apos;re not careful, updating the scripts would re-install the dependencies and download SpaCy again. Setting the <code>PYTHONPATH</code> is an easy way to add the directory containing project source to the list of directories Python will search for code in. It&apos;s a lot simpler than a local pip install as this doesn&apos;t require you to maintain a setup.py script.</p><p>We add the <code>src</code> directory to the docker image so that we can archive everything necessary to run your experiments in the image and repeat experiments at a later date. Although for debugging, this can become a hassle, so in practice, we locally mount this directory through the docker command line to allow us to make small changes without rebuilding the entire docker image.</p><p>The same <code>.travis.yml</code> from the previous step can be copied for this step too, changing the name of the docker image file to the one you&apos;re using for this project.</p><h2 id="running-your-project-locally">Running your project locally</h2><p>The project can be run locally using the docker command, you can run bash like this to get started for an interactive shell. The <code>-it</code> flag gives you an interactive terminal and the <code>--rm</code> flag will remove the shell after use. The <code>--rm</code> flag is important for experimental reproducibility as we always start from the same template state and we don&apos;t clog up our filesystem with containers with no useful info in them. &#xA0;GPUs can be mounted by following the instructions for nVidia docker.</p><pre><code>docker run --rm -it &lt;myimagename&gt; bash</code></pre><p>While this allows us to poke around and check if your scripts work, we also need to mount the cache and work directories. 
This will allow us to share the files and logs generated by your script once it&apos;s finished running. These can be mounted by adding flags like <code>-v $(pwd)/cache:/local/cache</code> to map local directories containing cache files, datasets and our outputs. </p><p>In the following example we set up an alias that maps local directories <code>data</code>, &#xA0;<code>work</code> and <code>cache</code> to those in the docker image. It is important not to include these in the docker image as large files can quickly eat space and cause docker push/pulls to be unacceptably slow. </p><p>To speed up development, we create an additional alias that accepts live changes to your code, configs and scripts by also mounting the <code>src</code>, <code>configs</code>, and <code>scripts</code> directories. </p><p>Typing out these long commands can be tedious: you can set up bash aliases to simplify things. The ones I use are below (note the --gpus flag, which will have to be removed if you don&apos;t have GPUs, and the single quotes, which delay expansion of $(pwd) until the alias is used):</p><pre><code>alias drun=&apos;docker run --rm -it -v $(pwd)/data:/local/data -v $(pwd)/cache:/local/cache -v $(pwd)/work:/local/work --gpus all&apos;
alias dtest=&apos;docker run --rm -it -v $(pwd)/data:/local/data -v $(pwd)/cache:/local/cache -v $(pwd)/work:/local/work -v $(pwd)/src:/local/src -v $(pwd)/scripts:/local/scripts -v $(pwd)/configs:/local/configs --gpus all&apos;</code></pre><p>The <code>drun</code> and <code>dtest</code> commands differ only in the local mounts for the src, configs and scripts directories. I typically use dtest when developing and drun when I&apos;m using a frozen version of my docker image: for papers I will tag a build with a specific label so I can replicate results.</p><p>Now, we can just run the following command</p><pre><code>dtest mydockerimagename command</code></pre><p>You can set up bash scripts for experiments or invoke python directly here.</p><h2 id="running-your-project-on-an-hpc-cluster">Running your project on an HPC cluster</h2><p>Some HPC clusters have <code>singularity</code> installed which will run pre-built docker images, allowing you to use the same environment on both local development machines and large scale compute clusters. The command has slightly different parameters to the docker command line interface and must also use a docker image that has been converted into a singularity image.</p><p>For the run command, the <code>-v</code> flags are replaced with <code>-B</code>, <code>--gpus</code> is replaced with <code>--nv</code> and we must also manually specify the working directory. Furthermore, singularity will mount the local root filesystem. You may have to check that your choice of working directory in the docker image doesn&apos;t clash (e.g. if you have scratch storage mounted to /local).</p><pre><code>singularity run --nv --pwd /work -B $(pwd)/data:/local/data -B $(pwd)/cache:/local/cache -B $(pwd)/work:/local/work -B $(pwd)/src:/local/src -B $(pwd)/scripts:/local/scripts -B $(pwd)/configs:/local/configs myimage.simg command</code></pre><p>The command is the same as above. 
The docker image must be compiled into a singularity image with the following command, replacing your dockerhub username and image name:</p><pre><code>singularity pull docker://dockerhubuser/imagename myimage.simg</code></pre><p>When running large-scale jobs, it is advisable to have a small configuration that runs on the head node to download any pre-trained models and embeddings prior to submitting the job to the queue. This helps prevent files from being clobbered if all your scripts start at once and try to download models to the same place without good file locking.</p><h2 id="debugging-your-project">Debugging your project</h2><p>The first step to debugging your projects is to use the <code>bash -x</code> flag when running scripts, which will echo commands back; this is especially useful when you have lots of environment variables and script parameters generated by other scripts.</p><p>For debugging Python, the Pro versions of modern IDEs like PyCharm enable development in docker instances, which can be debugged as an extra environment: <a href="https://www.jetbrains.com/help/pycharm/using-docker-as-a-remote-interpreter.html">see this link for details</a>.</p><p>If you use PDB, you can simply instantiate <code>pdb</code> from the terminal as your entrypoint to the application instead of running it locally.</p>]]></content:encoded></item><item><title><![CDATA[Backing up Overleaf content to GitHub]]></title><description><![CDATA[<p>Overleaf has become an essential tool for my academic work, allowing collaboration with my team and giving the ability for me to work on manuscripts on any computer without the need for extra tools. 
While the tooling has a large feature set and is reliable - even around major conference</p>]]></description><link>https://jamesthorne.com/blog/backing-up-overleaf-content-to-github/</link><guid isPermaLink="false">61fe169522c5950001ff5925</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Wed, 06 May 2020 10:51:21 GMT</pubDate><content:encoded><![CDATA[<p>Overleaf has become an essential tool for my academic work, allowing collaboration with my team and giving the ability for me to work on manuscripts on any computer without the need for extra tools. While the tooling has a large feature set and is reliable - even around major conference deadlines, it&apos;s important for me to ensure that work is backed up, safe and accessible.</p><p>In this post, I&apos;ll go through how I back up my work using the Git integration in Overleaf. Every change I make to a document is saved as a commit which I can use to recover my work from any point in time in its creation. Overleaf does have a GitHub sync which requires a manual push and will not automatically track changes. Also, this will not sync to providers of other Git repositories like GitLab. This method is both automatic and will work for any Git provider.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/05/Screenshot-2020-05-06-at-10.38.17.png" class="kg-image" alt loading="lazy"></figure><p>The key to this pipeline is running a continuous integration server that can poll my documents for changes and run some simple scripts to push them to GitHub at a regular interval. There are quite a few options here and I ended up using Jenkins for this. Jenkins supports build pipelines that can poll lots of different Git repositories for changes and only run the build pipeline on change.</p><h2 id="installing-jenkins">Installing Jenkins</h2><p>Installing Jenkins can take a bit of time depending on the operating system and environment you&apos;re running. 
But the simplest approach is to use the Docker container they release: all dependencies and the environment are wrapped up into a lightweight container that makes deployment easy.</p><figure class="kg-card kg-code-card"><pre><code># pull the stable LTS version of jenkins
docker pull jenkins/jenkins:lts

# on first run, make a directory for the Jenkins data and create a named docker container with port 8080 mapped and the data directory mounted
mkdir jenkins_data
docker run -d --name jenkins -p 8080:8080 -v $(pwd)/jenkins_data:/var/jenkins_home jenkins/jenkins:lts
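
# the initial admin password can also be read straight from the running container
docker exec jenkins cat /var/jenkins_home/secrets/initialAdminPassword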

# to restart the named instance
docker start jenkins</code></pre><figcaption>Getting started with Jenkins is just 2 or 3 commands</figcaption></figure><p>Once the container is running, you can navigate to http://localhost:8080 in your web browser to configure and set up the server. You&apos;ll be asked for the admin password which will be saved in the <code>jenkins_data</code> folder you mount. For my setup, I just installed the default plugins.</p><h3 id="setting-up-credentials">Setting up credentials</h3><p>Jenkins has a credential management system to safely store passwords and data for repositories - GitHub supports both HTTPS and SSH based authentication whereas Jenkins only supports HTTPS. </p><p>Both credentials for Overleaf and GitHub must be stored within Jenkins. Adding these is self-explanatory and the data can be directly entered into the web form. Credentials can be added by clicking the global scope and then following instructions to add the credentials.</p><p>If you plan on using Jenkins for multiple projects, I&apos;d recommend setting up scopes specific for each project.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/05/Screenshot-2020-05-06-at-11.18.14.png" class="kg-image" alt loading="lazy"></figure><h3 id="source-repositories">Source repositories</h3><p>In Overleaf, Git integration was a default feature in v1 that was ported to v2. Users of v1 have Git integration in v2 with free accounts. For new users on v2, you may need a paid account.</p><p>The Git repo link can be found in the document menu in the Overleaf editor. Make a note of this!</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/05/Screenshot-2020-05-06-at-11.28.56.png" class="kg-image" alt loading="lazy"></figure><h3 id="target-repository">Target repository</h3><p>I have a large number of papers on Overleaf that I sync to a single project on GitHub. 
I created a new repository to start with.</p><h3 id="putting-it-all-together">Putting it all together</h3><p>Now that all the credentials and repositories are ready, create a new &quot;item&quot; in Jenkins. We&apos;ll use a pipeline project.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/05/Screenshot-2020-05-06-at-11.36.21.png" class="kg-image" alt loading="lazy"></figure><h4 id="build-triggers">Build Triggers</h4><p>For build triggers, we want to poll the overleaf repositories at a regular interval. You can enter a time here in crontab style: <code>*/10 * * * *</code> will poll every 10 minutes.</p><h4 id="pipeline-script">Pipeline Script</h4><p>The pipeline script consists of 4 parts: configure git, pull &#xA0;the overleaf docs, merge the changes and push to GitHub. These all are nested in a <code>node { }</code> object.</p><p>Git is configured with 2 shell commands, just like you would do on your own computer:</p><pre><code>    sh(&quot;git config --global user.email &apos;my@email.com&apos;&quot;)
    sh(&quot;git config --global user.name &apos;James Thorne&apos;&quot;)</code></pre><p>Pulling data from Overleaf is easy with the following directives which will pull data from 2 different overleaf documents into the <code>thesis</code> and <code>paper</code> directories using the <code>overleaf</code> credentials we made earlier. The credentialsId has to correspond to what you named the credentials.</p><pre><code>    dir(&quot;thesis&quot;) {
        git (url: &quot;https://git.overleaf.com/12345&quot;,
        credentialsId: &quot;overleaf&quot;)
    }
    dir(&quot;paper&quot;) {
        git (url: &quot;https://git.overleaf.com/67890&quot;,
        credentialsId: &quot;overleaf&quot;)
    }</code></pre><p> To merge the data into our GitHub repo, we first need to check it out. Again, the credentials ID must match. We&apos;re checking out to the folder called <code>github</code></p><pre><code>    dir(&quot;github&quot;) {
        git (url: &quot;https://github.com/j6mes/overleaf_backup.git&quot;,
        credentialsId: &apos;github&apos;)
    }</code></pre><p>Then we&apos;ll remove the <code>.git</code> folders from the overleaf docs we&apos;re merging in to prevent corruption of the Git repository before copying them into the GitHub folder.</p><pre><code>    // Remove git information from thesis and paper overleaf git repos
    sh(&quot;rm -rf thesis/.git&quot;)
    sh(&quot;rm -rf paper/.git&quot;)
    
    // Remove thesis and paper from the github repo
    sh(&quot;rm -rf github/thesis&quot;)
    sh(&quot;rm -rf github/paper&quot;)
    
    // Copy in updated versions
    sh(&quot;cp -r thesis github/&quot;)
    sh(&quot;cp -r paper github/&quot;)</code></pre><p>Now we&apos;re ready to push everything back up to GitHub. Note that we push from the <code>github</code> directory, which is the only one that still contains a <code>.git</code> folder:</p><pre><code>    dir(&quot;github&quot;) {
        withCredentials([usernamePassword(credentialsId: &apos;github&apos;, passwordVariable: &apos;GIT_PASSWORD&apos;, usernameVariable: &apos;GIT_USERNAME&apos;)]) {
        sh(&quot;ls&quot;)

        sh(&quot;git add *&quot;)
        sh(&quot;git commit -am &apos;Auto commit from Overleaf&apos; || true&quot;)
        sh(&quot;git push https://${GIT_USERNAME}:${GIT_PASSWORD}@github.com/j6mes/overleaf_backup master || true&quot;)
        }
    }</code></pre><p>It&apos;s quite easy to add extra papers and Overleaf documents as you go on, and any changes will be uploaded to GitHub.</p><p>That&apos;s it - happy writing!</p>]]></content:encoded></item><item><title><![CDATA[Comparing the Zachman Framework, TOGAF and MoDAF]]></title><description><![CDATA[<p>Enterprise Architecture was introduced to address system complexity and poor business alignment. Typically:</p><ul><li>IT systems have become unmanageably complex or too costly to maintain</li><li>IT systems are hindering an organisation&apos;s ability to respond to market conditions in a timely and cost-effective manner</li><li>Mission-critical information is out of date or incorrect</li></ul>]]></description><link>https://jamesthorne.com/blog/comparing-zachman-togaf-modaf/</link><guid isPermaLink="false">61fe169522c5950001ff5927</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Mon, 27 Jan 2020 09:58:15 GMT</pubDate><content:encoded><![CDATA[<p>Enterprise Architecture was introduced to address system complexity and poor business alignment. Typically:</p><ul><li>IT systems have become unmanageably complex or too costly to maintain</li><li>IT systems are hindering an organisation&apos;s ability to respond to market conditions in a timely and cost-effective manner</li><li>Mission-critical information is out of date or incorrect</li></ul><h3 id="zachman-framework">Zachman Framework</h3><p>In 1987, Zachman introduced a &apos;framework for information systems architecture&apos; to address and manage the complexity of distributed systems. By looking at issues holistically and from different perspectives, Zachman&apos;s vision was to increase a business&apos;s agility and increase the value gained from implementing a system.</p><p>The Zachman Framework is better described as a taxonomy for organising architectural artefacts (e.g. 
design documents/models/specifications) by target audience and concern, rather than a framework in the usual sense of principles and practices for developing and maintaining the enterprise architecture repository.</p><p>This framework is best visualised as a grid of concerns for each stakeholder within the business: what (data), how (function), where (location), who (people), when (time) and why (motivation) are listed along the top. Levels of abstraction for each concern are listed down the side, describing a refinement from a plan to a functioning system: scope (contextual model for the planner), enterprise (conceptual model for the business owner), system model (logical model for the designer), technology model (physical model for the implementer), detailed representation, and finally the functioning system. In each cell, a model describes information at the level of abstraction suitable for the target audience: e.g. a system model in the &apos;who&apos; column may contain the human interface architecture, while the technology model in the same column contains a presentation architecture.</p><p>Each model in the framework is additive and complementary. Together, the collection of models forms a holistic view of the organisation that no single model can provide, given the limits on the levels of abstraction or expressiveness of any one type of model.</p><p>A generated artefact should reside in only one cell of the Zachman framework. 
If an artefact can be described by more than one cell of the taxonomy, questions should perhaps be raised about the quality or level of detail of the artefact.</p><p>Completing the Zachman framework requires 36 models to be generated, which together describe the system from the perspective of every stakeholder.</p><p>The Zachman grid improves the quality of the Enterprise Architecture by:</p><ul><li>Ensuring every stakeholder&apos;s perspective has been considered</li><li>Ensuring each artefact has a specific focus point</li><li>Ensuring traceability of each business requirement to its implementation</li></ul><p>The Zachman Framework, however, is not a complete solution. There are many issues that it does not address: it does not describe the process for creating the architecture, nor for evaluating the fitness for purpose of the proposed architecture.</p><h3 id="togaf-framework-the-open-group-architecture-framework">TOGAF Framework: The Open Group Architecture Framework</h3><p>The Open Group Architecture Framework divides enterprise architecture into four categories:</p><ul><li>Business Architecture - describes the processes that a business uses to meet its goals</li><li>Applications Architecture - describes how applications interact</li><li>Data Architecture - describes how enterprise data is stored and accessed</li><li>Technical Architecture - the hardware and software that support applications</li></ul><p>Naturally, the applications architecture can be designed to meet the requirements of the business architecture.</p><p>One of the most important parts of the TOGAF framework is the Architecture Development Method (ADM): the process that describes how the enterprise architecture can be captured and maintained.</p><p>Models in TOGAF range from generic to specific: TOGAF describes these as lying on an Enterprise Continuum. 
The ADM describes how generic models can be refined, adding appropriate specificity to meet the needs of the target stakeholder. The generic architectures are called &quot;Foundation Architectures&quot; and can be used by any enterprise. These are progressively refined into common system architectures, which may only be relevant to a subset of organisations. Industry architectures describe patterns relevant to a domain. Finally, organisational architectures are specific to a single organisation.</p><p>The TOGAF ADM describes a preliminary phase and a cycle of processes:</p><ul><li>Phase A: Architecture Vision</li><li>Phase B: Business Architecture</li><li>Phase C: Information Systems Architecture</li><li>Phase D: Technology Architecture</li><li>Phase E: Opportunities and Solutions</li><li>Phase F: Migration Planning</li><li>Phase G: Implementation Governance</li><li>Phase H: Architecture Change Management</li></ul><p>The preliminary phase ensures buy-in from the organisation&apos;s stakeholders and evaluates the organisation&apos;s suitability to create and digest the architecture being created. This may involve adapting TOGAF to meet an organisation&apos;s needs. TOGAF is non-prescriptive and purposefully allows steps or phases to be skipped, partially completed or altered.</p><h3 id="modaf">MoDAF</h3><p>MoDAF is an architecture framework developed by the British Ministry of Defence that captures information and allows it to be presented in standard viewpoints. 
The viewpoints are used by decision makers to understand and document complex issues.</p><p>The viewpoints are:</p><ul><li>Strategic Viewpoint (StV) - the desired business outcome and the capabilities required to achieve it</li><li>Operational Viewpoint (OV) - the processes, information and entities needed to fulfil the capabilities</li><li>Service Oriented Viewpoint (SOV) - services (units of work supplied by providers to consumers) that support the processes described in the OV</li><li>Systems Viewpoint (SV) - the implementation of the Operational and Service Oriented Viewpoints, defining the solution</li><li>Acquisition Viewpoint (AcV) - dependencies and timelines to deliver the solution</li><li>Technical Viewpoint (TV) - standards applied to the solution</li><li>All Viewpoint (AV) - definitions and a glossary for the architecture</li></ul><p>MoDAF describes a linear process, from establishing the intended use of the project to documenting the results, in a similar vein to the TOGAF architecture.</p>]]></content:encoded></item><item><title><![CDATA[Processing Wikipedia in a few hours on a single PC]]></title><description><![CDATA[<p>Wikipedia is a valuable resource for building Natural Language Processing systems for tasks such as <a href="https://github.com/facebookresearch/DrQA">Question Answering</a> and <a href="http://fever.ai">Fact Verification</a>. 
However, the sheer size of the resource can become an obstacle for new starters or those who think they haven&apos;t got the resources to crunch through the entire</p>]]></description><link>https://jamesthorne.com/blog/processing-wikipedia-in-a-couple-of-hours/</link><guid isPermaLink="false">61fe169522c5950001ff5924</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Thu, 23 Jan 2020 17:46:04 GMT</pubDate><content:encoded><![CDATA[<p>Wikipedia is a valuable resource for building Natural Language Processing systems for tasks such as <a href="https://github.com/facebookresearch/DrQA">Question Answering</a> and <a href="http://fever.ai">Fact Verification</a>. However, the sheer size of the resource can become an obstacle for new starters or those who think they haven&apos;t got the resources to crunch through the entire database. In my work on building resources for automated fact checking, I&apos;ve relied on a few strategies whereby Wikipedia can be processed on a single (8-core) desktop PC in about 4 hours &#x2013; quick enough to try a few prototypes in a working day.</p><p>The tl;dr is to use a pipeline of workers to minimize the time the PC spends idle, waiting for disk. Using the multi-stream files, the reader can be parallelized, and using network-based message queues, we can grow this beyond just a single PC.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/01/wiki_processing.png" class="kg-image" alt loading="lazy"></figure><h2 id="reading-the-wikipedia-data">Reading the Wikipedia data</h2><p>Wikipedia data is available in a variety of formats. For this tutorial, we&apos;ll be using the <code>.xml.bz2</code> dump, which can be downloaded from the Wikipedia archives over at <a href="https://dumps.wikimedia.org/">https://dumps.wikimedia.org/</a>. These are available as a multi-part download; however, we&apos;ll just use the large dump (around 18GB) with the multi-stream parts combined. 
The filename looks like <code>enwiki-[date]-pages-articles-multistream.xml.bz2</code>.</p><p>The Wikipedia file stream can be opened using the <code>BZ2File()</code> class in Python and passed to a custom <code>sax.ContentHandler</code> that extracts the information we want from the file. This was inspired by a Stack Overflow post; however, I can&apos;t find the original to give credit to.</p><p>The file stream contains an XML dump with nested <code>page</code> elements that contain <code>ns</code>, <code>title</code> and <code>text</code> elements, as shown in the following example.</p><pre><code class="language-xml">&lt;page&gt;
  &lt;ns&gt;0&lt;/ns&gt;
  &lt;title&gt;Page title&lt;/title&gt;
  &lt;text&gt;Wikipedia source for page text&lt;/text&gt;
&lt;/page&gt;</code></pre><p>The <code>ContentHandler</code> class has callbacks for <code>startElement</code> <code>endElement</code> and <code>characters</code> which are called when we encounter a new element inside the Wikipedia XML file. Each time we encounter a <code>title</code> and <code>text</code> tag, we should store this and when we encounter the closing <code>page</code> tag, we can call a callback which puts these onto the message queue.</p><pre><code class="language-python">import xml.sax
import logging

logger = logging.getLogger(__name__)

class WikiReader(xml.sax.ContentHandler):
    def __init__(self, ns_filter, callback):
        super().__init__()

        self.filter = ns_filter
        
        self.read_stack = []
        self.read_text = None
        self.read_title = None
        self.read_namespace = None
        
        self.status_count = 0
        self.callback = callback


    def startElement(self, tag_name, attributes):
        if tag_name == &quot;ns&quot;:
            self.read_namespace = None
            
        elif tag_name == &quot;page&quot;:
            self.read_text = None
            self.read_title = None
            
        elif tag_name == &quot;title&quot;:
            self.read_title = &quot;&quot;
            
        elif tag_name == &quot;text&quot;:
            self.read_text = &quot;&quot;
            
        else:
            return

        self.read_stack.append(tag_name)


    def endElement(self, tag_name):
        if self.read_stack and tag_name == self.read_stack[-1]:
            del self.read_stack[-1]

        if self.filter(self.read_namespace):
            if tag_name == &quot;page&quot; and self.read_text is not None:
                self.status_count += 1
                self.callback((self.read_title, self.read_text))
                

    def characters(self, content):
        if len(self.read_stack) == 0:
            return

        if self.read_stack[-1] == &quot;text&quot;:
            self.read_text += content

        if self.read_stack[-1] == &quot;title&quot;:
            self.read_title += content

        if self.read_stack[-1] == &quot;ns&quot;:
            self.read_namespace = int(content)
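
# Quick sanity check on a tiny, made-up XML fragment (illustrative only; a
# real dump is streamed through xml.sax.parse, as in the full script below):
def _sanity_check():
    pages = []
    reader = WikiReader(lambda ns: ns == 0, pages.append)
    xml.sax.parseString(
        b&quot;&lt;root&gt;&lt;page&gt;&lt;ns&gt;0&lt;/ns&gt;&lt;title&gt;T&lt;/title&gt;&quot;
        b&quot;&lt;text&gt;body&lt;/text&gt;&lt;/page&gt;&lt;/root&gt;&quot;, reader)
    assert pages == [(&quot;T&quot;, &quot;body&quot;)]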
  </code></pre><p>The reader is a stack-based parser that reads one page at a time from the dump and calls the <code>callback</code> with the title and text of each article. You could customize this for your own project. The other aspect is the lambda function <code>ns_filter</code>, which allows us to select which content we should use. The <code>article</code> namespace is <code>ns:0</code>. Other namespaces are listed here: <a href="https://en.wikipedia.org/wiki/Wikipedia:Namespace">https://en.wikipedia.org/wiki/Wikipedia:Namespace</a></p><p>The default namespace filter I use is <code>lambda ns: ns==0</code>.</p><h2 id="article-processing">Article Processing</h2><p>The article processing worker cleans the Wikipedia markup and performs processing such as sentence splitting or tokenization.</p><pre><code class="language-python">def process_article():
    while not (shutdown and aq.empty()):
    
        page_title, source = aq.get()
        text = clean(source)
    
        doc = nlp(text)

        sents = []
        for s in doc.sents:
            if len(sents) &gt; 0:
                # Fix some spacy sentence splitting errors by joining sentences if they don&apos;t end in a period
                if len(str(sents[-1]).strip()) and str(sents[-1]).strip()[-1] != &quot;.&quot;:
                    sents[-1] += str(s)
                    continue
            sents.append(str(s))

        out_text = &quot;\n&quot;.join(sents)
        fq.put(json.dumps({&quot;page&quot;: page_title, &quot;sentences&quot;:out_text}))
            </code></pre><p>Cleaning up Wikipedia source is a messy process. While tools such as <code>mwparserfromhell</code> help, they often miss bits which need to be cleaned. You can find some useful (but imperfect) functions in the FEVER library: <a href="https://github.com/awslabs/fever/blob/master/fever-annotations-platform/src/dataset/reader/cleaning.py">https://github.com/awslabs/fever/blob/master/fever-annotations-platform/src/dataset/reader/cleaning.py</a></p><h2 id="saving-to-disk">Saving to Disk</h2><p>Saving to disk from the file queue is simple:</p><pre><code class="language-python">def write_out():
    while not (shutdown and fq.empty()):
        line = fq.get()
        out_file.write(line+&quot;\n&quot;)</code></pre><h2 id="putting-it-all-together">Putting it all together</h2><p>Here we create a handful of worker processes that listen to the article queue <code>aq</code>. We create the Wiki reader and set the callback to <code>aq.put</code>. We also create a status thread that reports the number of items read and the depth of the queues every second.</p><pre><code class="language-python">import json
import multiprocessing
import os
import time
import xml.sax
from argparse import ArgumentParser
from bz2 import BZ2File
from multiprocessing import Process
from threading import Thread

def display():
    while True:
        print(&quot;Queue sizes: aq={0} fq={1}. Read: {2}&quot;.format(
            aq.qsize(), 
            fq.qsize(), 
            reader.status_count))
        time.sleep(1)

if __name__ == &quot;__main__&quot;:
    shutdown = False
    parser = ArgumentParser()
    parser.add_argument(&quot;wiki&quot;, help=&quot;wiki dump file .xml.bz2&quot;)
    parser.add_argument(&quot;out&quot;, help=&quot;final file .txt&quot;)
    args = parser.parse_args()
    
    manager = multiprocessing.Manager()
    fq = manager.Queue(maxsize=2000)
    aq = manager.Queue(maxsize=2000)
    
    wiki = BZ2File(args.wiki)
    out_file = open(os.path.join(args.out),&quot;w+&quot;)

    reader = WikiReader(lambda ns: ns == 0, aq.put)

    status = Thread(target=display, args=())
    status.start() 

    processes = []
    for _ in range(15):
        process = Process(target=process_article)
        process.start()
        processes.append(process)

    write_thread = Thread(target=write_out)
    write_thread.start()

    xml.sax.parse(wiki, reader)
    shutdown = True
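
    # Caveat (a sketch of an improvement, not part of the original script):
    # the worker processes receive a copy of the module-level `shutdown`
    # flag when they are spawned, so setting it above does not stop them by
    # itself. A shared signal is more reliable, e.g.:
    #
    #     shutdown_event = manager.Event()  # create alongside the queues
    #     ...
    #     shutdown_event.set()              # instead of shutdown = True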
</code></pre>]]></content:encoded></item><item><title><![CDATA[Adding a custom homepage on Ghost]]></title><description><![CDATA[<p>As part of my setup for Ghost, I wanted to have a custom editable home page as well as a listing of recent blog posts. While getting this set running was quite simple to do, there weren&apos;t many tutorials around explaining how to do this. The information needed</p>]]></description><link>https://jamesthorne.com/blog/adding-a-custom-homepage-on-ghost/</link><guid isPermaLink="false">61fe169522c5950001ff5923</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Wed, 15 Jan 2020 12:37:45 GMT</pubDate><content:encoded><![CDATA[<p>As part of my setup for Ghost, I wanted to have a custom editable home page as well as a listing of recent blog posts. While getting this set running was quite simple to do, there weren&apos;t many tutorials around explaining how to do this. The information needed is buried in some API documentation; so I thought I&apos;d share how to do this in an easy and succinct manner.</p><p>There are two parts to this solution: adding a custom template for the home page, and adding a custom route telling Ghost to use the custom template for the route <code>/</code>.</p><h2 id="getting-started">Getting started</h2><p>To build custom themes, it&apos;s helpful to have Yarn and Gulp. On Mac OSX this is as simple as a homebrew install:</p><pre><code class="language-bash">brew install node yarn
npm install gulp</code></pre><p>Full guidelines for installation are available here: <a href="https://yarnpkg.com/lang/en/docs/install/#mac-stable">https://yarnpkg.com/lang/en/docs/install</a> and <a href="https://www.npmjs.com/package/gulp">https://www.npmjs.com/package/gulp</a></p><h2 id="custom-homepage">Custom homepage</h2><p>The next step is to create a custom homepage as a Ghost page - this is fairly self-explanatory. It&apos;s important to make a note of the page URL that you assign in the menu. I set mine to <code>home</code>. The novelty of this tutorial versus <a href="https://ghost.org/tutorials/custom-home-page/">the tutorial on the Ghost website</a> is that this page remains editable as a Ghost page rather than being static HTML baked into the template file.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/01/Screenshot-2020-01-15-at-17.32.56.png" class="kg-image" alt loading="lazy"></figure><h2 id="custom-template">Custom template</h2><p>Following the <a href="https://ghost.org/tutorials/custom-home-page/">guidelines for a custom homepage</a>, you&apos;ll be making some edits to the Ghost template. Download the Ghost Casper template from your blog (on the design page) and make changes to it as follows:</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/01/Screenshot-2020-01-15-at-17.34.49.png" class="kg-image" alt loading="lazy"></figure><h3 id="add-a-home-hbs-file">Add a home.hbs file</h3><p>We&apos;ll make a <code>home.hbs</code> file that combines parts from <code>index.hbs</code> and <code>post.hbs</code>. I started by copying all of <code>index.hbs</code> and adding a few sections from <code>post.hbs</code>. We don&apos;t really need all the information from the post page. Here&apos;s what my template looks like:</p><pre><code>...
&lt;main id=&quot;site-main&quot; class=&quot;site-main outer&quot;&gt;
    &lt;!-- from post.hbs --&gt;
    &lt;div class=&quot;inner&quot;&gt;
        {{#page}}
        &lt;article class=&quot;post-full {{post_class}} no-image&quot;&gt;

            &lt;header class=&quot;post-full-header&quot;&gt;

                &lt;h1 class=&quot;post-full-title&quot;&gt;{{title}}&lt;/h1&gt;

            &lt;/header&gt;

            &lt;section class=&quot;post-full-content&quot;&gt;
                &lt;div class=&quot;post-content&quot;&gt;
                    {{content}}
                &lt;/div&gt;
            &lt;/section&gt;

        &lt;/article&gt;
        {{/page}}
    &lt;/div&gt;
    
    &lt;!-- get the list of posts --&gt;
    {{#get &quot;posts&quot;}}
    &lt;div class=&quot;inner posts&quot;&gt;
        &lt;div class=&quot;post-feed&quot;&gt;
            {{#foreach posts}}

                {{!-- The tag below includes the markup for each post - partials/post-card.hbs --}}
                {{&gt; &quot;post-card&quot;}}

            {{/foreach}}
        &lt;/div&gt;
    &lt;/div&gt;
    {{/get}}
&lt;/main&gt;
...</code></pre><p>You can tweak this to your liking. There are two parts I&apos;d like to point out: lines 5-21 are copied from <code>post.hbs</code> and display the post, and line 25 is a query to get a list of posts. You can edit this to only show featured posts or posts with certain tags/authors by following this documentation: <a href="https://ghost.org/docs/api/v3/handlebars-themes/helpers/get/">https://ghost.org/docs/api/v3/handlebars-themes/helpers/get/</a></p><h2 id="custom-routes">Custom routes</h2><p>Download the current <code>routes.yaml</code> file from your blog. At the time of writing, you can find it on the <code>labs</code> page:</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/01/Screenshot-2020-01-15-at-17.48.48.png" class="kg-image" alt loading="lazy"></figure><p>I added a custom route for <code>/</code> that uses the <code>home</code> template and loads data from the <code>home</code> page. Here&apos;s what my <code>routes.yaml</code> file looks like after editing:</p><pre><code class="language-yaml">routes:
  /:     
    data: page.home
    template: home
    
collections:
  /blog/:
    permalink: /blog/{slug}/
    template: index

taxonomies:
  tag: /tag/{slug}/
  author: /author/{slug}/</code></pre><h2 id="putting-it-all-together">Putting it all together</h2><p>The next steps were to upload the route, compile and upload the template, and add a static link to <code>/blog</code> in the menu.</p><p>To compile the template, go to the folder you&apos;ve been editing the Casper theme in and run the following command, which will output a compiled version of the theme in the <code>dist/</code> folder.</p><pre><code>yarn zip</code></pre><p>That&apos;s it!</p>]]></content:encoded></item><item><title><![CDATA[A fresh start for the blog]]></title><description><![CDATA[How (and why) I went about switching from WordPress to Ghost]]></description><link>https://jamesthorne.com/blog/a-fresh-start-for-the-blog/</link><guid isPermaLink="false">61fe169522c5950001ff5921</guid><dc:creator><![CDATA[James Thorne]]></dc:creator><pubDate>Wed, 15 Jan 2020 00:23:52 GMT</pubDate><media:content url="https://jamesthorne.com/content/images/2020/01/Screenshot-2020-01-15-at-00.35.55.png" medium="image"/><content:encoded><![CDATA[<img src="https://jamesthorne.com/content/images/2020/01/Screenshot-2020-01-15-at-00.35.55.png" alt="A fresh start for the blog"><p>I&apos;ve just switched my blog over from WordPress to <a href="https://ghost.org">Ghost</a>, a newer, more responsive, lighter platform. I&apos;ve opted to start again from scratch rather than copy over all the fragmented articles from when I first started blogging during my undergraduate years back in 2012.</p><p>The WordPress site wasn&apos;t as fast as I wanted it to be... site load times were averaging 750ms with Apache Bench. This was caused by a few factors, including the bloat of WordPress and the need to run a MySQL instance too. I had added a caching layer, but this didn&apos;t yield the improvements I wanted. 
</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/01/Screenshot-2020-01-15-at-00.04.35.png" class="kg-image" alt="A fresh start for the blog" loading="lazy"></figure><p>The new site running on Ghost is a lot more responsive. Median response times are down to 105ms. I&apos;ve also ditched the MySQL back-end and opted for an SQLite database instead. This is handy as I only have one server running the site - if I need to scale up, I could sync this from a master.</p><figure class="kg-card kg-image-card"><img src="https://jamesthorne.com/content/images/2020/01/Screenshot-2020-01-15-at-00.04.44.png" class="kg-image" alt="A fresh start for the blog" loading="lazy"></figure><p>To get Ghost running, I simply used a Docker image that sits behind a reverse proxy running nginx. This allows me to serve other content from port 80 and force all traffic to use HTTPS. You can check out the reverse-proxy Docker image I made here: <a href="https://github.com/j6mes/reverse-proxy">https://github.com/j6mes/reverse-proxy</a></p><p>The site is just a simple docker-compose script which I can use to boot both services, setting the CERTS_DIR and CONTENT_DIR environment variables to mount the Docker volume endpoints.</p><pre><code>version: &apos;3.1&apos;

services:
  ghost:
    image: ghost
    hostname: ghost
    restart: always
    environment:
      url: https://jamesthorne.co.uk
    volumes:
      - $CONTENT_DIR:/var/lib/ghost/content
  proxy:
    image: j6mes/reverse-proxy
    links:
      - ghost
    depends_on:
      - ghost
    ports:
      - &quot;80:80&quot;
      - &quot;443:443&quot;
    restart: &quot;always&quot;
    environment:
      TARGET_URL: &quot;ghost&quot;
      SERVER_NAME: &quot;jamesthorne.co.uk&quot;
      TARGET_PORT: 2368
      LISTEN_PORT: 80
      LISTEN_PORT_SSL: 443
      STATUS_IP: $STATUS_IP
    volumes:
      - $CERTS_DIR:/etc/nginx/certs</code></pre><p>So this is all I needed to get the site up and running on my server. I guess now I need to start writing some higher-quality blog posts.</p>]]></content:encoded></item></channel></rss>