Introduction to Digital Humanities in Galaxy

Overview
Questions:
  • How to get started in Galaxy for text-related tasks?

Objectives:
  • Log in to Galaxy

  • Upload files to the platform

  • Use tools within Galaxy

  • Clean and prepare text data

  • Compare two texts

  • Visualize your results

Time estimation: 1 hour
Level: Introductory
Published: Sep 11, 2025
Last modification: Sep 11, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Revision: 1

Loosely building on Richardson 2003, this tutorial compares two editions of the poem “The Sorrows of Yamba”.1 The first couple of steps derive from A short introduction to Galaxy.

“The Sorrows of Yamba” was published in 1795 and was among the most popular antislavery poems. However, the version published by Hannah More in the Cheap Repository Tracts series was not the only version of the poem that circulated, and More’s authorship is contested (Richardson 2003). We leave this debate to the experts; for us, the different versions of the poem offer a great opportunity to explore how digital tools can help us compare texts more quickly. We will do this in the following tutorial.

While Richardson compared the poems by hand, we use his example to introduce how Galaxy can help you with your text analysis. This tutorial covers the Galaxy basics, from logging in and uploading the texts to using your first tools. We will clean the two poem versions and check the texts from a distance by comparing their number of lines and characters, and visualizing both in a word cloud. Then, we take a closer look. For an easier comparison, we reformat both texts and compare them line by line and side by side. As the word cloud shows, “death” is a dominant theme in the first poem, so we extract all lines including “death” for further in-depth analysis. This helps us get a better idea of where the two versions differ and is applicable to many other texts you might want to compare.

Agenda

In this tutorial, we will cover:

  1. Get started in Galaxy
    1. Create an account on Galaxy
    2. Log in to Galaxy
    3. Name your current history
    4. Upload a file to Galaxy
  2. Clean your Texts
    1. Delete the hyperlink
    2. Remove punctuation
  3. Different ways to compare the texts
    1. Compare quantitatively
    2. Compare visually
    3. Replace spaces with line breaks to prepare side-by-side comparison
    4. Compare side-by-side with diff
  4. Extract specific sentences
    1. Breaking text into sentences
    2. Extract sentences containing ‘death’
  5. Conclusion

Get started in Galaxy

Create an account on Galaxy

To use Galaxy’s full potential, you must register and create an account. You can skip this step if you already have a Galaxy account.

  1. To create an account at any public Galaxy instance, choose your server from the available list of Galaxy Platforms.

    There are several UseGalaxy servers:

  2. Click on “Login or Register” in the masthead on the server.

    Login or Register on the top panel

  3. On the login page, find the Register here link and click on it.

  4. Fill in the registration form, then click on Create.

    Your account should now get created, but will remain inactive until you verify the email address you provided in the registration form.

    Banner warning about account with unverified email address

  5. Check for a Confirmation Email in the email you used for account creation.

    Missing? Check your Trash and Spam folders.

  6. Click on the Email confirmation link to fully activate your account.

    galaxy-info Delivery of the confirmation email is blocked by your email provider or you mistyped the email address in the registration form?

    Please do not register again, but follow the instructions to change the email address registered with your account! The confirmation email will be resent to your new address once you have changed it.

    Trouble logging in later? Account email addresses and public names are caSe-sensiTive. Check your activation email for formats.

Alternatively, you can log in using a single sign-on of your choice, for example, from IAM4NFDI on Galaxy Europe.

Screenshot of Galaxy Europe register window with the IAM4NFDI login button highlighted.

Log in to Galaxy

Hands On: Log in to Galaxy
  1. Open your favourite browser (Chrome, Safari, Edge or Firefox, not Internet Explorer!)
  2. Browse to your Galaxy instance, for example Galaxy Europe
  3. Log in with your credentials

Screenshot of Galaxy Europe with the register or login button highlighted.

Comment: Different Galaxy servers

This is an image of Galaxy Australia, located at usegalaxy.org.au

The particular Galaxy server you are using may look slightly different and have a different web address.

You can also find more possible Galaxy servers at the top of this tutorial in Available on these Galaxies

The Galaxy homepage is divided into four sections (panels):

  • The Activity Bar on the left: This is where you will navigate to the resources in Galaxy (Tools tool, Workflows galaxy-workflows-activity, Histories galaxy-history-storage-choice, etc.)
  • Currently active “Activity Panel” on the left: By default, the tool Tools activity will be active and its panel will be expanded
  • Viewing panel in the middle: The main area for context for your analysis
  • History of analysis and files on the right: Shows your “current” history; i.e.: Where any new files for your analysis will be stored

Screenshot of the Galaxy interface with aforementioned structure.

The first time you use Galaxy, your history panel is empty.

Name your current history

Your “History” is on the panel on the right. It is a record of the actions you have taken.

Hands On: Name history
  1. Go to the History panel (on the right)
  2. Click galaxy-pencil (Edit) next to the history name (which by default is “Unnamed history”)

    Screenshot of the galaxy interface with the history name being edited, it currently reads "Unnamed history", the default value. An input box is below it.

    Comment

    In some previous versions of Galaxy, you will need to click the history name to rename it as shown here: Screenshot of the galaxy interface with the history name being edited, it currently reads "Unnamed history", the default value.

  3. Type in a new name, for example, “My Analysis”
  4. Click Save
Comment: Renaming not an option?

If renaming does not work, you may not be logged in, so try logging in to Galaxy first. Anonymous users can only have one history, and they cannot rename it.

Upload a file to Galaxy

The “Activity Bar” can be seen on the left-most part of the interface.

Hands On: Upload a file
  1. At the top of the Activity Bar, click the galaxy-upload Upload activity

    upload data button shown in the galaxy interface.

    This brings up a box:

    the Galaxy upload dialogue, the 'regular' tab is active with a large textarea to paste subsequent URL.

  2. Click Paste/Fetch data
  3. Paste in the addresses of both files in the Zenodo folder:
    https://zenodo.org/records/17053220/files/SoY_Cheap_Repo_Source.txt
    https://zenodo.org/records/17053220/files/SoY_Univ_Mag_Source.txt
    
  4. Click Start
  5. Click Close

Option 2: On usegalaxy.eu, you can alternatively import the Zenodo files directly from a data library within Galaxy:

  1. At the top of the Activity Bar, click the galaxy-upload Upload activity
  2. Click on the bottom of the newly opened window on Choose from repository.
  3. Enter “Zenodo” in the search bar and click on the folder “Zenodo”.
  4. Enter Training material for Galaxy tutorial “Introduction to Digital Humanities in Galaxy” in the search bar and select the items.
  5. Click Select
  6. Click Start
  7. Click Close

Your uploaded files are now in your current history. Once a file has been uploaded to Galaxy, it turns green.

Comment

After this, you will see your first history item (called a “dataset”) in Galaxy’s right panel. It will go through the grey (preparing/queued) and yellow (running) states to become green (success).

The contents of the file will be displayed in the central Galaxy panel. If the dataset is large, you will see a warning message which explains that only the first megabyte is shown.

Hands On: View the content of the text files
  1. Click the galaxy-eye (eye) icon next to the dataset name, to look at the file content

    galaxy history view showing a single dataset mutant_r1.fastq. Display link is being hovered.

  2. Check the datatype - is it txt? Then you are all set. Otherwise, adapt the datatype.

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, click galaxy-chart-select-data Datatypes tab on the top
    • In the galaxy-chart-select-data Assign Datatype, select datatypes from “New Type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  3. Add a tag to each dataset corresponding to the file’s origin.

    • One saying #cheap for the file from the cheap repository (SoY_Cheap_Repo_Source.txt)
    • The other one #universal for the second one (SoY_Univ_Mag_Source.txt)
    • Don’t forget the hashtags

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags galaxy-tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.

What are those files?

You can see two text files; they are two versions of the poem “The Sorrows of Yamba”. The file “SoY_Cheap_Repo_Source.txt” contains the version that was published in the Cheap Repository. The file “SoY_Univ_Mag_Source.txt” contains another version of the poem, first published in the Universal Magazine in 1797.

Both files start with “Text adapted from:” and a different hyperlink. The second paragraph for both texts begins with “the sorrows of yamba,” but the files continue differently. While one gives the year, the other is immediately followed by more text. Both texts are already pre-cleaned and are entirely in lower case, but still contain punctuation.

It is obvious that the texts have similarities, but they are not identical. Now comes the fun part: Using Galaxy to compare your files. To do that, we first need to clean both files.

Clean your Texts

When looking at the two datasets, you will notice they still contain the hyperlink from their source. As this is metadata and not the text we want to compare, we delete it at the beginning of both files.

  1. Click on Tools tool in the left panel
  2. Search for Remove beginning and pass the following parameters:
    • “Remove first”: 1 (lines)
    • param-file “from”: 1: SoY_Cheap_Repo_Source.txt
  3. Click on Run Tool workflow-run

    Comment: What does this tool do?

    Remove beginning deletes a selected number of lines from your file. In this case, removing the first line is enough.
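
For readers who want to see the idea outside Galaxy, the following is a minimal Python sketch of what removing the first line amounts to. The function name `remove_beginning` and the sample string are made up for illustration; this is not how the Galaxy tool itself is implemented.

```python
def remove_beginning(text: str, n_lines: int = 1) -> str:
    """Return the text without its first n_lines lines."""
    # keepends=True preserves the original line breaks of the remaining lines
    lines = text.splitlines(keepends=True)
    return "".join(lines[n_lines:])


# Made-up sample that only mimics the layout of the uploaded files
sample = (
    "text adapted from: https://example.org/source\n"
    "the sorrows of yamba, followed by the poem text\n"
)

print(remove_beginning(sample, 1))
```

Only the metadata line is dropped; everything after it is returned unchanged.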

When the job is finished and appears green in your history, click on its name.

Question
  1. How many lines does the file now contain?
  2. How does this differ from the original file you uploaded?
Solution
  1. The file now contains only one line.
  2. The originally uploaded text contained two lines. You removed one with this step.

As a result, only the poem’s text remains, while the source was removed for text one. Galaxy names each output after the tool that created it. While this can be helpful, we change the name to a clearer filename.

Hands On: Rename the output
  1. Change the name galaxy-pencil of the output of this tool, which removed the beginning of SoY_Cheap_Repo_Source.txt
    • Rename it to SoY_Cheap_Repo.txt
    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button

We also use this tool on the second file.

  1. Run workflow-run Remove beginning with the following parameters:
    • param-file “from”: 2: SoY_Univ_Mag_Source.txt
    1. Expand one of the output datasets of the tool (by clicking on it)
    2. Click re-run galaxy-refresh the tool

    This is useful if you want to run the tool again but with slightly different parameters, or if you just want to check which parameter setting you used.

Once it is finished, rename this file to SoY_Univ_Mag.txt.

Click on the finished dataset that just appeared in your history. Check that it starts with the poem text and that the hyperlink is removed. To quickly see which version of the poem we are working with, we gave both datasets clearer names and tagged them by origin. Because the tags start with a hashtag, they propagate: all further outputs derived from a dataset carry the same tag, making it much easier to identify which text we are currently working with.

Depending on how detailed a comparison you want, we suggest further unifying the texts. In the next step, we remove all the punctuation with one command.

Remove punctuation

Regular Expressions (RegEx) allow you to search for particular patterns in your text. They can be a massive help if you want to extract or remove such patterns with minimal work. In our two poems, the punctuation is not unified; therefore, we remove it from both using RegEx. If comparing the punctuation of texts is also relevant to you, you can skip this step. Make sure to select the text version from the Cheap Repository from which we removed the hyperlink earlier.

Hands On: Remove Punctuation in Poem One
  1. Run workflow-run Replace Text - in entire line ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to process”: SoY_Cheap_Repo.txt
    • In “Replacement”:
      • param-repeat “Insert Replacement”
        • “Find pattern”: [[:punct:]]
        • “Replace with”: (leave this empty)

    Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.

    Finding

    Below are just a few examples of basic expressions:

    Regular expression    Matches
    abc                   an occurrence of abc within your data
    (abc|def)             abc or def
    [abc]                 a single character which is either a, b, or c
    [^abc]                a character that is NOT a, b, nor c
    [a-z]                 any lowercase letter
    [a-zA-Z]              any letter (upper or lower case)
    [0-9]                 numbers 0-9
    \d                    any digit (same as [0-9])
    \D                    any non-digit character
    \w                    any alphanumeric character
    \W                    any non-alphanumeric character
    \s                    any whitespace
    \S                    any non-whitespace character
    .                     any character
    \.                    literal . (period)
    {x,y}                 between x and y repetitions
    ^                     the beginning of the line
    $                     the end of the line

    Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.

    Examples

    Regular expression    Matches
    \d{4}                 4 digits (e.g. a year)
    chr\d{1,2}            chr followed by 1 or 2 digits
    .*abc$                anything with abc at the end of the line
    ^$                    empty line
    ^>.*                  Line starting with > (e.g. Fasta header)
    ^[^>].*               Line not starting with > (e.g. Fasta sequence)

    Replacing

    Sometimes you need to capture the exact value you matched on in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2 etc. for the first and second captured values. If you want to refer to the whole match, use &.

    Regular expression      Input           Captures
    chr(\d{1,2})            chr14           \1 = 14
    (\d{2}) July (\d{4})    24 July 1984    \1 = 24, \2 = 1984
    An expression like s/find/replacement/g is a replacement expression: it substitutes (s) every occurrence of find with replacement. It does this globally (g), which means it doesn’t stop after the first match.

    Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.

    You can also use replacement modifiers such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.

    Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.

    There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.

    Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.

    Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.

    Tip: Cyrilex is a visual regular expression tester.

  2. Rename your output file (once it is green) to SoY_Cheap_Repo_cleaned.txt

    • Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button
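
If you would like to experiment with capture groups outside Galaxy, here is a small Python sketch that reuses the two examples from the overview above. It uses Python's standard `re` module, whose syntax differs slightly from the sed-style notation: the whole match is written `\g<0>` rather than `&`, and the `\U` and `\L` modifiers are not available. The variable names are chosen for illustration only.

```python
import re

# The capture group (\d{1,2}) is available as \1 in the replacement
renamed = re.sub(r"chr(\d{1,2})", r"CHR\1", "chr14 and chr2")
print(renamed)

# Two capture groups can be reordered in the replacement
iso_date = re.sub(r"(\d{2}) July (\d{4})", r"\2-07-\1", "24 July 1984")
print(iso_date)
```

As with the Galaxy tool, find pattern and replacement are given separately, so no s/../../g wrapper is needed.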

And we repeat the same for the second text. Remember to use the re-run galaxy-refresh button if you want to save some time.

Also in text two, we search for the pattern [[:punct:]] and omit a replacement, meaning that all punctuation marks will be deleted. Make sure to select the text version from the Universal Magazine that we earlier removed the hyperlink from.

Hands On: Remove Punctuation in Poem Two
  1. Run workflow-run Replace Text - in entire line ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to process”: SoY_Univ_Mag.txt
    • In “Replacement”:
      • param-repeat “Insert Replacement”
        • “Find pattern”: [[:punct:]]
        • “Replace with”: (leave this empty)
  2. Rename galaxy-pencil the output file to SoY_Univ_Mag.txt_cleaned.txt
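
The punctuation removal can also be sketched in Python. Note one assumption: Python's `re` module does not understand the POSIX class [[:punct:]], so this sketch builds an equivalent character class from `string.punctuation`, which covers ASCII punctuation only (typographic quotes, for example, would not be matched). The sample sentence is invented.

```python
import re
import string

# Character class containing every ASCII punctuation mark, safely escaped
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def remove_punctuation(text: str) -> str:
    """Delete all ASCII punctuation by replacing it with nothing."""
    return PUNCT.sub("", text)


sample = "hello, world! is this - perhaps - a test?"
print(remove_punctuation(sample))
```

Because the replacement is empty, a punctuation mark that stood between two spaces leaves a doubled space behind. This is worth keeping in mind for any later step that splits the text on whitespace.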

To get an idea of how the two cleaned texts compare, we check out their metadata.

Different ways to compare the texts

Compare quantitatively

The tool Line/Word/Character count allows us to get a quick overview of a text. We want to see if the cleaned versions are different from each other.

Hands On: Count the Characters of Poem One
  1. Run workflow-run Line/Word/Character count with the following parameters:
    • param-file “Text file”: SoY_Cheap_Repo_cleaned.txt
  2. Rename galaxy-pencil the output of this step to Line/Word/Character count Cheap Repo.

Once the dataset has finished running and appears green, click on the eye galaxy-eye symbol. You can see how many lines, words and characters the text consists of. And again, we run workflow-run the tool on the second poem.

Hands On: Count the Characters of Poem Two
  1. Run workflow-run Line/Word/Character count with the following parameters:
    • param-file “Text file”: SoY_Univ_Mag.txt_cleaned.txt
  2. Rename galaxy-pencil this output to Line/Word/Character count Universal for easier distinction.
Question: How do the texts compare?
  1. How many lines do the poems have?
  2. Which of the two texts contains more words, and how many?
Solution
  1. Both texts consist of only two lines.
  2. The poem version from the cheap repository is longer, containing 1139 words, more than double the amount of the second poem.
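
To make clear what is being counted, here is a rough Python sketch of a line, word and character count. The function name `count_text` is hypothetical, and the exact numbers may differ slightly from the Galaxy tool, for example in how a final line without a line break is treated.

```python
def count_text(text: str) -> dict:
    """Count lines, whitespace-separated words, and characters."""
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "characters": len(text),  # includes spaces and line breaks
    }


sample = "one two three\nfour five\n"
print(count_text(sample))
```

Words are simply whatever is separated by whitespace, which is why cleaning the text first matters for the result.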

The differences between the two texts are quantifiable, but do these also affect the content?

Compare visually

A picture says more than 1000 words! Accordingly, we want to get closer to the actual content of both texts. Particularly for larger corpora, a word cloud can be a nice way to get a first idea of what a text is about. Make sure not to use the latest outputs this time, as they contain only metadata and not the texts we want to compare. Select the cleaned poem versions for a more meaningful word cloud output.

Hands On: Visualize the Content of Poem One
  1. Run workflow-run Generate a word cloud ( Galaxy version 1.9.4+galaxy2) with the following parameters:
    • param-file “Input file”: SoY_Cheap_Repo_cleaned.txt (output of Replace Text tool)
    • “Do you want to select a special font?”: Use the default DroidSansMono font
    • “Color option”: Color
    • “Scaling of words by frequency (0 - 1)”: 0.8
    Comment: Adapting the Word Cloud

    The word cloud has many different features. You can upload a stop word list that should be excluded from the visualization, or play around with other parameters like the text size. Rerun dataset-rerun the tool with some changed parameters and see what happens.

We create the word cloud for the second poem in the same way. We suggest rerunning dataset-rerun the tool with the second text and the same parameters you used for the first word cloud, so that both images are comparable.

Hands On: Visualize the Content of Poem Two
  1. Run workflow-run Generate a word cloud ( Galaxy version 1.9.4+galaxy2) with the following parameters:
    • param-file “Input file”: SoY_Univ_Mag.txt_cleaned.txt (output of Replace Text tool)
    • “Do you want to select a special font?”: Use the default DroidSansMono font
    • “Color option”: Color
    • “Scaling of words by frequency (0 - 1)”: 0.8
    Comment: Uniqueness of the Word Cloud

    The word cloud from this tool looks a little different each time you run it. The layout may vary even when you are redoing it with the exact same texts and inputs.

Comparing items from your history is easiest when enabling the window manager and seeing both images side by side.

If you would like to view two or more datasets at once, you can use the Window Manager feature in Galaxy:

  1. Click on the Window Manager icon galaxy-scratchbook on the top menu bar.
    • You should see a little checkmark on the icon now
  2. View galaxy-eye a dataset by clicking on the eye icon galaxy-eye to view the output
    • You should see the output in a window overlaid over Galaxy
    • You can resize this window by dragging the bottom-right corner
  3. Click outside the file to exit the Window Manager
  4. View galaxy-eye a second dataset from your history
    • You should now see a second window with the new dataset
    • This makes it easier to compare the two outputs
  5. Repeat this for as many files as you would like to compare
  6. You can turn off the Window Manager galaxy-scratchbook by clicking on the icon again

Question
  1. What is the most prominent word in each of the clouds?
  2. How do the Word Clouds for Poem One and Poem Two compare?
Solution
  1. The most prominent word in the word cloud created from the cheap repository is “yamba”, while the one from the universal text is “death”.
  2. The word cloud from the cheap repository has four words that appear most prominent and are much bigger - and therefore more frequent in the text. They are “yamba”, “now”, “death” and “ye”. The most prominent words in the universal text are “death”, “yamba” and “africs”. They appear a bit smaller than the words from the cheap repository, suggesting a lower frequency.
Figure 1: Word Cloud of Cheap Repository Version

Figure 2: Word Cloud of Universal Text Version

You can disable the Window Manager again by clicking on its icon. Datasets will then open in the middle panel again when you click on their eye galaxy-eye symbol.

The visualisation suggests that the texts differ not only in their metrics, which we checked with the line and character count, but also in their messages. The cheap repository text addresses the reader with multiple mentions of “ye” (you), which is rare in the second poem. In the universal poem, “death” is more central than “yamba”, whereas it is the other way around in the cheap repository text.

Given the short length of these texts and the fact that there are only two of them, this is, of course, something you can find out by reading both. However, this distant reading approach can give you important preliminary insights to guide your close reading, particularly with bigger corpora.
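
A word cloud scales words by how often they occur, so the underlying numbers can be approximated with a plain frequency count. The sketch below uses Python's `collections.Counter`; the function name `top_words`, the optional stop word set and the sample sentence are assumptions for illustration and do not reproduce the internals of the Galaxy word cloud tool.

```python
from collections import Counter


def top_words(text: str, n: int = 3, stop_words: frozenset = frozenset()) -> list:
    """Return the n most frequent words, ignoring the given stop words."""
    words = [word for word in text.split() if word not in stop_words]
    return Counter(words).most_common(n)


sample = "the sea the shore the sea a ship"

print(top_words(sample, n=2))
print(top_words(sample, n=1, stop_words=frozenset({"the", "a"})))
```

Running this on both cleaned poems would give you frequencies you can cite, complementing the visual impression from the images.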

Of course, the word cloud insights are just a first glance and do not allow a proper analysis; we need to compare both texts properly. But what is a good way to do this? We suggest comparing them side by side and line by line. For that, we adapt the layout once more.

Replace spaces with line breaks to prepare side-by-side comparison

We used the Replace Text tool before. This time we are not deleting anything, as we did with the punctuation, but replacing some characters. To get a convenient layout that shows one word per line, we replace the spaces (\s) with line breaks (\n). That way, each word is displayed on its own line, which prepares the detailed comparison in the next step.

Regular expressions help again by replacing all spaces with line breaks in just one command.

Hands On: Changing Layout of Poem One
  1. Run workflow-run Replace Text - in entire line ( Galaxy version 9.5+galaxy0) with the following parameters:
    • param-file “File to process”: SoY_Cheap_Repo_cleaned.txt
    • In “Replacement”:
      • param-repeat “Insert Replacement”
        • “Find pattern”: \s
        • “Replace with:”: \n
    Comment: How do I understand the RegEx commands?

    Don’t worry if you have never used regular expressions. Several websites help you find out what patterns to detect and how to catch the passages you need. For now, you can just add the symbols that stand for the space (\s) and the line break (\n).

  2. Rename galaxy-pencil this text to SoY_Cheap_Repo_word_per_line.txt.

Once the dataset turns green, click on its eye galaxy-eye icon in the history to see that it now contains one word per line. To match this, we repeat the step with the same parameters for the second poem.
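
The same replacement can be tried in Python as a minimal sketch; the function name `one_word_per_line` is made up for this example. Keep in mind that \s matches every whitespace character, not only the space.

```python
import re


def one_word_per_line(text: str) -> str:
    """Replace every whitespace character with a line break."""
    return re.sub(r"\s", "\n", text)


sample = "one two three"
result = one_word_per_line(sample)
print(result)

# With single spaces, the number of lines equals the number of words
print(len(result.splitlines()) == len(sample.split()))
```

If a text contained two spaces in a row, this approach would produce an empty line between the two words, so the line count would then be higher than the word count.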

Hands On: Changing Layout of Poem Two
  1. Run workflow-run Replace Text - in entire line ( Galaxy version 9.5+galaxy0) with the following parameters:
    • param-file “File to process”: SoY_Univ_Mag.txt_cleaned.txt
    • In “Replacement”:
      • param-repeat “Insert Replacement”
        • “Find pattern”: \s
        • “Replace with:”: \n
  2. Rename galaxy-pencil this text to SoY_Univ_Mag_word_per_line.txt.
Question
  1. How many lines long are the poems now?
Solution
  1. When you click on the names of the two new datasets, you see that the Universal Magazine version is now 539 lines long and the Cheap Repository version 1139. The number of lines now matches the word count we obtained with the tool Line/Word/Character count.

Now, both poems show one word per line, which is the perfect setup to compare them side by side. Use a tool called diff to visualise this. To get the same order as the tutorial, make sure to select the version from the Cheap Repository as the first input file and the one from the Universal Magazine as the second input file.

Compare side-by-side with diff

Hands On: Compare the Poems
  1. Run workflow-run diff ( Galaxy version 3.10+galaxy1) with the following parameters:
    • param-file “First input file”: SoY_Cheap_Repo_word_per_line.txt
    • param-file “Second input file”: SoY_Univ_Mag_word_per_line.txt
    • “Choose a report format”: Generates an HTML report to visualize the differences
    • “Choose report output format”: Side by side
    Comment: Different Report Formats

    The diff tool allows you to create different outputs, depending on your goal. In this case, the HTML report contains colours to highlight the changes between both texts, making it easy for researchers to identify them quickly. If you want to extract information automatically, the option text file, side by side could also be helpful.

We get two new files as a result: the HTML report and the raw output it is based on, in txt format.

Question
  1. What is the first difference between the two texts visualized in the HTML report?
Solution
  1. Lines 6-40 of the cheap poem are marked in green. They are not part of the universal poem. The couple of lines before and after are identical.

In the HTML report, you can quickly identify deletions (in red) and additions (in green) between both texts. You can also see smaller details that are easily missed when comparing manually. Lines 63/64 and 28/29, respectively, show that changes within a single word (prisoner / prisner) are also detected. You can furthermore see how the perspective changed between the poems. While lines 359-361 in the cheap repository text state “they sell us”, the other text states “they sell them” (ll. 298-300), suggesting the reader is (no longer) among the group which is sold. You can go through the report and detect further changes in language and length.
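
A comparable word-by-word comparison can be sketched with Python's standard `difflib` module. This is an illustration of the general idea and not the implementation behind the Galaxy diff tool. The two short word lists are constructed from the differences mentioned above (prisoner / prisner, us / them) and are not actual excerpts of the files; the function name `differences` is hypothetical.

```python
import difflib

# One word per list item, mirroring the one-word-per-line layout
cheap = ["the", "prisoner", "they", "sell", "us"]
universal = ["the", "prisner", "they", "sell", "them"]


def differences(first: list, second: list):
    """Yield pairs of word runs that differ between the two lists."""
    matcher = difflib.SequenceMatcher(a=first, b=second, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield first[i1:i2], second[j1:j2]


for old, new in differences(cheap, universal):
    print(old, "->", new)

# difflib can also produce a side-by-side HTML table of both inputs
html_report = difflib.HtmlDiff().make_file(
    cheap, universal, fromdesc="cheap", todesc="universal"
)
```

The string in `html_report` could be written to a file and opened in a browser to get a coloured side-by-side view similar in spirit to the report described above.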

Seeing this, you might want to go into detail with the respective themes once more. As “death” was central in both texts, we will extract sentences containing this word so you can analyze them more closely. The cleaned texts without punctuation and one word per line are not the easiest form for this. Instead, we use an earlier version from our history.

Extract specific sentences

Breaking text into sentences

We return to regular expressions a third time, but this time with a different tool that offers further functionality. We use it to divide the text into more lines, which makes it easier to extract those containing the word “death.” Here, punctuation is a helpful stop point. We treat every full stop as the end of a sentence, which will not be perfectly accurate but is sufficient for this case, and add a line break after each full stop to get complete sentences. Of course, you could spend more time on this and make it neater. Make sure not to use the latest outputs as input, but the poems without the hyperlink that still include punctuation. The step will not work if the text no longer contains full stops.

Hands On: Rearrange Poem One
  1. Run workflow-run Replace parts of text ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to process”: SoY_Cheap_Repo.txt
    • In “Find and Replace”:
      • param-repeat “Insert Find and Replace”
        • “Find pattern”: \.
        • “Replace with”: \.\n
        • “Find-Pattern is a regular expression”: Yes
        • “Replace all occurences of the pattern”: Yes
        • “Find and Replace text in”: entire line
    Comment: What do those inputs mean?

    A full stop (.) has its own meaning in regular expressions: it matches any single character. To show that we mean a literal full stop rather than any character, we need to escape it in RegEx by writing \. instead of . in the pattern. We want to add a line break afterwards, which we already learned is written as \n. The replacement pattern, therefore, is \.\n.

  2. Rename galaxy-pencil your resulting file to SoY_Cheap_Repo_sent_per_line.txt.

Remember to repeat this step for the second poem once you have finished.

Hands On: Rearrange Poem Two
  1. Run workflow-run Replace parts of text ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “File to process”: SoY_Univ_Mag.txt
    • In “Find and Replace”:
      • param-repeat “Insert Find and Replace”
        • “Find pattern”: \.
        • “Replace with”: \.\n
        • “Find-Pattern is a regular expression”: Yes
        • “Replace all occurences of the pattern”: Yes
        • “Find and Replace text in”: entire line
  2. Rename galaxy-pencil your resulting file to SoY_Univ_Mag_sent_per_line.txt for easier distinction.
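If you are curious what this find-and-replace does under the hood, the following Python sketch mimics it with the standard library module re. It is an illustration, not the tool's actual implementation. The function name split_at_full_stops and the sample text are made up for this example. Note that Python's replacement string is written as ".\n" here, since an escaped full stop is only needed in the search pattern.

```python
import re


def split_at_full_stops(text):
    """Insert a line break after every literal full stop."""
    # r"\." matches only a literal full stop.
    return re.sub(r"\.", ".\n", text)


# Made-up sample text, not taken from the poems.
sample = "First sentence. Second sentence. No stop at the end"
lines = split_at_full_stops(sample).split("\n")

# Why escaping matters: an unescaped "." matches every character.
unescaped = re.sub(r".", "X", "a.b")
escaped = re.sub(r"\.", "X", "a.b")

for line in lines:
    print(repr(line))
```

The lines after the first one start with a space, because the space that followed each full stop is left untouched. This is one of the places where you could make the result neater.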

As a result, you get two files, each split at full stops. How can you now extract the sentences that are relevant to you?

Extract sentences containing ‘death’

Use Search in textfiles ( Galaxy version 9.5+galaxy2) to select all lines containing the word “death”.

Hands On: Extract particular sentences
  1. Run workflow-run Search in textfiles ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Select lines from”: SoY_Cheap_Repo_sent_per_line.txt
    • “Regular Expression”: death
    Comment: Further Functionalities

    You can see that the tool has many parameters you can tweak. The ones not mentioned here are kept at their defaults, such as Match and Perl, the latter being the kind of RegEx applied. You could also select all lines that do not contain “death” by choosing Do not match, or extract lines before or after each matching line.

  2. Rename galaxy-pencil your output to SoY_Cheap_Repo_death.txt.

And for the last time, we redo this step for the second poem.

Hands On: Extract particular sentences from Poem Two
  1. Run workflow-run Search in textfiles ( Galaxy version 9.5+galaxy2) with the following parameters:
    • param-file “Select lines from”: SoY_Univ_Mag_sent_per_line.txt
    • “Regular Expression”: death
  2. Rename galaxy-pencil your output to SoY_Univ_Mag_death.txt.
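The selection step can be sketched in a few lines of Python as well. This is a simplified illustration and not the tool's implementation. The function name select_lines, the invert parameter and the sample lines are assumptions made for this example. The sketch is case-sensitive, so check the tool form if you need to know how Galaxy handles capitalization.

```python
import re


def select_lines(lines, pattern, invert=False):
    """Keep lines that match the pattern, or those that do not if invert is True."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]


# Made-up sample lines, not taken from the poems.
sentences = [
    "a sentence about death",
    "a sentence about something else",
    "Death with a capital letter",
]

matching = select_lines(sentences, "death")
not_matching = select_lines(sentences, "death", invert=True)

print(matching)
print(not_matching)
```

Setting invert=True corresponds to the idea behind Do not match: the same pattern is used, but the lines without a hit are kept.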

When you enable the window manager in the top bar, you can click on the eye galaxy-eye symbols of your last two outputs and view them side by side in two separate windows. Six and seven lines of the poems contain the term, respectively. You could now analyze them in detail to see where they differ. While the first lines are nearly identical, the last ones differ completely between the two versions of the poem, an intriguing starting point for further analysis. No wonder the poems and their many editions have sparked the interest of many researchers.

If you only analyze these two poems, you might find it easier to do these steps manually. However, if you create a workflow from this analysis, you can reproduce the process with only a few clicks, saving you considerable work.

Learn how to extract a workflow from the above analysis.

Alternatively, you can adapt the above analysis, making it more complex and extracting further differences between the poems automatically. For inspiration, check out the advanced tutorial on Text-Mining.

Conclusion

Congratulations! You just finished your first analysis with Galaxy, well done! The tutorial covered the basic setup of Galaxy and how to register, log in and upload your material. You are now familiar with Galaxy terms such as history, dataset and tool. We used several tools, learned how to rerun them and how to view their outputs in different ways. We used Regular Expressions in several ways to rearrange and clean the text. We also reshaped the text to compare it with the diff tool. In the end, we extracted notable sentences for further close reading. The workflow created from this history would look as follows:

Screenshot of Workflow extracted from the Tutorial Introduction to DH.

With all this knowledge in mind, you can now continue with one of our other tutorials or experiment with your own input. Enjoy!

  1. Thanks to Lilli Fortmeier for suggesting this use case!