Extracting Regulatory Citations from Textual Content: A Comparison of Regular Expressions, spaCy, and a Combination of Both Approaches
Regulatory citations play a crucial role in legal and compliance work because they identify the specific statutes or regulations that govern a given action. Extracting them from text is non-trivial: citations appear in many formats and are often embedded in prose in ways that make them hard to identify automatically. In this blog post, we will explore three approaches to extracting regulatory citations from the text of an enforcement action: regular expressions, the spaCy NLP library, and a combination of both.
The first approach uses a regular expression written for two common citation styles, U.S.C. sections and C.F.R. parts:
import re

text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers."

# Regular expression pattern for regulatory citations such as
# "15 U.S.C. § 1693" or "12 C.F.R. pt. 1005"
pattern = re.compile(r"\b\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)\s(?:§{1,2}|pt\.)\s\d+")

# Extract regulatory citations from the text
regulatory_citations = pattern.findall(text)

# Print the extracted regulatory citations
print("Regulatory Citations:", regulatory_citations)
Output: Regulatory Citations: ['15 U.S.C. § 1693']
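Because citations come in several formats, it can help to capture the parts of each citation rather than only the matched string. The following is a minimal sketch of that idea using named groups; the `Citation` type, the `parse_citations` helper, and the second half of the sample sentence are illustrative assumptions, and the pattern covers only the U.S.C. section and C.F.R. part styles discussed here.

```python
import re
from typing import List, NamedTuple


class Citation(NamedTuple):
    """Hypothetical container for the parts of one citation."""
    title: str   # e.g. "15"
    code: str    # "U.S.C." or "C.F.R."
    unit: str    # "§", "§§", or "pt."
    number: str  # e.g. "1693" or "1005.11"


# Named groups make each part of the citation addressable by name.
CITATION_RE = re.compile(
    r"\b(?P<title>\d{1,2})\s"
    r"(?P<code>U\.S\.C\.|C\.F\.R\.)\s"
    r"(?P<unit>§{1,2}|pt\.)\s"
    r"(?P<number>\d+(?:\.\d+)?)"
)


def parse_citations(text: str) -> List[Citation]:
    """Return every citation found in text, in order of appearance."""
    return [Citation(**m.groupdict()) for m in CITATION_RE.finditer(text)]


# Illustrative sample sentence (an assumption, not taken from the source document).
sample = (
    "The EFTA, 15 U.S.C. § 1693 et seq., is implemented by "
    "Regulation E, 12 C.F.R. pt. 1005."
)

for citation in parse_citations(sample):
    print(citation.title, citation.code, citation.unit, citation.number)
```

Structured fields make it easier to normalize or deduplicate citations later, at the cost of a pattern that has to be extended for every new citation style.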
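The second approach relies on spaCy's named entity recognizer and keeps the entities it labels as LAW:
import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., is a federal law that governs electronic fund transfers."

# Process the text with spaCy
doc = nlp(text)

# Extract regulatory citations from the text
regulatory_citations = [ent.text for ent in doc.ents if ent.label_ == "LAW"]

# Print the extracted regulatory citations
print("Regulatory Citations:", regulatory_citations)
Output (this list includes citations that are not in the one-sentence sample above, so it presumably reflects a longer passage of the enforcement action; the exact entities also vary with the model version): ['Electronic Fund Transfer Act (EFTA)', '15 U.S.C. § 1693 et seq.', 'Consumer Financial Protection Act of 2010 (CFPA)', '12 C.F.R. pt. 1005']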
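The third approach, combining the two, could be sketched as follows: regex matches are converted into spaCy spans and merged into `doc.ents` alongside whatever the model found. This is one possible way to combine them, not necessarily the only one; the `tag_citations` helper and the `CITATION_RE` pattern are illustrative assumptions, and a blank pipeline is used so the sketch runs without a model download.

```python
import re

import spacy
from spacy.util import filter_spans

# Hypothetical pattern for "15 U.S.C. § 1693" / "12 C.F.R. pt. 1005" style citations.
CITATION_RE = re.compile(r"\b\d{1,2}\s(?:U\.S\.C\.|C\.F\.R\.)\s(?:§{1,2}|pt\.)\s\d+")


def tag_citations(doc):
    """Add regex matches to doc.ents as LAW spans, alongside existing entities."""
    spans = list(doc.ents)
    for match in CITATION_RE.finditer(doc.text):
        # char_span returns None when the match does not align with token boundaries.
        span = doc.char_span(match.start(), match.end(), label="LAW")
        if span is not None:
            spans.append(span)
    # filter_spans removes overlaps, preferring the longest span.
    doc.ents = filter_spans(spans)
    return doc


# Tokenizer-only pipeline; swap in spacy.load("en_core_web_sm") to merge the
# regex matches with the statistical model's own LAW entities.
nlp = spacy.blank("en")

text = (
    "The Electronic Fund Transfer Act (EFTA), 15 U.S.C. § 1693 et seq., "
    "is a federal law that governs electronic fund transfers."
)

doc = tag_citations(nlp(text))
regulatory_citations = [ent.text for ent in doc.ents if ent.label_ == "LAW"]
print("Regulatory Citations:", regulatory_citations)
```

With a trained model loaded, the regex supplies the rigidly formatted citations while the model contributes named acts that no fixed pattern anticipates.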