Data Pipeline Documentation

Semester.ly’s data pipeline provides the infrastructure that fills the database with course information. Whether a given university offers an API or an online course catalogue, the pipeline gives developers a simple framework for pulling that information and saving it in our Django model format.

General System Workflow

  1. Pull HTML/JSON markup from a catalogue/API

  2. Map the fields of the markup to the fields of our ingestor (by simply filling a Python dictionary).

  3. The ingestor preprocesses the data, validates it, and writes it to JSON.

  4. Load the JSON into the database.

Note

This process happens automatically via Django/Celery Beat Periodic Tasks. You can learn more about these scheduled tasks below (Scheduled Tasks).

Steps 1 and 2 are what we call parsing, an operation that cannot be generalized across universities. Often a new parser must be written. For more information on this, read Add a School.
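
The four workflow steps can be sketched end to end as follows. This is a minimal, self-contained illustration and not the pipeline's actual code: the sample payload, the source field names id and title, and the helpers pull, map_fields, ingest, and load are all hypothetical, and an in-memory list stands in for the database.

```python
import io
import json

# Hypothetical payload standing in for a catalogue/API response.
RAW = '[{"id": "EN.600.101", "title": "Intro to Programming"}]'


def pull():
    """Step 1: pull markup (here, a canned JSON string)."""
    return json.loads(RAW)


def map_fields(record):
    """Step 2: map source fields onto ingestor-style keys in a plain dict."""
    return {"kind": "course", "code": record["id"], "name": record["title"]}


def ingest(mapped, out):
    """Step 3: validate minimally, then write the records as JSON."""
    for course in mapped:
        if not course["code"] or not course["name"]:
            raise ValueError("course needs a code and a name")
    json.dump(mapped, out)


def load(source, database):
    """Step 4: load the JSON into a stand-in database (a list)."""
    database.extend(json.loads(source))


buffer = io.StringIO()
ingest([map_fields(record) for record in pull()], buffer)
database = []
load(buffer.getvalue(), database)
```

Only map_fields would normally differ from school to school; the other steps are the parts the pipeline aims to share.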

Parsing Library Documentation

Base Parser

class parsing.library.base_parser.BaseParser(school, config=None, output_path=None, output_error_path=None, break_on_error=True, break_on_warning=False, skip_duplicates=True, display_progress_bar=False, validate=True, tracker=None)[source]

Bases: object

Abstract base parser for data pipeline parsers.

extractor
Type

parsing.library.extractor.Extractor

ingestor
Type

parsing.library.ingestor.Ingestor

requester
Type

parsing.library.requester.Requester

school

School that parser is for.

Type

str

end()[source]

Finish the parse.

abstract start(**kwargs)[source]

Start the parse.

Parameters

**kwargs – expanded in child parser.
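
A child parser is expected to implement start() and expand **kwargs into the options it understands. The sketch below shows that shape with a hypothetical stand-in base class (SketchBaseParser) so it runs on its own; the keyword names years and terms are assumptions for illustration, not options confirmed by this page.

```python
from abc import ABC, abstractmethod


class SketchBaseParser(ABC):
    """Hypothetical stand-in for BaseParser; models only school, start, end."""

    def __init__(self, school):
        self.school = school
        self.finished = False

    @abstractmethod
    def start(self, **kwargs):
        """Start the parse; keyword arguments are defined by the child."""

    def end(self):
        """Finish the parse."""
        self.finished = True


class DemoParser(SketchBaseParser):
    """Child parser that names the keyword arguments it accepts."""

    def start(self, years=None, terms=None, **kwargs):
        # A real parser would request and ingest data here.
        self.requested = {"years": years or [], "terms": terms or []}


parser = DemoParser("demo")
parser.start(years=["2024"], terms=["Fall"])
parser.end()
```

Because start() is abstract, instantiating the base class directly raises TypeError, which catches a parser that forgot to implement it.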

Requester

class parsing.library.requester.Requester[source]

Bases: object

get(url, params='', session=None, cookies=None, headers=None, verify=True, **kwargs)[source]

HTTP GET.

Parameters
  • url (str) – url to query

  • params (dict) – payload dictionary of HTTP params (default None)

  • cookies (None, optional) – Description

  • headers (None, optional) – Description

  • verify (bool, optional) – Description

  • **kwargs – Description

Examples

TODO

http_request(do_http_request, type, parse=True, quiet=True, timeout=60, throttle=<function Requester.<lambda>>)[source]

Perform HTTP request.

Parameters
  • do_http_request – function that returns request object

  • type (str) – GET, POST, HEAD

  • parse (bool, optional) – Specifies if return should be parsed. Autodetects parse type as html, xml, or json.

  • quiet (bool, optional) – suppress output if True (default True)

  • timeout (int, optional) – Description

  • throttle (lambda, optional) – Description

Returns

the request object if parse is False, otherwise soup: the soupified/jsonified text of the HTTP request

Return type

request object or soup

static markup(response)[source]

Autodetects html, json, or xml format in response.

Parameters

response – raw response object

Returns

marked-up response
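
Format autodetection of this kind can be approximated by trying the strictest format first and falling back. The helper detect_markup below is hypothetical and uses only the standard library; the real method may work differently (the "soupified" wording above suggests an HTML parsing library is involved).

```python
import json
import xml.etree.ElementTree as ET


def detect_markup(text):
    """Return 'json', 'xml', or 'html' for a response body.

    Hypothetical helper: strict parsers are tried first, and anything
    they reject is treated as HTML.
    """
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    try:
        ET.fromstring(text)
        return "xml"
    except ET.ParseError:
        pass
    return "html"


kinds = [
    detect_markup('{"courses": []}'),
    detect_markup("<catalog><course/></catalog>"),
    detect_markup("<html><body><p>Fall<br></body></html>"),
]
```

Note that well-formed XHTML would be reported as xml under this ordering, so a production detector would likely also consult the Content-Type header.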

new_user_agent()[source]
overwrite_header(new_headers)[source]
post(url, data='', params='', cookies=None, headers=None, verify=True, **kwargs)[source]

HTTP POST.

Parameters
  • url (str) – url to query

  • data (str, optional) – HTTP form key-value dictionary

  • params (dict) – payload dictionary of HTTP params

  • cookies (None, optional) – Description

  • headers (None, optional) – Description

  • verify (bool, optional) – Description

  • **kwargs – Description

Ingestor

exception parsing.library.ingestor.IngestionError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Ingestor error class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.ingestor.IngestionWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Ingestor warning class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class parsing.library.ingestor.Ingestor(config, output, break_on_error=True, break_on_warning=False, display_progress_bar=True, skip_duplicates=True, validate=True, tracker=<parsing.library.tracker.NullTracker object>)[source]

Bases: dict

Ingest parsing data into formatted json.

Mimics functionality of dict.

ALL_KEYS

Set of keys supported by Ingestor.

Type

set

break_on_error

Break or continue on errors.

Type

bool

break_on_warning

Break or continue on warnings.

Type

bool

school

School code (e.g. jhu, gw, umich).

Type

str

skip_duplicates

Skip ingestion for repeated definitions.

Type

bool

tracker

Tracker object.

Type

library.tracker

UNICODE_WHITESPACE

regex that matches Unicode whitespace.

Type

TYPE

validate

Enable/disable validation.

Type

bool

validator

Validator instance.

Type

library.validator

ALL_KEYS = {'areas', 'author', 'campus', 'capacity', 'code', 'coreqs', 'corequisites', 'cores', 'cost', 'course', 'course_code', 'course_name', 'course_section_id', 'credits', 'date', 'date_end', 'date_start', 'dates', 'day', 'days', 'department', 'department_code', 'department_name', 'dept', 'dept_code', 'dept_name', 'descr', 'description', 'detail_url', 'end_time', 'enrollment', 'enrolment', 'exclusions', 'fee', 'fees', 'final_exam', 'geneds', 'homepage', 'image_url', 'instr', 'instr_name', 'instr_names', 'instrs', 'instructor', 'instructor_name', 'instructors', 'isbn', 'kind', 'level', 'loc', 'location', 'meeting_section', 'meetings', 'name', 'num_credits', 'offerings', 'pos', 'prereqs', 'prerequisites', 'remaining_seats', 'required', 'same_as', 'school', 'school_subdivision_code', 'school_subdivision_name', 'score', 'section', 'section_code', 'section_name', 'section_type', 'sections', 'semester', 'size', 'start_time', 'sub_school', 'summary', 'term', 'textbooks', 'time', 'time_end', 'time_start', 'title', 'type', 'waitlist', 'waitlist_size', 'website', 'where', 'writing_intensive', 'year'}
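
Since the Ingestor mimics a dict and publishes a fixed set of supported keys, one plausible design is a dict subclass that rejects keys outside that set. The sketch below illustrates the idea only: KeyCheckedDict is hypothetical, it uses a small subset of the keys listed in ALL_KEYS, and this page does not confirm that the real Ingestor rejects unknown keys or which exception it would raise.

```python
class KeyCheckedDict(dict):
    """Hypothetical dict subclass that only accepts a fixed set of keys."""

    # A small subset of the keys listed in ALL_KEYS.
    ALLOWED = {"code", "name", "description", "instrs", "section", "time_start"}

    def __setitem__(self, key, value):
        if key not in self.ALLOWED:
            raise KeyError(f"{key!r} is not a supported key")
        super().__setitem__(key, value)

    def update(self, *args, **kwargs):
        # dict.update bypasses __setitem__, so route each pair through it.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value


record = KeyCheckedDict()
record["code"] = "EN.600.101"
record.update(name="Intro to Programming")
```

Rejecting unknown keys early turns a misspelled field name into an immediate error instead of silently dropped data.
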
clear() → None.  Remove all items from D.
copy() → a shallow copy of D
end()[source]

Finish ingesting.

Close i/o, clear internal state, write meta info

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

ingest_course()[source]

Create course json from info in model map.

Returns

course

Return type

dict

ingest_eval()[source]

Create evaluation json object.

Returns

eval

Return type

dict

ingest_meeting(section, clean_only=False)[source]

Create meeting ingested json map.

Parameters

section (dict) – validated section object

Returns

meeting

Return type

dict

ingest_section(course)[source]

Create section json object from info in model map.

Parameters

course (dict) – validated course object

Returns

section

Return type

dict
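
The signatures above suggest a chain: a course is ingested first, the resulting course is passed to ingest_section, and the resulting section is passed to ingest_meeting. The sketch below mirrors that chain with a hypothetical stand-in class (SketchIngestor); the shape of the returned dictionaries is an assumption made for illustration.

```python
class SketchIngestor(dict):
    """Hypothetical stand-in: fields are set like a dict, then ingested."""

    def ingest_course(self):
        return {
            "kind": "course",
            "code": self["course_code"],
            "name": self["course_name"],
        }

    def ingest_section(self, course):
        return {
            "kind": "section",
            "course": {"code": course["code"]},
            "code": self["section_code"],
        }

    def ingest_meeting(self, section):
        return {
            "kind": "meeting",
            "course": section["course"],
            "section": {"code": section["code"]},
            "days": self["days"],
            "time": {"start": self["time_start"], "end": self["time_end"]},
        }


ingestor = SketchIngestor()
ingestor["course_code"] = "EN.600.101"
ingestor["course_name"] = "Intro to Programming"
course = ingestor.ingest_course()

ingestor["section_code"] = "01"
section = ingestor.ingest_section(course)

ingestor.update(days=["M", "W"], time_start="09:00", time_end="10:15")
meeting = ingestor.ingest_meeting(section)
```

Passing each result into the next call keeps the parent reference (course code, section code) attached to every child object.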

ingest_textbook()[source]

Create textbook json object.

Returns

textbook

Return type

dict

Create textbook link json object.

Parameters

section (None, dict, optional) – Description

Returns

textbook link.

Return type

dict

items() → a set-like object providing a view on D’s items
keys() → a set-like object providing a view on D’s keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None.  Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
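
The three cases described for update() behave as follows on a plain dict, which is the behaviour the Ingestor inherits unless it overrides the method (this page does not say whether it does).

```python
settings = {"a": 1}

# E has a .keys() method: each D[k] = E[k] is copied.
settings.update({"b": 2})

# E lacks .keys(): it is iterated as (key, value) pairs.
settings.update([("c", 3)])

# Keyword arguments F are applied last, so they win over E.
settings.update({"a": 0}, a=10)
```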

values() → an object providing a view on D’s values

Validator

exception parsing.library.validator.MultipleDefinitionsWarning(data, *args)[source]

Bases: parsing.library.validator.ValidationWarning

Duplicated key in data definition.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.validator.ValidationError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Validator error class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.validator.ValidationWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Validator warning class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class parsing.library.validator.Validator(config, tracker=None, relative=True)[source]

Bases: object

Validation engine in parsing data pipeline.

config

Loaded config.json.

Type

DotDict

course_code_regex

Regex to match course code.

Type

re

kind_to_validation_function

Map kind to validation function defined within this class.

Type

dict

KINDS

Kinds of objects that validator validates.

Type

set

relative

Enforce relative ordering in validation.

Type

bool

seen

Running monitor of seen courses and sections

Type

dict

tracker
Type

parsing.library.tracker.Tracker

KINDS = {'config', 'course', 'datalist', 'directory', 'eval', 'final_exam', 'instructor', 'meeting', 'section', 'textbook', 'textbook_link'}
static file_to_json(path, allow_duplicates=False)[source]

Load file pointed to by path into json object dictionary.

Parameters
  • path (str) –

  • allow_duplicates (bool, optional) – Allow duplicate keys in JSON.

Returns

JSON-compliant dictionary.

Return type

dict
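
Python's json module silently keeps the last value when a key repeats, so rejecting duplicates needs an explicit check. One way to do that, sketched here with a hypothetical helper (loads_checked) that works on text instead of a file path, is the object_pairs_hook argument of json.loads. Whether file_to_json uses this mechanism is an assumption.

```python
import json


def loads_checked(text, allow_duplicates=False):
    """Parse JSON text, optionally rejecting objects with repeated keys."""

    def build(pairs):
        # `pairs` holds every (key, value) of one JSON object, in order.
        keys = [key for key, _ in pairs]
        if not allow_duplicates and len(keys) != len(set(keys)):
            raise ValueError(f"duplicate keys in object: {keys}")
        return dict(pairs)

    return json.loads(text, object_pairs_hook=build)


clean = loads_checked('{"code": "EN.600.101", "name": "Intro"}')
lenient = loads_checked('{"code": "A", "code": "B"}', allow_duplicates=True)
```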

classmethod load_schemas(schema_path=None)[source]

Load JSON validation schemas.

NOTE: Will load schemas as static variable (i.e. once per definition), unless schema_path is specifically defined.

Parameters

schema_path (None, str, optional) – Override default schema_path

static schema_validate(data, schema, resolver=None)[source]

Validate data object with JSON schema alone.

Parameters
  • data (dict) – Data object to validate.

  • schema – JSON schema to validate against.

  • resolver (None, optional) – JSON Schema reference resolution.

Raises

jsonschema.exceptions.ValidationError – Invalid object.

validate(data, transact=True)[source]

Validation entry/dispatcher.

Parameters

data (list, dict) – Data to validate.
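
The combination of a validate() entry point and a kind_to_validation_function map suggests a dispatch table keyed on each object's kind. The sketch below shows that pattern with a hypothetical class (SketchValidator) and deliberately trivial checks; it uses ValueError where the real validator documents ValidationError.

```python
class SketchValidator:
    """Hypothetical dispatcher: routes each object to a check by its kind."""

    def __init__(self):
        self.kind_to_validation_function = {
            "course": self.validate_course,
            "section": self.validate_section,
        }
        self.validated = []

    def validate(self, data):
        # Accept either a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            kind = item.get("kind")
            if kind not in self.kind_to_validation_function:
                raise ValueError(f"unknown kind: {kind!r}")
            self.kind_to_validation_function[kind](item)

    def validate_course(self, course):
        if "code" not in course:
            raise ValueError("course is missing a code")
        self.validated.append(("course", course["code"]))

    def validate_section(self, section):
        if "code" not in section:
            raise ValueError("section is missing a code")
        self.validated.append(("section", section["code"]))


validator = SketchValidator()
validator.validate([
    {"kind": "course", "code": "EN.600.101"},
    {"kind": "section", "code": "01"},
])
validator.validate({"kind": "course", "code": "EN.600.102"})
```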

validate_course(course)[source]

Validate course.

Parameters

course (DotDict) – Course object to validate.

Raises
validate_directory(directory)[source]

Validate directory.

Parameters

directory (str, dict) – Directory to validate. May be either path or object.

Raises

ValidationError – encapsulated IOError

validate_eval(course_eval)[source]

Validate evaluation object.

Parameters

course_eval (DotDict) – Evaluation to validate.

Raises

ValidationError – Invalid evaluation.

validate_final_exam(final_exam)[source]

Validate final exam.

NOTE: currently unused.

Parameters

final_exam (DotDict) – Final Exam object to validate.

Raises

ValidationError – Invalid final exam.

validate_instructor(instructor)[source]

Validate instructor object.

Parameters

instructor (DotDict) – Instructor object to validate.

Raises

ValidationError – Invalid instructor.

validate_location(location)[source]

Validate location.

Parameters

location (DotDict) – Location object to validate.

Raises

ValidationWarning – Invalid location.

validate_meeting(meeting)[source]

Validate meeting object.

Parameters

meeting (DotDict) – Meeting object to validate.

Raises
validate_section(section)[source]

Validate section object.

Parameters

section (DotDict) – Section object to validate.

Raises
validate_self_contained(data_path, break_on_error=True, break_on_warning=False, output_error=None, display_progress_bar=True, master_log_path=None)[source]

Validate JSON file as-is, without the ingestor.

Parameters
  • data_path (str) – Path to data file.

  • break_on_error (bool, optional) – Description

  • break_on_warning (bool, optional) – Description

  • output_error (None, optional) – Error output file path.

  • display_progress_bar (bool, optional) – Description

  • master_log_path (None, optional) – Description

Raises

ValidationError – Description

Validate textbook link.

Parameters

textbook_link (DotDict) – Textbook link object to validate.

Raises

ValidationError – Invalid textbook link.

validate_time_range(start, end)[source]

Validate start time and end time.

There exists an unhandled case if the end time is midnight.

Parameters
  • start (str) – Start time.

  • end (str) – End time.

Raises

ValidationError – Time range is invalid.
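
A minimal sketch of such a check, assuming times are 24-hour "HH:MM" strings (this page does not state the format). The helper check_time_range is hypothetical and raises ValueError where the real method documents ValidationError.

```python
from datetime import datetime


def check_time_range(start, end):
    """Raise ValueError unless start is strictly before end ("HH:MM")."""
    begin = datetime.strptime(start, "%H:%M")
    finish = datetime.strptime(end, "%H:%M")
    if begin >= finish:
        raise ValueError(f"invalid time range: {start} to {end}")


def is_valid_range(start, end):
    """Boolean wrapper around check_time_range."""
    try:
        check_time_range(start, end)
    except ValueError:
        return False
    return True


results = {
    "morning": is_valid_range("09:00", "10:15"),
    "reversed": is_valid_range("10:00", "09:00"),
    "ends_at_midnight": is_valid_range("22:00", "00:00"),
}
```

A plain comparison like this one rejects a meeting that ends at "00:00", because midnight sorts before every other time of day; this may be the kind of case the note about midnight refers to.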

static validate_website(url)[source]

Validate url by sending HEAD request and analyzing response.

Parameters

url (str) – URL to validate.

Raises

ValidationError – URL is invalid.

Logger

class parsing.library.logger.JSONColoredFormatter(fmt=None, datefmt=None, style='%', validate=True)[source]

Bases: logging.Formatter

converter()
localtime([seconds]) -> (tm_year,tm_mon,tm_mday,tm_hour,tm_min,tm_sec,tm_wday,tm_yday,tm_isdst)

Convert seconds since the Epoch to a time tuple expressing local time. When ‘seconds’ is not passed in, convert the current time instead.

default_msec_format = '%s,%03d'
default_time_format = '%Y-%m-%d %H:%M:%S'
format(record)[source]

Format the specified record as text.

The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.

formatException(ei)

Format and return the specified exception information as a string.

This default implementation just uses traceback.print_exception()

formatMessage(record)
formatStack(stack_info)

This method is provided as an extension point for specialized formatting of stack information.

The input data is a string as returned from a call to traceback.print_stack(), but with the last trailing newline removed.

The base implementation just returns the value passed in.

formatTime(record, datefmt=None)

Return the creation time of the specified LogRecord as formatted text.

This method should be called from format() by a formatter which wants to make use of a formatted time. This method can be overridden in formatters to provide for any specific requirement, but the basic behaviour is as follows: if datefmt (a string) is specified, it is used with time.strftime() to format the creation time of the record. Otherwise, an ISO8601-like (or RFC 3339-like) format is used. The resulting string is returned. This function uses a user-configurable function to convert the creation time to a tuple. By default, time.localtime() is used; to change this for a particular formatter instance, set the ‘converter’ attribute to a function with the same signature as time.localtime() or time.gmtime(). To change it for all formatters, for example if you want all logging times to be shown in GMT, set the ‘converter’ attribute in the Formatter class.

usesTime()

Check if the format uses the creation time of the record.

class parsing.library.logger.JSONFormatter(fmt=None, datefmt=None, style='%', validate=True)[source]

Bases: logging.Formatter

Simple JSON extension of Python logging.Formatter.

converter()
localtime([seconds]) -> (tm_year,tm_mon,tm_mday,tm_hour,tm_min,tm_sec,tm_wday,tm_yday,tm_isdst)

Convert seconds since the Epoch to a time tuple expressing local time. When ‘seconds’ is not passed in, convert the current time instead.

default_msec_format = '%s,%03d'
default_time_format = '%Y-%m-%d %H:%M:%S'
format(record)[source]

Format record message.

Parameters

record (logging.LogRecord) – Description

Returns

Prettified JSON string.

Return type

str
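
A formatter that returns prettified JSON can be built by overriding format() on logging.Formatter. The class below (SketchJSONFormatter) is a hypothetical sketch of that idea; the indentation and key sorting are assumptions, not the library's actual output settings.

```python
import json
import logging


class SketchJSONFormatter(logging.Formatter):
    """Hypothetical formatter: pretty-prints dict or list messages as JSON."""

    def format(self, record):
        message = record.msg
        if isinstance(message, (dict, list)):
            return json.dumps(message, indent=2, sort_keys=True)
        # Anything else falls back to the standard formatting.
        return super().format(record)


formatter = SketchJSONFormatter()
json_record = logging.makeLogRecord({"msg": {"kind": "course", "code": "EN.600.101"}})
text_record = logging.makeLogRecord({"msg": "parse finished"})

formatted_json = formatter.format(json_record)
formatted_text = formatter.format(text_record)
```

Attaching the formatter to a handler with handler.setFormatter(formatter) would apply it to every record that handler emits.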

formatException(ei)

Format and return the specified exception information as a string.

This default implementation just uses traceback.print_exception()

formatMessage(record)
formatStack(stack_info)

This method is provided as an extension point for specialized formatting of stack information.

The input data is a string as returned from a call to traceback.print_stack(), but with the last trailing newline removed.

The base implementation just returns the value passed in.

formatTime(record, datefmt=None)

Return the creation time of the specified LogRecord as formatted text.

This method should be called from format() by a formatter which wants to make use of a formatted time. This method can be overridden in formatters to provide for any specific requirement, but the basic behaviour is as follows: if datefmt (a string) is specified, it is used with time.strftime() to format the creation time of the record. Otherwise, an ISO8601-like (or RFC 3339-like) format is used. The resulting string is returned. This function uses a user-configurable function to convert the creation time to a tuple. By default, time.localtime() is used; to change this for a particular formatter instance, set the ‘converter’ attribute to a function with the same signature as time.localtime() or time.gmtime(). To change it for all formatters, for example if you want all logging times to be shown in GMT, set the ‘converter’ attribute in the Formatter class.

usesTime()

Check if the format uses the creation time of the record.

class parsing.library.logger.JSONStreamWriter(obj, type_=<class 'list'>, level=0)[source]

Bases: object

Context to stream JSON list to file.

BRACES

Open close brace definitions.

Type

TYPE

file

Current object being JSONified and streamed.

Type

dict

first

Indicator if first write has been done by streamer.

Type

bool

level

Nesting level of streamer.

Type

int

type_

Actual type class of streamer (dict or list).

Type

dict, list

Examples

>>> with JSONStreamWriter(sys.stdout, type_=dict) as streamer:
...     streamer.write('a', 1)
...     streamer.write('b', 2)
...     streamer.write('c', 3)
{
    "a": 1,
    "b": 2,
    "c": 3
}
>>> with JSONStreamWriter(sys.stdout, type_=dict) as streamer:
...     streamer.write('a', 1)
...     with streamer.write('data', type_=list) as streamer2:
...         streamer2.write({0:0, 1:1, 2:2})
...         streamer2.write({3:3, 4:'4'})
...     streamer.write('b', 2)
{
    "a": 1,
    "data":
    [
        {
            0: 0,
            1: 1,
            2: 2
        },
        {
            3: 3,
            4: "4"
        }
    ],
    "b": 2
}
BRACES = {<class 'list'>: ('[', ']'), <class 'dict'>: ('{', '}')}
enter()[source]

Wrapper for self.__enter__.

exit()[source]

Wrapper for self.__exit__.

write(*args, **kwargs)[source]

Write to JSON in streaming fashion.

Picks either write_obj or write_key_value

Parameters
  • *args – pass-through

  • **kwargs – pass-through

Returns

return value of appropriate write function.

Raises

ValueError – type_ is not of type list or dict.

write_key_value(key, value=None, type_=<class 'list'>)[source]

Write key, value pair as string to file.

If value is not given, returns new list streamer.

Parameters
  • key (str) – Description

  • value (str, dict, None, optional) – Description

  • type_ (str, optional) – Description

Returns

None if value is given, else new JSONStreamWriter

write_obj(obj)[source]

Write obj as JSON to file.

Parameters

obj (dict) – Serializable obj to write to file.
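
The core of a streaming JSON writer is small: emit the opening brace on entry, track whether a separator is needed before each item, and emit the closing brace on exit. The class below (ListStreamer) is a hypothetical, list-only sketch of that mechanism and omits the nesting and indentation that JSONStreamWriter supports.

```python
import io
import json


class ListStreamer:
    """Hypothetical context manager that streams items into a JSON list."""

    def __init__(self, file):
        self.file = file
        self.first = True  # no separator is needed before the first item

    def __enter__(self):
        self.file.write("[")
        return self

    def write(self, obj):
        if not self.first:
            self.file.write(",")
        self.first = False
        self.file.write(json.dumps(obj))

    def __exit__(self, exc_type, exc_value, traceback):
        self.file.write("]")
        return False  # never suppress exceptions


buffer = io.StringIO()
with ListStreamer(buffer) as streamer:
    for code in ("A", "B", "C"):
        streamer.write({"code": code})
```

Writing item by item keeps memory use flat, since the full list never has to exist in memory at once.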

parsing.library.logger.colored_json(j)[source]

Tracker

class parsing.library.tracker.NullTracker(*args, **kwargs)[source]

Bases: parsing.library.tracker.Tracker

Dummy tracker used as an interface placeholder.

BROADCAST_TYPES = {'DEPARTMENT', 'INSTRUCTOR', 'MODE', 'SCHOOL', 'STATS', 'TERM', 'TIME', 'YEAR'}
add_viewer(viewer, name=None)

Add viewer to broadcast queue.

Parameters
  • viewer (Viewer) – Viewer to add.

  • name (None, str, optional) – Name the viewer.

broadcast(broadcast_type)[source]

Do nothing.

property department
end()

End tracker and report to viewers.

get_viewer(name)

Get viewer by name.

Will return arbitrary match if multiple viewers with same name exist.

Parameters

name (str) – Viewer name to get.

Returns

Viewer instance if found, else None

Return type

Viewer

has_viewer(name)

Determine if name exists in viewers.

Parameters

name (str) – The name to check against.

Returns

True if name in viewers else False

Return type

bool

property instructor
property mode
remove_viewer(name)

Remove all viewers that match name.

Parameters

name (str) – Viewer name to remove.

report()[source]

Do nothing.

property school
start()

Start timer of tracker object.

property stats
property term
property time
property year
class parsing.library.tracker.Tracker[source]

Bases: object

Tracks specified attributes and broadcasts to viewers.

@property attributes are defined for all BROADCAST_TYPES

BROADCAST_TYPES = {'DEPARTMENT', 'INSTRUCTOR', 'MODE', 'SCHOOL', 'STATS', 'TERM', 'TIME', 'YEAR'}
add_viewer(viewer, name=None)[source]

Add viewer to broadcast queue.

Parameters
  • viewer (Viewer) – Viewer to add.

  • name (None, str, optional) – Name the viewer.

broadcast(broadcast_type)[source]

Broadcast tracker update to viewers.

Parameters

broadcast_type (str) – message to go along broadcast bus.

Raises

TrackerError – if broadcast_type is not in BROADCAST_TYPES.

end()[source]

End tracker and report to viewers.

get_viewer(name)[source]

Get viewer by name.

Will return arbitrary match if multiple viewers with same name exist.

Parameters

name (str) – Viewer name to get.

Returns

Viewer instance if found, else None

Return type

Viewer

has_viewer(name)[source]

Determine if name exists in viewers.

Parameters

name (str) – The name to check against.

Returns

True if name in viewers else False

Return type

bool

remove_viewer(name)[source]

Remove all viewers that match name.

Parameters

name (str) – Viewer name to remove.

report()[source]

Notify viewers that tracker has ended.

start()[source]

Start timer of tracker object.
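
The tracker and its viewers follow an observer pattern: viewers register under an optional name, and each broadcast is forwarded to every registered viewer. The sketch below reproduces that flow with hypothetical stand-ins (SketchTracker, YearLog), a reduced set of broadcast types, and ValueError in place of TrackerError.

```python
class SketchTracker:
    """Hypothetical tracker: stores named viewers and broadcasts updates."""

    BROADCAST_TYPES = {"SCHOOL", "TERM", "YEAR"}

    def __init__(self):
        self.viewers = []  # list of (name, viewer) pairs
        self.year = None

    def add_viewer(self, viewer, name=None):
        self.viewers.append((name, viewer))

    def has_viewer(self, name):
        return any(n == name for n, _ in self.viewers)

    def remove_viewer(self, name):
        self.viewers = [(n, v) for n, v in self.viewers if n != name]

    def broadcast(self, broadcast_type):
        if broadcast_type not in self.BROADCAST_TYPES:
            raise ValueError(f"unsupported broadcast type: {broadcast_type}")
        for _, viewer in self.viewers:
            viewer.receive(self, broadcast_type)


class YearLog:
    """Viewer that records every year the tracker announces."""

    def __init__(self):
        self.years = []

    def receive(self, tracker, broadcast_type):
        if broadcast_type == "YEAR":
            self.years.append(tracker.year)


tracker = SketchTracker()
log = YearLog()
tracker.add_viewer(log, name="years")
tracker.year = "2024"
tracker.broadcast("YEAR")
tracker.broadcast("TERM")  # delivered, but ignored by YearLog
```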

exception parsing.library.tracker.TrackerError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Tracker error class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

Viewer

class parsing.library.viewer.ETAProgressBar[source]

Bases: parsing.library.viewer.Viewer

receive(tracker, broadcast_type)[source]

Incremental updates of tracking info.

Parameters
  • tracker (Tracker) – Tracker instance.

  • broadcast_type (str) – Broadcast type emitted by tracker.

report(tracker)[source]

Do nothing.

class parsing.library.viewer.Hoarder[source]

Bases: parsing.library.viewer.Viewer

Accumulate a log of some properties of the tracker.

receive(tracker, broadcast_type)[source]

Receive an update from a tracker.

Ignore all broadcasts that are not TIME.

Parameters
report(tracker)[source]

Do nothing.

property schools

Get schools attribute (i.e. self.schools).

Returns

Value of schools storage value.

Return type

dict

class parsing.library.viewer.StatProgressBar(stat_format='', statistics=None)[source]

Bases: parsing.library.viewer.Viewer

Command line progress bar viewer for data pipeline.

SWITCH_SIZE = 100
receive(tracker, broadcast_type)[source]

Incremental update to progress bar.

report(tracker)[source]

Do nothing.

class parsing.library.viewer.StatView[source]

Bases: parsing.library.viewer.Viewer

Keeps a view of statistics for objects processed by the pipeline.

KINDS

The kinds of objects that can be tracked. TODO - move this to a shared space w/Validator

Type

tuple

LABELS

The status labels of objects that can be tracked.

Type

tuple

stats

The view itself of the stats.

Type

dict

KINDS = ('course', 'section', 'meeting', 'textbook', 'evaluation', 'offering', 'textbook_link', 'eval')
LABELS = ('valid', 'created', 'new', 'updated', 'total')
receive(tracker, broadcast_type)[source]

Receive an update from a tracker.

Ignore all broadcasts that are not STATS.

Parameters
report(tracker=None)[source]

Dump stats.

class parsing.library.viewer.TimeDistributionView[source]

Bases: parsing.library.viewer.Viewer

Viewer to analyze time distribution.

Calculates granularity and holds the report and the 12-hour and 24-hour distributions.

distribution

Contains counts of 12 and 24hr sightings.

Type

dict

granularity

Time granularity of viewed times.

Type

int

receive(tracker, broadcast_type)[source]

Receive an update from a tracker.

Ignore all broadcasts that are not TIME.

Parameters
report(tracker)[source]

Do nothing.

class parsing.library.viewer.Timer(format='%(elapsed)s', **kwargs)[source]

Bases: progressbar.widgets.FormatLabel, progressbar.widgets.TimeSensitiveWidgetBase

Custom timer created to take away ‘Elapsed Time’ string.

INTERVAL = datetime.timedelta(microseconds=100000)
check_size(progress)
mapping = {'elapsed': ('total_seconds_elapsed', <function format_time>), 'finished': ('end_time', None), 'last_update': ('last_update_time', None), 'max': ('max_value', None), 'seconds': ('seconds_elapsed', None), 'start': ('start_time', None), 'value': ('value', None)}
required_values = []
class parsing.library.viewer.Viewer[source]

Bases: object

A view that is updated via a tracker object broadcast or report.

abstract receive(tracker, broadcast_type)[source]

Incremental updates of tracking info.

Parameters
  • tracker (Tracker) – Tracker instance.

  • broadcast_type (str) – Broadcast type emitted by tracker.

abstract report(tracker)[source]

Report all tracked info.

Parameters

tracker (Tracker) – Tracker instance.
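
A custom viewer only needs to implement the two abstract methods. The sketch below defines a hypothetical stand-in base (SketchViewer) so it runs on its own, plus a viewer that counts broadcasts by type; passing None as the tracker is a shortcut for the demonstration only.

```python
from abc import ABC, abstractmethod
from collections import Counter


class SketchViewer(ABC):
    """Hypothetical stand-in for the abstract Viewer interface."""

    @abstractmethod
    def receive(self, tracker, broadcast_type):
        """Handle one incremental update."""

    @abstractmethod
    def report(self, tracker):
        """Report everything tracked so far."""


class BroadcastCounter(SketchViewer):
    """Viewer that counts how many broadcasts of each type it received."""

    def __init__(self):
        self.counts = Counter()

    def receive(self, tracker, broadcast_type):
        self.counts[broadcast_type] += 1

    def report(self, tracker):
        return dict(self.counts)


viewer = BroadcastCounter()
for broadcast_type in ("TIME", "TIME", "YEAR"):
    viewer.receive(None, broadcast_type)
summary = viewer.report(None)
```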

exception parsing.library.viewer.ViewerError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Viewer error class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

Digestor

class parsing.library.digestor.Absorb(school, meta)[source]

Bases: parsing.library.digestor.DigestionStrategy

Load valid data into Django db.

meta

Meta-information to use for DataUpdate object

Type

dict

school
Type

str

classmethod digest_section(parmams, clean=True)[source]
static remove_offerings(section_obj)[source]

Remove all offerings associated with a section.

Parameters

section_obj (Section) – Description

static remove_section(section_code, course_obj)[source]

Remove section specified from database.

Parameters
  • section (dict) – Description

  • course_obj (Course) – Section part of this course.

wrap_up()[source]

Update the school’s last-updated time when the parse wraps up.

class parsing.library.digestor.Burp(school, meta, output=None)[source]

Bases: parsing.library.digestor.DigestionStrategy

Load valid data into Django db and output diff between input and db data.

absorb

Digestion strategy.

Type

Absorb

vommit

Digestion strategy.

Type

Vommit

wrap_up()[source]

Do whatever needs to be done to wrap_up digestion session.
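
The three strategies can be read as a strategy pattern in which Burp composes the other two: one loads, one diffs, and the combined strategy does both. The sketch below illustrates that composition with hypothetical classes; the digest() method, the dict used as a database, and the "_updated" marker are assumptions for illustration, not the library's interface.

```python
from abc import ABC, abstractmethod


class SketchStrategy(ABC):
    @abstractmethod
    def digest(self, item):
        """Process one validated item."""

    @abstractmethod
    def wrap_up(self):
        """Finish the digestion session."""


class LoadOnly(SketchStrategy):
    """Plays the role of Absorb: write items into the database."""

    def __init__(self, database):
        self.database = database

    def digest(self, item):
        self.database[item["code"]] = item

    def wrap_up(self):
        self.database["_updated"] = True


class DiffOnly(SketchStrategy):
    """Plays the role of Vommit: record items that differ from the database."""

    def __init__(self, database, output):
        self.database = database
        self.output = output

    def digest(self, item):
        if self.database.get(item["code"]) != item:
            self.output.append(item["code"])

    def wrap_up(self):
        pass


class LoadAndDiff(SketchStrategy):
    """Plays the role of Burp: diff first, then load."""

    def __init__(self, database, output):
        self.diff = DiffOnly(database, output)
        self.load = LoadOnly(database)

    def digest(self, item):
        self.diff.digest(item)  # must run before the load overwrites the row
        self.load.digest(item)

    def wrap_up(self):
        self.diff.wrap_up()
        self.load.wrap_up()


database = {"A": {"code": "A", "name": "old"}}
changes = []
strategy = LoadAndDiff(database, changes)
strategy.digest({"code": "A", "name": "new"})
strategy.digest({"code": "B", "name": "b"})
strategy.wrap_up()
```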

class parsing.library.digestor.DigestionAdapter(school, cached, short_course_weeks_limit)[source]

Bases: object

Converts JSON definitions to model-compliant dictionary.

cache

Caches Django objects to avoid redundant queries.

Type

dict

school

School code.

Type

str

adapt_course(course)[source]

Adapt course for digestion.

Parameters

course (dict) – course info

Returns

Adapted course for django object.

Return type

dict

Raises

DigestionError – course is None

adapt_evaluation(evaluation)[source]

Adapt evaluation to model dictionary.

Parameters

evaluation (dict) – validated evaluation.

Returns

Description

Return type

dict

adapt_meeting(meeting, section_model=None)[source]

Adapt meeting to Django model.

Parameters
  • meeting (TYPE) – Description

  • section_model (None, optional) – Description

Yields

dict

Raises

DigestionError – meeting is None.

adapt_section(section, course_model=None)[source]

Adapt section to Django model.

Parameters
  • section (TYPE) – Description

  • course_model (None, optional) – Description

Returns

formatted section dictionary

Return type

dict

Raises

DigestionError – Description

adapt_textbook(textbook)[source]

Adapt textbook to model dictionary.

Parameters

textbook (dict) – validated textbook.

Returns

Description

Return type

dict

Adapt textbook link to model dictionary.

Parameters
  • textbook_link (dict) – validated

  • textbook_model (model, None, optional) –

  • section_model (model, None, optional) –

Yields

dict – model compliant

exception parsing.library.digestor.DigestionError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Digestor error class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class parsing.library.digestor.DigestionStrategy[source]

Bases: object

abstract wrap_up()[source]

Do whatever needs to be done to wrap_up digestion session.

class parsing.library.digestor.Digestor(school, meta, tracker=<parsing.library.tracker.NullTracker object>)[source]

Bases: object

Digestor in data pipeline.

adapter

Adapts

Type

DigestionAdapter

cache

Caches recently used Django objects to be used as foreign keys.

Type

dict

data

The data to be digested.

Type

TYPE

meta

meta data associated with input data.

Type

dict

MODELS

mapping from object type to Django model class.

Type

dict

school

School to digest.

Type

str

strategy

Load and/or diff db depending on strategy

Type

DigestionStrategy

tracker

Description

Type

parsing.library.tracker.Tracker

MODELS = {'course': <class 'timetable.models.Course'>, 'evaluation': <class 'timetable.models.Evaluation'>, 'offering': <class 'timetable.models.Offering'>, 'section': <class 'timetable.models.Section'>, 'semester': <class 'timetable.models.Semester'>, 'textbook': <class 'timetable.models.Textbook'>, 'textbook_link': <class 'timetable.models.TextbookLink'>}
digest(data, diff=True, load=True, output=None)[source]

Digest data.

digest_course(course)[source]

Create course in database from info in json model.

Returns

django course model object

digest_eval(evaluation)[source]

Digest evaluation.

Parameters

evaluation (dict) –

digest_meeting(meeting, section_model=None)[source]

Create offering in database from info in model map.

Parameters

section_model – JSON course model object

Returns

Offerings as generator

digest_section(section, course_model=None)[source]

Create section in database from info in model map.

Parameters

course_model – django course model object

Keyword Arguments

clean (boolean) – removes course offerings associated with section if set

Returns

django section model object

digest_textbook(textbook)[source]

Digest textbook.

Parameters

textbook (dict) –

Digest textbook link.

Parameters
wrap_up()[source]
class parsing.library.digestor.Vommit(output)[source]

Bases: parsing.library.digestor.DigestionStrategy

Output diff between input and db data.

diff(kind, inmodel, dbmodel, hide_defaults=True)[source]

Create a diff between input and existing model.

Parameters
  • kind (str) – kind of object to diff.

  • inmodel (model) – Description

  • dbmodel (model) – Description

  • hide_defaults (bool, optional) – hide values that are defaulted into db

Returns

Diff

Return type

dict
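
A field-by-field diff with an option to hide database defaults might look like the sketch below. It works on plain dictionaries where the real method takes model objects, and the function name diff_models, the explicit defaults argument, and the output shape are all assumptions.

```python
def diff_models(incoming, existing, defaults=None, hide_defaults=True):
    """Return {field: {"input": ..., "db": ...}} for fields that differ."""
    defaults = defaults or {}
    changes = {}
    for field in sorted(set(incoming) | set(existing)):
        new, old = incoming.get(field), existing.get(field)
        if new == old:
            continue
        # Optionally skip fields the database merely filled with a default.
        if hide_defaults and field not in incoming and old == defaults.get(field):
            continue
        changes[field] = {"input": new, "db": old}
    return changes


incoming = {"name": "Algorithms", "credits": 4}
existing = {"name": "Algorithms", "credits": 3, "campus": "main"}
defaults = {"campus": "main"}

hidden = diff_models(incoming, existing, defaults)
shown = diff_models(incoming, existing, defaults, hide_defaults=False)
```

Hiding defaults keeps the diff focused on values the parser actually supplied, which matters when most rows carry several defaulted columns.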

static get_model_defaults()[source]
remove_defaulted_keys(kind, dct)[source]
wrap_up()[source]

Do whatever needs to be done to wrap_up digestion session.

Exceptions

exception parsing.library.exceptions.ParseError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineError

Parser error class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.exceptions.ParseJump(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Parser exception used for control flow.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.exceptions.ParseWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineWarning

Parser warning class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.exceptions.PipelineError(data, *args)[source]

Bases: parsing.library.exceptions.PipelineException

Data-pipeline error class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.exceptions.PipelineException(data, *args)[source]

Bases: Exception

Data-pipeline exception class.

Should never be constructed directly. Use:
  • PipelineError

  • PipelineWarning

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception parsing.library.exceptions.PipelineWarning(data, *args)[source]

Bases: parsing.library.exceptions.PipelineException, UserWarning

Data-pipeline warning class.

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
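
The class hierarchy documented above can be sketched with local stand-in classes to show how a caller might branch on errors versus warnings. The class names mirror the documentation, but the bodies are assumptions: in particular, storing data on the instance and the handle helper are hypothetical and not confirmed by the source.

```python
class PipelineException(Exception):
    """Stand-in base class; not meant to be raised directly."""

    def __init__(self, data, *args):
        # Assumption: the offending data is kept on the exception.
        self.data = data
        super().__init__(data, *args)


class PipelineError(PipelineException):
    """Stand-in for the data-pipeline error class."""


class PipelineWarning(PipelineException, UserWarning):
    """Stand-in for the data-pipeline warning class."""


class ParseError(PipelineError):
    """Stand-in for the parser error class."""


class ParseWarning(PipelineWarning):
    """Stand-in for the parser warning class."""


def handle(exc):
    """Classify an exception by its place in the hierarchy."""
    if isinstance(exc, PipelineError):
        return "error"
    if isinstance(exc, PipelineWarning):
        return "warning"
    return "unknown"


print(handle(ParseError({"code": "EN.600.120"}, "missing title")))
print(handle(ParseWarning({"code": "EN.600.120"}, "odd credits")))
```

Because every class descends from PipelineException, a single except PipelineException clause would catch both branches when the distinction does not matter.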

Extractor

class parsing.library.extractor.Extraction(key, container, patterns)

Bases: tuple

container

Alias for field number 1

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 0

patterns

Alias for field number 2
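
The field numbering above matches what collections.namedtuple produces. The sketch below rebuilds an equivalent tuple type locally; the field values are hypothetical and only illustrate positional versus named access.

```python
from collections import namedtuple

# Field order follows the documented aliases: key=0, container=1, patterns=2.
Extraction = namedtuple("Extraction", ["key", "container", "patterns"])

# Hypothetical values, chosen only for illustration.
ex = Extraction(key="prerequisites", container="course", patterns=[r"prereq"])

print(ex.key, ex[0])        # named and positional access agree
print(ex.index("course"))   # inherited tuple method
```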

parsing.library.extractor.extract_info_from_text(text, inject=None, extractions=None, use_lowercase=True, splice_text=True)[source]

Attempt to extract info from text and put it into course object.

NOTE: Currently unstable and unused as it introduces too many bugs.

Might reconsider for later use.

Parameters
  • text (str) – text to attempt to extract information from

  • extractions (None, optional) – Description

  • inject (None, optional) – Description

  • use_lowercase (bool, optional) – Description

Returns

the text trimmed of extracted information

Return type

str

Utils

class parsing.library.utils.DotDict(dct)[source]

Bases: dict

Dot notation access for dictionary.

Supports set, get, and delete.

Examples

>>> d = DotDict({'a': 1, 'b': 2, 'c': {'ca': 31}})
>>> d.a, d.b
(1, 2)
>>> d['a']
1
>>> d['a'] = 3
>>> d.a, d['b']
(3, 2)
>>> d.c.ca, d.c['ca']
(31, 31)
as_dict()[source]

Return pure dictionary representation of self.

clear() → None. Remove all items from D.
copy() → a shallow copy of D
fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D’s items
keys() → a set-like object providing a view on D’s keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised.

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values() → an object providing a view on D’s values
class parsing.library.utils.SimpleNamespace(**kwargs)[source]

Bases: object

parsing.library.utils.clean(dirt)[source]

Recursively clean json-like object.

list:
  • remove None elements

  • None on empty list

dict:
  • filter out None valued key, value pairs

  • None on empty dict

str:
  • convert unicode whitespace to ascii

  • strip extra whitespace

  • None on empty string

Parameters

dirt – the object to clean

Returns

Cleaned dict, cleaned list, cleaned string, or pass-through.
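
The rules listed above can be sketched as a small recursive function. This is an illustration under assumptions, not the library code: in particular, it collapses interior runs of whitespace as well as stripping the ends, which may be more aggressive than the original.

```python
def clean(dirt):
    """Recursively clean a JSON-like object following the rules above."""
    if isinstance(dirt, dict):
        cleaned = {key: clean(value) for key, value in dirt.items()}
        cleaned = {k: v for k, v in cleaned.items() if v is not None}
        return cleaned or None  # None on empty dict
    if isinstance(dirt, list):
        cleaned = [c for c in (clean(item) for item in dirt) if c is not None]
        return cleaned or None  # None on empty list
    if isinstance(dirt, str):
        # str.split() with no arguments splits on any Unicode whitespace,
        # so joining with " " leaves only ASCII spaces.
        text = " ".join(dirt.split())
        return text or None  # None on empty string
    return dirt  # pass-through for numbers, booleans, None, ...


print(clean({"a": None, "b": ["", None, " x\u00a0 y "], "c": {}}))
```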

parsing.library.utils.dict_filter_by_dict(a, b)[source]

Filter dictionary a by b.

The filter may be a dict or a set. Items or keys must be strings or regexes. Filters at arbitrary depth with regex matching.

Parameters
  • a (dict) – Dictionary to filter.

  • b (dict) – Dictionary to filter by.

Returns

Filtered dictionary

Return type

dict

parsing.library.utils.dict_filter_by_list(a, b)[source]
parsing.library.utils.dir_to_dict(path)[source]

Recursively create nested dictionary representing directory contents.

Parameters

path (str) – The path of the directory.

Returns

Dictionary representation of the directory.

Return type

dict
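
One way to build such a nested dictionary is sketched below. The output shape (name, kind, and children keys) is an assumption made for this example; the documentation does not specify the exact keys the library returns.

```python
import os
import tempfile


def dir_to_dict(path):
    """Return a nested dict describing the file or directory at path."""
    node = {"name": os.path.basename(path)}
    if os.path.isdir(path):
        node["kind"] = "directory"
        # Sorted so the result is deterministic across platforms.
        node["children"] = [
            dir_to_dict(os.path.join(path, name))
            for name in sorted(os.listdir(path))
        ]
    else:
        node["kind"] = "file"
    return node


# Build a throwaway directory tree to demonstrate the result.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "jhu"))
    open(os.path.join(root, "jhu", "courses.json"), "w").close()
    tree = dir_to_dict(root)

print(tree["children"])
```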

parsing.library.utils.is_short_course(date_start, date_end, short_course_weeks_limit)[source]

Check whether a course’s duration is longer than the short-term course week limit. The limit is defined in the config file for the corresponding school.

Parameters
  • date_start (str) – Any reasonable date value for the start date.

  • date_end (str) – Any reasonable date value for the end date.

  • short_course_weeks_limit (int) – Number of weeks a course can be defined as “short term”.

Returns

bool – Defines whether the course is short term or not.

parsing.library.utils.iterrify(x)[source]

Create iterable object if not already.

Will wrap str types in an extra iterable even though str is iterable.

Examples

>>> for i in iterrify(1):
...     print(i)
1
>>> for i in iterrify([1]):
...     print(i)
1
>>> for i in iterrify('hello'):
...     print(i)
hello
parsing.library.utils.make_list(x=None)[source]

Wrap in list if not list already.

If input is None, will return empty list.

Parameters

x – Input.

Returns

Input wrapped in list.

Return type

list
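
The behavior described above fits in a few lines. The sketch below is an assumed equivalent, not the library source.

```python
def make_list(x=None):
    """Wrap x in a list unless it already is one; None becomes []."""
    if x is None:
        return []
    if isinstance(x, list):
        return x
    return [x]


print(make_list(), make_list(1), make_list([1, 2]), make_list("ab"))
```

Note that a string is wrapped whole rather than split into characters, which is usually what a caller iterating over "one or many" values wants.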

parsing.library.utils.pretty_json(obj)[source]

Prettify object as JSON.

Parameters

obj (dict) – Serializable object to JSONify.

Returns

Prettified JSON.

Return type

str
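
A prettifier of this kind is typically a thin wrapper over json.dumps. The sketch below assumes sorted keys and a two-space indent; the library's actual formatting options are not stated in the documentation.

```python
import json


def pretty_json(obj):
    """Serialize obj to indented JSON with sorted keys (assumed options)."""
    return json.dumps(obj, sort_keys=True, indent=2, separators=(",", ": "))


print(pretty_json({"b": 1, "a": [1]}))
```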

parsing.library.utils.safe_cast(val, to_type, default=None)[source]

Attempt to cast to specified type or return default.

Parameters
  • val – Value to cast.

  • to_type – Type to cast to.

  • default (None, optional) – Value to return if the cast fails.

Returns

The cast value, or default if the cast fails.

Return type

to_type
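
The usual shape of such a helper is a try/except around the constructor call. The sketch below assumes that ValueError and TypeError are the failures worth swallowing; the library may catch a different set.

```python
def safe_cast(val, to_type, default=None):
    """Return to_type(val), or default if the conversion fails."""
    try:
        return to_type(val)
    except (ValueError, TypeError):
        return default


print(safe_cast("3", int), safe_cast("x", int), safe_cast(None, float, 0.0))
```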

parsing.library.utils.short_date(date)[source]

Convert input to %m-%d-%y format. Returns None if input is None.

Parameters

date (str) – date in reasonable format

Returns

Date in format %m-%d-%y if the input is not None.

Return type

str

Raises

ParseError – Unparseable time input.

parsing.library.utils.time24(time)[source]

Convert time to 24hr format.

Parameters

time (str) – time in reasonable format

Returns

24hr time in format hh:mm

Return type

str

Raises

ParseError – Unparseable time input.
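
The conversions performed by short_date and time24 can be sketched with the standard library alone. The lists of accepted input formats below are assumptions (the documentation only says "reasonable format"), and this sketch raises ValueError where the library raises ParseError.

```python
from datetime import datetime

# Assumed input formats; the real parser may accept many more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")
TIME_FORMATS = ("%H:%M", "%I:%M %p", "%I:%M%p", "%I %p", "%I%p")


def _parse(text, formats):
    """Try each format in turn and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unparseable input: {text!r}")


def short_date(date):
    """Convert a date string to %m-%d-%y; None passes through."""
    if date is None:
        return None
    return _parse(date, DATE_FORMATS).strftime("%m-%d-%y")


def time24(time):
    """Convert a time string to 24-hour hh:mm."""
    return _parse(time, TIME_FORMATS).strftime("%H:%M")


print(short_date("2017-08-31"), time24("1:30 PM"), time24("9:05am"))
```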

parsing.library.utils.titlize(name)[source]

Format name into pretty title.

Will uppercase roman numerals. Will lowercase conjunctions and prepositions.

Examples

>>> titlize('BIOLOGY OF CANINES II')
'Biology of Canines II'
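
The two rules above can be sketched as follows. The word lists are assumptions chosen for the example: the library's set of small words and its roman numeral detection are not documented, and this sketch only recognizes numerals I through X.

```python
# Assumed word lists; the library's own lists may differ.
ROMAN_NUMERALS = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}
SMALL_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or",
               "the", "to", "with"}


def titlize(name):
    """Title-case name, keeping numerals upper and small words lower."""
    words = []
    for position, word in enumerate(name.split()):
        lowered = word.lower()
        if lowered in ROMAN_NUMERALS:
            words.append(word.upper())
        elif position > 0 and lowered in SMALL_WORDS:
            # The first word is always capitalized, even if it is small.
            words.append(lowered)
        else:
            words.append(word.capitalize())
    return " ".join(words)


print(titlize("BIOLOGY OF CANINES II"))
```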
parsing.library.utils.update(d, u)[source]

Recursive update to dictionary w/o overwriting upper levels.

Examples

>>> update({0: {1: 2, 3: 4}}, {1: 2, 0: {5: 6, 3: 7}})
{0: {1: 2, 3: 7, 5: 6}, 1: 2}
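
A recursive update of this kind is commonly written as below. This is a sketch of one standard approach, not necessarily the library's code; nested mappings are merged key by key, while non-mapping values are replaced.

```python
from collections.abc import Mapping


def update(d, u):
    """Merge u into d in place, recursing into nested mappings."""
    for key, value in u.items():
        if isinstance(value, Mapping):
            current = d.get(key)
            if not isinstance(current, dict):
                current = {}  # nothing mergeable stored yet
            d[key] = update(current, value)
        else:
            d[key] = value
    return d


merged = update({0: {1: 2, 3: 4}}, {1: 2, 0: {5: 6, 3: 7}})
print(merged)  # {0: {1: 2, 3: 7, 5: 6}, 1: 2}
```

Under this approach the nested key 1 under 0 survives the merge, which a plain dict.update would not preserve.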

Parsing Models Documentation

class parsing.models.DataUpdate(*args, **kwargs)[source]

Stores the date/time that the school’s data was last updated.

Scheduled updates occur when digestion into the database completes.

school

the school code that was updated (e.g. jhu)

Type

CharField

semester

the semester for the update

Type

ForeignKey to Semester

last_updated

the datetime last updated

Type

DateTimeField

reason

the reason it was updated (default Scheduled Update)

Type

CharField

update_type

which field was updated

Type

CharField

UPDATE_TYPE

Update types allowed.

Type

tuple of tuple

COURSES

Update type.

Type

str

EVALUATIONS

Update type.

Type

str

MISCELLANEOUS

Update type.

Type

str

TEXTBOOKS

Update type.

Type

str

exception DoesNotExist
exception MultipleObjectsReturned

Scheduled Tasks