Parsing Japanese Enterprise COBOL: Shift-JIS, EBCDIC, and DBCS Challenges
Japanese mainframe COBOL systems present unique parsing challenges: mixed single/double-byte character sets, EBCDIC encoding, and industry-specific extensions for banking and insurance.
Japan runs one of the world's largest installed bases of COBOL systems. Major banks, insurance companies, and government agencies process trillions of yen daily through COBOL programs originally written in the 1970s and 1980s. But parsing Japanese COBOL raises problems that English-language COBOL never does, chiefly multi-byte text inside a column-based source format, and most analysis tools can't handle them.
The Encoding Challenge
Japanese mainframe COBOL uses multiple character encodings, often within the same source file:
- EBCDIC Katakana: IBM's single-byte mainframe code page for Japanese (half-width Katakana), which differs from the Latin EBCDIC code pages used for English-language source
- Shift-JIS: mixed single-byte (ASCII-range and half-width Katakana) and double-byte (Kanji and Kana) encoding used in comments and literals
- EUC-JP: Extended Unix Code, used in open-system COBOL (GnuCOBOL)
- DBCS (Double-Byte Character Set): IBM's double-byte encoding for Kanji in COBOL data items
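The critical issue: COBOL's column-based format breaks with double-byte characters. A Kanji character is one character, but it occupies two bytes in Shift-JIS or DBCS (typically three in UTF-8) and, on most terminals and editors, two display columns. On the mainframe, the shift-out and shift-in control bytes around DBCS text take up positions as well. A parser that counts in the wrong unit (bytes or code points instead of display positions) will misidentify column boundaries, treating program text as comments or sequence numbers.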
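As a rough illustration of counting display positions, the Python sketch below splits one fixed-format source line into its areas (sequence number in columns 1-6, indicator in column 7, program text in columns 8-72, identification in columns 73-80). It assumes the line has already been decoded to Unicode and that full-width characters take two columns; the function names `char_columns` and `split_fixed_line` are made up for this example and do not come from any particular parser.

```python
import unicodedata


def char_columns(ch: str) -> int:
    """Columns one character occupies: 2 for wide/full-width, else 1.

    Assumption: East Asian Width classes W and F map to two columns,
    which matches how Kanji and full-width Kana are usually displayed.
    """
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def split_fixed_line(line: str) -> dict:
    """Split a decoded fixed-format line into areas by display column."""
    areas = {"sequence": "", "indicator": "", "text": "", "identification": ""}
    col = 1  # 1-based display column of the next character
    for ch in line.rstrip("\r\n"):
        if col <= 6:
            areas["sequence"] += ch
        elif col == 7:
            areas["indicator"] += ch
        elif col <= 72:
            areas["text"] += ch
        else:
            areas["identification"] += ch
        col += char_columns(ch)
    return areas
```

Slicing the same decoded line by code points (`line[72:80]`) lands in the wrong place as soon as the text area contains a Kanji literal, because each Kanji is one code point but two columns. The sketch does not handle a wide character that straddles an area boundary, nor shift-out/shift-in bytes, which would need to be counted before decoding.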
Japanese COBOL Language Extensions
IBM Enterprise COBOL for z/OS includes Japanese-specific extensions:
- PIC N: national (Unicode/DBCS) data items, for example 01 WS-NAME PIC N(10)
- USAGE DISPLAY-1: DBCS display format
- NATIONAL-OF / DISPLAY-OF: intrinsic functions for character set conversion
- SHIFT-IN / SHIFT-OUT: the shift-out (X'0E') and shift-in (X'0F') control characters that switch between single-byte and double-byte text inside string literals
These extensions are not part of the COBOL-85 standard and are missing from most open-source COBOL parsers. Our tree-sitter grammar includes support for IBM Enterprise COBOL extensions, including Japanese-specific constructs.
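To make the shift-out/shift-in mechanism concrete, here is a minimal Python sketch that splits a raw EBCDIC byte string into single-byte and double-byte runs using the X'0E' and X'0F' control bytes. It is an illustration under stated assumptions, not how any specific grammar implements it: `split_mixed` is a hypothetical helper, and it assumes well-formed input in which every shift-out has a matching shift-in.

```python
SO, SI = 0x0E, 0x0F  # EBCDIC shift-out / shift-in control bytes


def split_mixed(raw: bytes) -> list[tuple[str, bytes]]:
    """Split mixed SBCS/DBCS bytes into ("sbcs" | "dbcs", payload) runs.

    The SO and SI bytes themselves are dropped from the payloads.
    """
    segments = []
    mode, start = "sbcs", 0
    for i, b in enumerate(raw):
        if mode == "sbcs" and b == SO:
            if i > start:
                segments.append(("sbcs", raw[start:i]))
            mode, start = "dbcs", i + 1
        elif mode == "dbcs" and b == SI:
            if (i - start) % 2:
                # DBCS characters are two bytes each, so a run must be even
                raise ValueError(f"odd-length DBCS run ending at offset {i}")
            segments.append(("dbcs", raw[start:i]))
            mode, start = "sbcs", i + 1
    if mode == "dbcs":
        raise ValueError("shift-out without matching shift-in")
    if start < len(raw):
        segments.append(("sbcs", raw[start:]))
    return segments
```

Each shift byte occupies one position in the source record, so a column-aware parser has to count the SO and SI bytes even though they never appear in the decoded text. Turning the `dbcs` payloads into Unicode needs a codec for the site's code page, which is outside the scope of this sketch.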
Industry-Specific Patterns
Japanese enterprise COBOL follows conventions defined by IPA (Information-technology Promotion Agency, Japan) and its SEC (Software Engineering Center):
- ETSS documentation format — standardized program documentation headers
- ESCR naming conventions — variable and paragraph naming rules for maintainability
- Waterfall documentation structure — basic design, detailed design, test specification, and test report templates
Our parser can extract these structural elements from the AST, making it possible to generate IPA/SEC-compliant documentation automatically from source code.
HULFT and DataSpider Integration
Japanese enterprise systems commonly use HULFT (used by 8,700+ companies) for file transfer and DataSpider Servista for system integration. COBOL programs that interact with these systems have specific patterns:
- HULFT-triggered batch jobs defined in JCL
- File format definitions matching HULFT transfer configurations
- DataSpider API calls through COBOL CALL statements
Parsing both the COBOL programs and their associated JCL jobs reveals the complete data flow through HULFT transfer chains — critical information for modernization planning.
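The JCL side of that analysis can be sketched in a few lines of Python. The example below pulls EXEC PGM= program names and DD DSN= dataset names out of a JCL stream and reports datasets that more than one step touches, which is one simple signal of a transfer chain. It is a generic sketch: the function names are hypothetical, it ignores continuation lines, cataloged procedures, and symbolic parameters, and it makes no assumption about how HULFT itself is invoked in a given shop's JCL.

```python
import re

# Step name is optional in JCL, so allow an empty name field.
EXEC_RE = re.compile(r"^//(\S*)\s+EXEC\s+PGM=([A-Z0-9#@$]+)")
DD_RE = re.compile(r"^//(\S+)\s+DD\s+.*?\bDSN(?:AME)?=([A-Z0-9#@$.]+)")


def steps_and_datasets(jcl: str) -> list[dict]:
    """Return one record per EXEC PGM= step with the datasets its DDs name."""
    steps = []
    for line in jcl.splitlines():
        if line.startswith("//*"):  # JCL comment statement
            continue
        m = EXEC_RE.match(line)
        if m:
            steps.append({"step": m.group(1), "program": m.group(2), "datasets": []})
            continue
        m = DD_RE.match(line)
        if m and steps:
            steps[-1]["datasets"].append(m.group(2))
    return steps


def shared_datasets(steps: list[dict]) -> dict:
    """Map each dataset used by more than one step to the programs using it."""
    users = {}
    for s in steps:
        for dsn in s["datasets"]:
            users.setdefault(dsn, []).append(s["program"])
    return {dsn: progs for dsn, progs in users.items() if len(progs) > 1}
```

Given a job in which one step writes a dataset and a later COBOL step reads it, `shared_datasets(steps_and_datasets(jcl))` lists that dataset with both program names in step order. Joining the result against the file definitions found in the COBOL programs is what would tie the two halves of the data flow together.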